Heritrix can be obtained as a packaged binary or source download from the crawler sourceforge home page, or via checkout from archive-crawler.svn.sourceforge.net. See the crawler sourceforge svn page for how to fetch from subversion. The module name to use when checking out Heritrix is ArchiveOpenCrawler, the name Heritrix had before it was called Heritrix.
Note that anonymous access does not give you the current HEAD but a snapshot that can sometimes be up to 24 hours behind HEAD.
The packaged binary is named heritrix-?.?.?.tar.gz (or heritrix-?.?.?.zip) and the packaged source is named heritrix-?.?.?-src.tar.gz (or heritrix-?.?.?-src.zip), where ?.?.? is the Heritrix release version.
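As a small illustration of the naming scheme above, the following shell sketch prints the four archive names for a given release version. The function name heritrix_package_names is a placeholder made up for this example and is not part of the Heritrix distribution.

```shell
# Hypothetical helper (not part of Heritrix): print the archive names
# that the naming scheme above yields for one release version.
#   $1 = release version in the ?.?.? form
heritrix_package_names() {
    version="$1"
    # "" selects the binary packages, "-src" the source packages.
    for suffix in "" "-src"; do
        for ext in tar.gz zip; do
            printf 'heritrix-%s%s.%s\n' "$version" "$suffix" "$ext"
        done
    done
}
```

Calling the function with a version string prints the binary archive names first, followed by the two -src names.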
You can build Heritrix from source using Maven. The Heritrix build has been tested against maven-1.0.2. Do not use Maven 2.x to build Heritrix. See maven.apache.org for how to obtain the binary and set up your maven environment.
In addition to the base maven build, if you want to generate the docbook user and developer manuals, you will need to add the maven sdocbook plugin, which can be found at this page (if the sdocbook plugin is not present, the build skips the docbook manual generation). Be careful not to confuse the 'sdocbook' plugin with the similarly named 'docbook' plugin. The latter converts docbook to xdocs, whereas what is wanted is the former, which converts docbook xml to html. The 'sdocbook' plugin is the one used to generate the user and developer documentation.
Download the plugin jar (as of this writing, it is maven-sdocbook-plugin-1.4.1.jar) and put it into your maven plugins directory, usually at ${MAVEN_HOME}/plugins/ (in earlier versions of maven, pre 1.0.2, plugins are at ${HOME}/.maven/plugins/).
The sdocbook plugin has a dependency on the jimi jar from sun, which you will have to pull down manually and place into your maven repository (it has a sun license you must accept, so maven cannot autodownload it). Download the jimi package and unzip it. Rename the file named JimiProClasses.zip as jimi-1.0.jar and put it into your maven jars repository (usually .maven/repository/jimi/jars; you may have to create the latter directories manually). Maven will be looking for a jar named jimi-1.0.jar. That is why you have to rename the jimi class zip (jars are effectively zips).
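The jimi step above can be sketched in shell as follows. This is a minimal sketch, not a script shipped with Heritrix or Maven: install_jimi is a hypothetical helper, and the default repository root of $HOME/.maven/repository is an assumption based on the usual location mentioned above.

```shell
# Sketch only: copy the Jimi class zip into a Maven 1 repository under
# the name Maven looks for. install_jimi is a hypothetical helper.
#   $1 = path to the unpacked JimiProClasses.zip
#   $2 = Maven repository root (assumed default: $HOME/.maven/repository)
install_jimi() {
    jimi_zip="$1"
    repo_root="${2:-$HOME/.maven/repository}"
    jars_dir="$repo_root/jimi/jars"

    if [ ! -f "$jimi_zip" ]; then
        echo "not found: $jimi_zip" >&2
        return 1
    fi
    # The jimi/jars directories may not exist yet, so create them.
    mkdir -p "$jars_dir" || return 1
    # Copy rather than rename so the downloaded file is left in place.
    cp "$jimi_zip" "$jars_dir/jimi-1.0.jar"
}
```

The helper copies the file instead of renaming it, which keeps the original download around in case the step has to be repeated.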
It may be necessary to alter the sdocbook plugin's default configuration. By default, sdocbook will download the latest version of docbook-xsl. However, sdocbook hardcodes a specific version number for docbook-xsl in its plugin.properties file. If you get an error like "Error while expanding ~/.maven/repository/docbook/zips/docbook-xsl-1.66.1.zip", then you will have to edit sdocbook's properties manually. First determine the version of docbook-xsl that you have by looking in ~/.maven/repository/docbook/zips. Once you have the version number, edit ~/.maven/cache/maven-sdocbook-plugin-1.4/plugin.properties and change the maven.sdocbook.stylesheets.version property to the version that was actually downloaded.
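The manual fix can be sketched in shell as follows. Both functions are hypothetical helpers written for this example. The sketch assumes that the downloaded zip is named docbook-xsl-VERSION.zip, as in the error message quoted above, and that the property is stored as a plain key=value line with no spaces around the equals sign.

```shell
# Hypothetical helpers for the manual sdocbook fix; not part of Maven.

# Print the version embedded in a docbook-xsl zip name,
# e.g. docbook-xsl-1.66.1.zip -> 1.66.1
docbook_xsl_version() {
    name="${1##*/}"              # drop any leading directories
    name="${name#docbook-xsl-}"  # drop the fixed prefix
    printf '%s\n' "${name%.zip}" # drop the extension
}

# Rewrite maven.sdocbook.stylesheets.version in a properties file.
# Assumes the line has the form key=value with no spaces around '='.
#   $1 = properties file, $2 = version to set
set_stylesheets_version() {
    props="$1"
    version="$2"
    key='maven.sdocbook.stylesheets.version'
    # Write to a temporary file and move it into place, since
    # in-place sed options differ between platforms.
    sed "s/^maven\.sdocbook\.stylesheets\.version=.*/$key=$version/" \
        "$props" > "$props.tmp" && mv "$props.tmp" "$props"
}
```

A typical use would pass the zip found in the docbook zips directory to docbook_xsl_version and feed the result, together with the path of the plugin's properties file, to set_stylesheets_version. Other lines in the properties file are left unchanged.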
To build a source checkout with Maven:
% cd CHECKOUT_DIR
% $MAVEN_HOME/bin/maven dist
In the target/distribution subdir, you will find packaged source and binary builds. Run $MAVEN_HOME/bin/maven -g for other Maven possibilities.
See the User Manual [Heritrix User Guide] for how to run the built Heritrix.
The development team uses Eclipse as its development environment. This is of course optional, but for those who want to use Eclipse, the head of the source tree contains Eclipse .project and .classpath configuration files that should make integrating the source checkout into your Eclipse development environment straightforward.
When running directly from checkout directories, rather than from a Maven build, be sure to use a JDK installation (so that JSP pages can compile). You will probably also want to set the 'heritrix.development' property (with the "-Dheritrix.development" VM command-line option) to indicate that certain files are in their development, rather than deployment, locations.
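One way to confirm that a Java installation is a JDK rather than a JRE is to look for the javac compiler under its bin directory. The sketch below does only that. The name is_jdk_home is a hypothetical helper, and locating the installation through a directory such as JAVA_HOME is an assumption made for this example.

```shell
# Hypothetical check: succeed only if the given directory looks like a
# JDK home, i.e. it contains an executable bin/javac.
#   $1 = candidate Java installation directory (for example "$JAVA_HOME")
is_jdk_home() {
    java_home="$1"
    [ -n "$java_home" ] && [ -x "$java_home/bin/javac" ]
}
```

Calling is_jdk_home with the installation directory returns success for a JDK and failure for a JRE-only install or an empty argument, so it can guard a launch script with an ordinary if statement.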
Run the integration self test on the command line by doing the following:
% $HERITRIX_HOME/bin/heritrix --selftest
This will set the crawler going against itself, in particular, the selftest webapp. When done, it runs an analysis of the produced arc files and logs and dumps a ruling into heritrix_out.log. See the org.archive.crawler.selftest package for more on how the selftest works.