2. Installing and running Heritrix

This chapter will explain how to set up Heritrix.

Because Heritrix is a pure Java program, it can (in theory anyway) be run on any platform that has a Java 5.0 VM. However, we are only committed to supporting its operation on Linux, so this chapter only covers setup on that platform. What follows therefore assumes basic Linux administration skills. Other chapters in the user manual are platform agnostic.

This chapter also only covers installing and running the prepackaged binary distributions of Heritrix. For information about downloading and compiling the source see the Developer's Manual.

2.1. Obtaining and installing Heritrix

The packaged binary can be downloaded from the project's SourceForge home page. Each release comes in four flavors: packaged as .tar.gz or .zip, and with or without source.

For installation on Linux get the file heritrix-?.?.?.tar.gz (where ?.?.? is the most recent version number).

The packaged binary comes largely ready to run. Once downloaded it can be untarred into the desired directory.

  % tar xfz heritrix-?.?.?.tar.gz

Once you have downloaded and untarred the correct file you can move on to the next step.

2.1.1. System requirements

2.1.1.1. Java Runtime Environment

The Heritrix crawler is implemented purely in Java. This means that the only true requirement for running it is that you have a JRE installed (building from source requires a JDK).

The Heritrix crawler, since release 1.10.0, makes use of Java 5.0 features, so your JRE must be version 5.0 (1.5.0) or later.
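One way to confirm this requirement is to ask the installed JVM for its version. The following is a minimal sketch, assuming a POSIX shell and a JVM whose "java -version" output contains a quoted version string on stderr; the helper name java_major is made up for this example and is not part of Heritrix.

```shell
#!/bin/sh
# Hypothetical helper: reduce a Java version string to its major number.
# Handles the legacy "1.x" scheme (1.5.0_22 -> 5) and the newer one (11.0.2 -> 11).
java_major() {
    v=$1
    case "$v" in
        1.*) v=${v#1.} ;;      # drop the legacy "1." prefix
    esac
    v=${v%%[!0-9]*}            # keep only the leading digits
    echo "$v"
}

if command -v java >/dev/null 2>&1; then
    # "java -version" writes to stderr; take the first quoted string it prints.
    raw=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    major=$(java_major "$raw")
    if [ -n "$major" ] && [ "$major" -ge 5 ]; then
        echo "Java $raw found: meets the 5.0 minimum"
    else
        echo "Java '$raw' found: too old or not recognized"
    fi
else
    echo "No java on PATH; install a JRE first"
fi
```

If the version string on your JVM is formatted differently, the sed expression will need adjusting.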

We currently include all of the free/open source third-party libraries necessary to run Heritrix in the distribution package. See dependencies for the complete list. Licenses for all of the listed libraries are given in the dependencies section of the raw project.xml at the root of the src download or on SourceForge.

2.1.1.1.1. Installing Java

If you do not have Java installed you can download Java from:

2.1.1.2. Hardware

The default Java heap is 256MB of RAM, which is usually suitable for crawls that range over hundreds of hosts. If you are crawling thousands of hosts or experience Java out-of-memory problems, assign more of your available RAM to the heap; see Section 2.2.1.3, “JAVA_OPTS” for how.

2.1.1.3. Linux

The Heritrix crawler has been built and tested primarily on Linux. It has seen some informal use on Macintosh, Windows 2000 and Windows XP, but it is not tested, packaged, or supported on platforms other than Linux at this time.

2.2. Running Heritrix

To run Heritrix, first do the following:

  % export HERITRIX_HOME=/PATH/TO/BUILT/HERITRIX

...where $HERITRIX_HOME is the location of your untarred heritrix-?.?.?.tar.gz.

Next run:

  % cd $HERITRIX_HOME
  % chmod u+x $HERITRIX_HOME/bin/heritrix
  % $HERITRIX_HOME/bin/heritrix --help

This should give you usage output like the following:

  Usage: heritrix --help
  Usage: heritrix --nowui ORDER.XML
  Usage: heritrix [--port=#] [--run] [--bind=IP,IP...] --admin=LOGIN:PASSWORD \
      [ORDER.XML]
  Usage: heritrix [--port=#] --selftest[=TESTNAME]
  Version: @VERSION@
  Options:
   -b,--bind       Comma-separated list of IP addresses or hostnames for web
                   server to listen on.  Set to / to listen on all available
                   network interfaces.  Default is 127.0.0.1.
   -a,--admin      Login and password for web user interface administration.
                   Required (unless passed via the 'heritrix.cmdline.admin'
                   system property).  Pass value of the form 'LOGIN:PASSWORD'.
   -h,--help       Prints this message and exits.
   -n,--nowui      Put heritrix into run mode and begin crawl using ORDER.XML. Do
                   not put up web user interface.
   -p,--port       Port to run web user interface on.  Default: 8080.
   -r,--run        Put heritrix into run mode. If ORDER.XML begin crawl.
   -s,--selftest   Run the integrated selftests. Pass test name to test it only
                   (Case sensitive: E.g. pass 'Charset' to run charset selftest).
  Arguments:
   ORDER.XML       Crawl order to run.

Launch the crawler with the UI enabled by doing the following:

  % $HERITRIX_HOME/bin/heritrix --admin=LOGIN:PASSWORD

This will start up Heritrix, printing a startup message that looks like the following:

  [b116-dyn-60 619] heritrix-0.4.0 > ./bin/heritrix
  Tue Feb 10 17:03:01 PST 2004 Starting heritrix...
  Tue Feb 10 17:03:05 PST 2004 Heritrix 0.4.0 is running.
  Web UI is at: http://b116-dyn-60.archive.org:8080/admin
  Login and password: admin/letmein

Note

By default, as of version 1.10.x, Heritrix binds to localhost only. This means that you need to be running Heritrix on the same machine as your browser to access the Heritrix UI. Read about the --bind argument above if you need to access the Heritrix UI over a network.
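As an illustration of the note above, the sketch below assembles launch commands that make the web UI reachable from other machines. The --admin, --port and --bind options are taken from the usage output; the helper name heritrix_launch_cmd, the address 192.0.2.10 and the LOGIN:PASSWORD placeholder are assumptions made for this example. The helper only prints the command, so it can be reviewed before being run.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a launch command for the web UI.
#   $1 = LOGIN:PASSWORD   $2 = port   $3 = comma-separated bind list ("/" = all)
heritrix_launch_cmd() {
    home=${HERITRIX_HOME:-/PATH/TO/BUILT/HERITRIX}
    echo "$home/bin/heritrix --admin=$1 --port=$2 --bind=$3"
}

# Listen on loopback plus one example address (assumed value).
heritrix_launch_cmd LOGIN:PASSWORD 8080 127.0.0.1,192.0.2.10

# Listen on all available network interfaces.
heritrix_launch_cmd LOGIN:PASSWORD 8080 /
```

Binding beyond 127.0.0.1 exposes the UI to the network, so it is worth reading Section 2.3 before doing so.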

See Section 3, “Web based user interface” and Section 4, “A quick guide to running your first crawl job” to get your first crawl up and running.

2.2.1. Environment variables

Below are environment variables that affect Heritrix operation.

2.2.1.1. HERITRIX_HOME

Set this environment variable to point at the Heritrix home directory. For example, if you've unpacked Heritrix in your home directory and Heritrix is sitting in the heritrix-1.0.0 directory, you'd set HERITRIX_HOME as follows (assuming your shell is bash):

  % export HERITRIX_HOME=~/heritrix-1.0.0

If you don't set this environment variable, the Heritrix start script makes a guess at the home for Heritrix. It doesn't always guess correctly.
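Rather than relying on the start script's guess, a wrapper can verify the directory before exporting it. The following is a rough sketch, assuming only that a valid home contains bin/heritrix, as the examples in this chapter imply; the function name is_heritrix_home is hypothetical.

```shell
#!/bin/sh
# Hypothetical check: succeed only if the directory contains bin/heritrix.
is_heritrix_home() {
    [ -n "$1" ] && [ -f "$1/bin/heritrix" ]
}

# Use the current value if set, else the example path from the text above.
candidate=${HERITRIX_HOME:-$HOME/heritrix-1.0.0}

if is_heritrix_home "$candidate"; then
    HERITRIX_HOME=$candidate
    export HERITRIX_HOME
    echo "HERITRIX_HOME=$HERITRIX_HOME"
else
    echo "bin/heritrix not found under $candidate" >&2
fi
```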

2.2.1.2. JAVA_HOME

This environment variable may already exist. It should point to the Java installation on the machine. An example of how this might be set (assuming your shell is bash):

  % export JAVA_HOME=/usr/local/java/jre/

2.2.1.3. JAVA_OPTS

Pass options to the Heritrix JVM by populating the JAVA_OPTS environment variable with values. For example, if you want to have Heritrix run with a larger heap, say 512 megs, you could do either of the following (assuming your shell is bash):

  % export JAVA_OPTS="-Xmx512M"
  % $HERITRIX_HOME/bin/heritrix

Or, you could do it all on the one line as follows:

  % JAVA_OPTS="-Xmx512m" $HERITRIX_HOME/bin/heritrix

2.2.2. System properties

Below we document the system properties passed on the command-line that can influence Heritrix's behavior. If you are using the bin/heritrix script to launch Heritrix, you may have to edit it to change or set these properties, or else pass them as part of JAVA_OPTS.
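The JAVA_OPTS route can be sketched as follows. The helper add_system_property is hypothetical, and the /var/tmp/heritrix directory is an assumed value; the property names themselves are documented in the subsections below. The sketch assumes property values contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: append one -Dname=value system property to JAVA_OPTS.
# Values containing spaces are not handled by this simple version.
add_system_property() {
    JAVA_OPTS="${JAVA_OPTS:+$JAVA_OPTS }-D$1=$2"
}

JAVA_OPTS="-Xmx512m"
add_system_property heritrix.properties /tmp/alternate.properties
add_system_property java.io.tmpdir /var/tmp/heritrix    # assumed directory
export JAVA_OPTS

# Show what the launch script would receive.
echo "$JAVA_OPTS"
```

With JAVA_OPTS exported this way, the crawler is then started as usual with $HERITRIX_HOME/bin/heritrix.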

2.2.2.1. heritrix.properties

Set this property to point at an alternate heritrix.properties file (e.g. -Dheritrix.properties=/tmp/alternate.properties) when you want Heritrix to use a properties file other than that found at conf/heritrix.properties.

2.2.2.2. heritrix.context

Provide an alternate context for the Heritrix admin UI. Usually the admin webapp is mounted on root: i.e. '/'.

2.2.2.3. heritrix.development

Set this property when you want to run the crawler from Eclipse. This property takes no arguments. When this property is set, the conf and webapps directories will be found in their development locations and startup messages will show on the text console (standard out).

2.2.2.4. heritrix.home

Where Heritrix is homed. Usually passed by the heritrix launch script.

2.2.2.5. heritrix.out

Where stdout/stderr are sent, usually heritrix_out.log and passed by the heritrix launch script.

2.2.2.6. heritrix.version

Version of Heritrix, written into heritrix.properties by the Heritrix build.

2.2.2.7. heritrix.jobsdir

Where to drop heritrix jobs. Usually empty. Default location is ${HERITRIX_HOME}/jobs.

2.2.2.8. heritrix.conf

Specify an alternate configuration directory other than the default $HERITRIX_HOME/conf.

2.2.2.9. heritrix.cmdline

This set of system properties is rarely used. They are for use when Heritrix has NOT been started from the command-line -- e.g. it's been embedded in another application -- and the startup configuration that is usually set by command-line options instead needs to be done via system properties alone.

2.2.2.9.1. heritrix.cmdline.admin

Value is a colon-delimited string giving the user name and password for the admin GUI, i.e. of the form LOGIN:PASSWORD.

2.2.2.9.2. heritrix.cmdline.nowui

If set to true, will prevent embedded web server crawler control interface from starting up.

2.2.2.9.3. heritrix.cmdline.order

If set to a string file path, will use the specified crawl order XML file.

2.2.2.9.4. heritrix.cmdline.port

Value is the port to run the GUI on.

2.2.2.9.5. heritrix.cmdline.run

If true, crawler is set into run mode on startup.

2.2.2.10. javax.net.ssl.trustStore

Heritrix has its own trust store at conf/heritrix.cacerts that it uses if the FetcherHTTP is configured to use a trust level other than open (open is the default setting). In the unusual case where you'd like to have Heritrix use an alternate truststore, point at the alternate by supplying the JSSE javax.net.ssl.trustStore property on the command line.
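For instance, the property can be supplied through JAVA_OPTS. This is a minimal sketch under the assumption that the alternate truststore lives at /tmp/truststore, a path chosen purely for illustration.

```shell
#!/bin/sh
# Assumed location of the alternate truststore; substitute your own file.
TRUSTSTORE=/tmp/truststore

# Append the JSSE property to whatever JAVA_OPTS already holds.
JAVA_OPTS="${JAVA_OPTS:+$JAVA_OPTS }-Djavax.net.ssl.trustStore=$TRUSTSTORE"
export JAVA_OPTS

# Warn early if the file is missing or unreadable.
[ -r "$TRUSTSTORE" ] || echo "warning: $TRUSTSTORE is not readable" >&2

# Then start the crawler as usual, for example:
#   $HERITRIX_HOME/bin/heritrix --admin=LOGIN:PASSWORD
```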

2.2.2.11. java.util.logging.config.file

The Heritrix conf directory includes a file named heritrix.properties. A section of this file specifies the default Heritrix logging configuration. To override these settings, point java.util.logging.config.file at a properties file with an alternate logging configuration. Below we reproduce the default heritrix.properties for reference:

  # Basic logging setup; to console, all levels
  handlers= java.util.logging.ConsoleHandler
  java.util.logging.ConsoleHandler.level= ALL

  # Default global logging level: only warnings or higher
  .level= WARNING

  # currently necessary (?) for standard logs to work
  crawl.level= INFO
  runtime-errors.level= INFO
  uri-errors.level= INFO
  progress-statistics.level= INFO
  recover.level= INFO

  # HttpClient is too chatty... only want to hear about severe problems
  org.apache.commons.httpclient.level= SEVERE

Here's an example of how you might specify an override:

  % JAVA_OPTS="-Djava.util.logging.config.file=heritrix.properties" \
      ./bin/heritrix --nowui order.xml

Alternatively you could edit the default file.

2.2.2.12. java.io.tmpdir

Specify an alternate tmp directory. Default is /tmp.

2.2.2.13. com.sun.management.jmxremote.port

What port to start up JMX Agent on. Default is 8849. See also the environment variable JMX_PORT.

2.3. Security Considerations

The crawler is a large and active network application which presents security implications, both local to the machine where it operates, and remotely for machines it contacts.

2.3.1. Local to the Crawling Machine

It is important to recognize that the web UI (discussed in Section 3, “Web based user interface”) and JMX agent (discussed in Section 9.5, “Remote Monitoring and Control”) allow remote control of the crawler process in ways that might potentially disrupt a crawl, change the crawler's behavior, read or write locally-accessible files, and perform or trigger other actions in the Java VM or local machine.

The administrative login and password are currently only a very mild protection against unauthorized access, unless you take additional steps to prevent access to the crawler machine. We strongly recommend some combination of the following practices:

First, use network configuration tools, like a firewall, to only allow trusted remote hosts to contact the web UI and, if applicable, JMX agent ports. (The default web UI port is 8080; JMX is 8849.)

Second, use a strong and unique username/password combination to secure the web UI and JMX agent. However, keep in mind that the default administrative web server uses plain HTTP for access, so these values are susceptible to eavesdropping in transit if network links between your browser and the crawler are compromised. (An upcoming update will change the default to HTTPS.) Also, setting the username/password on the command-line may result in their values being visible to other users of the crawling machine, and they are additionally printed to the console and heritrix_out.log for operator reference.

Third, run the crawler as a user with the minimum privileges necessary for its operation, so that in the event of unauthorized access to the web UI or JMX agent, the potential damage is limited.
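As a rough illustration of the first practice, the sketch below prints a set of firewall rules that admit one trusted host to the web UI and JMX ports and drop all other traffic to them. It assumes an iptables-based firewall with no conflicting rules already in place; the function name firewall_rules and the address 192.0.2.10 are placeholders. Nothing is applied: the output is meant to be reviewed and then run as root only if it suits your setup.

```shell
#!/bin/sh
# Hypothetical generator: print iptables rules, do not apply them.
#   $1 = trusted remote host   remaining args = ports to protect
firewall_rules() {
    trusted=$1
    shift
    # Keep loopback traffic working, e.g. a browser on the crawler machine.
    echo "iptables -A INPUT -i lo -j ACCEPT"
    for port in "$@"; do
        echo "iptables -A INPUT -p tcp -s $trusted --dport $port -j ACCEPT"
        echo "iptables -A INPUT -p tcp --dport $port -j DROP"
    done
}

# 8080 and 8849 are the default web UI and JMX ports named above.
firewall_rules 192.0.2.10 8080 8849
```

Rule order matters here: each ACCEPT rule must precede the DROP rule for the same port.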

Successful unauthorized access to the web UI or JMX agent could trivially end or corrupt a crawl, or change the crawler's behavior to be a nuisance to other network hosts. By adjusting configuration paths, unauthorized access could potentially delete, corrupt, or replace files accessible to the crawler process, and thus cause more extensive problems on the crawler machine.

Another potential risk is that some worst-case or maliciously-crafted crawled content might, in combination with crawler bugs, disrupt the crawl or other files or operations of the local system. For example, in the past, even without malicious intent, some rich-media content has caused runaway memory use in 3rd-party libraries used by the crawler, resulting in a memory-exhaustion condition that can stop or corrupt a crawl in progress. Similarly, atypical input patterns have at times caused runaway CPU use by crawler link-extraction regular expressions, severely slowing crawls. Crawl operators should monitor their crawls closely and stay informed via the project discussion list and bug database for any newly discovered similar bugs.