9. Outside the user interface

While it is possible to do a great many things via Heritrix's WUI, it is worth taking a look at some of what is not available in it.

9.1. Generated files

In addition to the logs discussed above (see Section 8.2, “Logs”) the following files are generated. Some of the information in them is also available via the WUI.

9.1.1. heritrix_out.log

Captures what is written to the standard output and standard error streams of the program. Mostly this consists of low-level exceptions (usually indicative of bugs) and some information from third-party modules that do their own output logging.

This file is created in the same directory as the Heritrix JAR file. It is not associated with any one job, but contains output from all jobs run by the crawler.

9.1.2. crawl-manifest.txt

A manifest of all files (excluding ARC and other data files) created while crawling a job.

An example of this file might be:

  L+ /Heritrix/jobs/quickbroad-20040420191411593/disk/crawl.log
  L+ /Heritrix/jobs/quickbroad-20040420191411593/disk/runtime-errors.log
  L+ /Heritrix/jobs/quickbroad-20040420191411593/disk/local-errors.log
  L+ /Heritrix/jobs/quickbroad-20040420191411593/disk/uri-errors.log
  L+ /Heritrix/jobs/quickbroad-20040420191411593/disk/progress-statistics.log
  L- /Heritrix/jobs/quickbroad-20040420191411593/disk/recover.gz
  R+ /Heritrix/jobs/quickbroad-20040420191411593/disk/seeds-report.txt
  R+ /Heritrix/jobs/quickbroad-20040420191411593/disk/hosts-report.txt
  R+ /Heritrix/jobs/quickbroad-20040420191411593/disk/mimetype-report.txt
  R+ /Heritrix/jobs/quickbroad-20040420191411593/disk/responsecode-report.txt
  R+ /Heritrix/jobs/quickbroad-20040420191411593/disk/crawl-report.txt
  R+ /Heritrix/jobs/quickbroad-20040420191411593/disk/processors-report.txt
  C+ /Heritrix/jobs/quickbroad-20040420191411593/job-quickbroad.xml
  C+ /Heritrix/jobs/quickbroad-20040420191411593/settings/org/settings.xml
  C+ /Heritrix/jobs/quickbroad-20040420191411593/seeds-quickbroad.txt

The first character of each line indicates the type of file: L for logs, R for reports, and C for configuration files.

The second character, a plus or minus sign, indicates whether the file should be included in a standard bundle of the job (see Section 9.2.1, “manifest_bundle.pl”). In the example above, recover.gz is marked for exclusion because it is generally only of interest if the job crashes and must be restarted; it has negligible value once the job is completed (see Section 9.3, “Recovery of Frontier State and recover.gz”).

After this two-character legend, the filename with full path follows.

This file is generated at the very end of the crawl, in the directory indicated by the 'disk' attribute of the configuration.

9.1.3. crawl-report.txt

Contains some useful metrics about the completed job. This report is created by the StatisticsTracker (see Section 6.1.4, “Statistics Tracking”).

Written at the very end of the crawl only. See crawl-manifest.txt for its location.

9.1.4. hosts-report.txt

Contains an overview of what hosts were crawled and how many documents and bytes were downloaded from each.

This report is created by the StatisticsTracker (see Section 6.1.4, “Statistics Tracking”) and is written at the very end of the crawl only. See crawl-manifest.txt for its location.

9.1.5. mimetype-report.txt

Contains an overview of the number of documents downloaded per mime type, along with the amount of data downloaded per mime type.

This report is created by the StatisticsTracker (see Section 6.1.4, “Statistics Tracking”) and is written at the very end of the crawl only. See crawl-manifest.txt for its location.

9.1.6. processors-report.txt

Contains the processors report (see Section 7.3.1.3, “Processors report”) generated at the very end of the crawl.

9.1.7. responsecode-report.txt

Contains an overview of the number of documents downloaded per status code (see Status codes). It covers successful codes only; failures are not tallied (see crawl.log for that information).

This report is created by the StatisticsTracker (see Section 6.1.4, “Statistics Tracking”) and is written at the very end of the crawl only. See crawl-manifest.txt for its location.

9.1.8. seeds-report.txt

An overview of the crawling of each seed: whether it succeeded and what status code was returned.

This report is created by the StatisticsTracker (see Section 6.1.4, “Statistics Tracking”) and is written at the very end of the crawl only. See crawl-manifest.txt for its location.

9.1.9. ARC files

Assuming that you are using the ARC writer that comes with Heritrix a number of ARC files will be generated containing the crawled pages.

It is possible to specify the location of these files on the ARCWriter processor on the settings page. Unless it is set as an absolute path, the location is relative to the job directory.

ARC files are named as follows:

  [prefix]-[12-digit-timestamp]-[series#-padded-to-5-digits]-[crawler-hostname].arc.gz

The prefix is set by the user when configuring the ARCWriter processor. By default it is IAH.
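
For example, with the default prefix, the first compressed ARC file written by a crawl might be named as follows (the timestamp, serial number, and hostname shown are purely illustrative):

  IAH-200404201914-00000-crawler.example.org.arc.gz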

If you see an ARC file with an extra .open suffix, the ARC file is currently open and being written to by Heritrix (which usually has more than one ARC file open at a time).

Files with a .invalid suffix are files Heritrix had trouble writing (disk full, bad disk, etc.). On an IOException, Heritrix closes the problematic ARC file and gives it the .invalid suffix. These files need to be checked for coherence.

For more on ARC files, refer to the ARCWriter Javadoc and to the ARC Writer developer documentation.

9.2. Helpful scripts

Heritrix comes bundled with a few helpful scripts for Linux.

9.2.1. manifest_bundle.pl

This script bundles up all resources referenced by a crawl manifest file (Section 9.1.2, “crawl-manifest.txt”). The output bundle is an uncompressed or compressed tar ball. The directory structure created in the tar ball is as follows:

  • Top level directory (crawl name)

  • Three default subdirectories (configuration, logs and reports directories)

  • Any other arbitrary subdirectories

Usage:

  manifest_bundle.pl crawl_name manifest_file [-f output_tar_file] [-z] [ -flag directory]
      -f output tar file. If omitted, output goes to stdout.
      -z compress tar file with gzip.
      -flag is any upper case letter. Default values C, L, and R are mapped to the
       configuration, logs and reports directories.

Example:

  manifest_bundle.pl testcrawl crawl-manifest.txt -f    \
        /0/testcrawl/manifest-bundle.tar.gz -z -F filters

Produced tar ball for this example:

  /0/testcrawl/manifest-bundle.tar.gz

Bundled directory structure for this example:

  |-testcrawl
      |- configurations
      |- logs
      |- reports
      |- filters

9.2.2. hoppath.pl

This Perl script, found in $HERITRIX_HOME/bin, recreates the hop path to the specified URL. The hop path is the sequence of links (URLs) that was followed to reach the specified URL.

Usage:

Usage: hoppath.pl crawl.log URI_PREFIX
  crawl.log    Full-path to Heritrix crawl.log instance.
  URI_PREFIX   URI we're querying about. Must begin 'http(s)://' or 'dns:'.
               Wrap this parameter in quotes to avoid shell interpretation
               of any '&' present in URI_PREFIX.

Example:

% hoppath.pl crawl.log 'http://www.house.gov/'

Result:

  2004-02-25-02-36-06 - http://www.house.gov/house/MemberWWW_by_State.html
  2004-02-25-02-36-06   L http://wwws.house.gov/search97cgi/s97_cgi
  2004-02-25-03-30-38    L http://www.house.gov/

The L in the above example refers to the type of link followed (see Discovery path).

9.2.3. RecoveryLogMapper

org.archive.crawler.util.RecoveryLogMapper is similar to Section 9.2.2, “hoppath.pl”. It was contributed by Mike Schwartz. RecoveryLogMapper parses a Heritrix recovery log file (see Section 9.3, “Recovery of Frontier State and recover.gz”) and builds maps that allow a caller to look up any seed URL and get back an Iterator of all URLs successfully crawled from that seed. It also allows lookup on any crawled URL to find the seed URL from which the crawler reached that URL (through one or more discovered URL hops, which are collapsed in this lookup).

9.2.4. cmdline-jmxclient

This jar file is checked in as a script. It enables command-line control of Heritrix if Heritrix has been started up inside of a SUN 1.5.0 JDK. See the cmdline-jmxclient project to learn more about this script's capabilities and how to use it. See also Section 9.5, “Remote Monitoring and Control”.
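
As a rough illustration (the jar version is left as a placeholder, and the exact arguments should be checked against the cmdline-jmxclient documentation), an invocation against a local Heritrix listening on the default JMX port described in Section 9.5 might look like:

  % java -jar cmdline-jmxclient-X.X.X.jar controlRole:ADMIN_PASSWORD localhost:8849

With no bean named, the client should list the beans registered with the agent; naming a bean and one of its attributes or operations queries or invokes it.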

9.3. Recovery of Frontier State and recover.gz

During normal running, the Heritrix Frontier by default keeps a journal. The journal is kept in the job's logs directory and is named recover.gz. If a crawl crashes, the recover.gz journal can be used to recreate approximately the state of the crawler at the time of the crash. Recovery can take a long time in some cases, but it is usually much quicker than repeating a crawl.

To run the recovery process, relaunch the crashed crawler. Create a new crawl order job based on the crawl that crashed. If you choose the "recover-log" link from the list of completed jobs in the 'Based on recovery' page, the new job will automatically be set up to use the original job's recovery journal to bootstrap its Frontier state (of completed and queued URIs). Further, if the recovered job attempts to reuse any already-full 'logs' or 'state' directories, new paths for these directories will be chosen with as many '-R' suffixes as are necessary to specify a new empty directory.

(If you simply base your new job on the old job without using the 'Recover' link, you must manually enter the full path of the original crawl's recovery journal into the recover-path setting, near the end of all settings. You must also adjust the 'logs' and 'state' directory settings, if they were specified using absolute paths that would cause the new crawl to reuse the directories of the original job.)

After making any further adjustments to the crawl settings, submit the new job. The submission will hang for a long time as the recover.gz file is read in its entirety by the Frontier. (This can take hours for a crawl that has run for a long time, and during this time the crawler control panel will appear idle, with no job pending or in progress, but the machine will be busy.) Eventually the submission and crawl job launch should complete, and the crawl should pick up from close to where the crash occurred. There is no marking in the logs that this crawl was started by reading a recover log, so be sure to note that this was done in the crawl journal.

The recovery log is gzipped because it gets very large otherwise, and because of the repetition of terms it compresses very well. On abnormal termination of the crawl job, gzip will report an unexpected end of file if you try to ungzip the recover.gz file; gzip is complaining that the file write was abnormally terminated. The recover.gz file will still be of use in restoring the frontier, at least up to the point where the gzip file went bad (gzip compresses in 32k blocks, so the worst loss would be the last 32k of gzipped data).

Java's gzip support (up through at least Java 1.5/5.0) can compress arbitrarily large input streams, but has problems decompressing any stream whose uncompressed output is larger than 2GB. Attempting to recover a crawl whose recovery log would be over 2GB when uncompressed triggers a FatalConfigurationException alert with the detail message "Recover.log problem: java.io.IOException: Corrupt GZIP trailer". Heritrix will accept either compressed or uncompressed recovery log files, so a workaround is to first uncompress the recovery log using a non-Java tool (such as the 'gunzip' available in Linux and Cygwin), then refer to the uncompressed recovery log when recovering. (Reportedly, Java 6.0 "Mustang" will fix this Java bug with un-gzipping large files.)
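
For example (the paths below are hypothetical), the journal could be uncompressed up front and the uncompressed file given in the recover-path setting instead of the .gz file:

  % gunzip -c /Heritrix/jobs/myjob/logs/recover.gz > /Heritrix/jobs/myjob/logs/recover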

See also the related facility described in Section 9.4, “Checkpointing”, below, for an alternate recovery mechanism.

9.4. Checkpointing

When checkpointing [Checkpointing], the crawler writes a representation of its current state to a directory under checkpoints-path, named for the checkpoint. Checkpointed state includes serializations of the main crawler objects, copies of the current set of bdbje log files, etc. The idea is that the checkpoint directory contains all that is required to recover a crawler. Checkpointing also rotates off the crawler logs, including the recover.gz log, if enabled. Log files are NOT copied to the checkpoint directory. They are left under the logs directory but are distinguished by a suffix. The suffix is the checkpoint name (e.g. crawl.log.000031, where 000031 is the checkpoint name).

Note

Currently, only the BdbFrontier using the bdbje-based already-seen or the bloom filter already-seen is checkpointable.

To run a checkpoint, click the checkpoint button in the UI or invoke checkpoint via JMX. This launches a thread that runs through the following steps: if crawling, pause the crawl; run the actual checkpoint; and, if the crawler was crawling when the checkpoint was invoked, resume the crawl. Depending on the size of the crawl, checkpointing can take some time; often the step that takes longest is pausing the crawl, waiting on threads to get into a paused, checkpointable state. While checkpointing, the status will show as CHECKPOINTING. When the checkpoint has completed, the crawler will resume crawling (or, if it was in the PAUSED state when checkpointing was invoked, it will return to the PAUSED state).

Recovery from a checkpoint has much in common with recovery of a crawl using the recover.log (see Section 9.3, “Recovery of Frontier State and recover.gz”). To recover, create a job; then, before launching, set crawl-order/recover-path to point at the checkpoint directory you want to recover from. Alternatively, browse to the Jobs->Based on a recovery screen and select the checkpoint you want to recover from. After clicking, a new job will be created that takes the old job's (end-of-crawl) settings and autofills recover-path with the right directory path (the renaming of the logs and crawl-order/state-path "state" directories so they do not clash with the old ones, as described above in Section 9.3, “Recovery of Frontier State and recover.gz”, is also done). The first thing recovery does is copy into place the saved-off bdbje log files. Again, recovery can take time -- an hour or more for a crawl of millions of URIs.

Checkpointing is currently experimental. The recover-log technique is tried-and-true. Once checkpointing is proven reliable, faster, and more comprehensive, it will become the preferred method of recovering a crawler.

9.4.1. Expert mode: Fast checkpointing

The bulk of the time spent checkpointing is taken up copying off the bdbje log files. For example, checkpointing a crawl that had downloaded 18 million items -- it had discovered more than 130 million (bloom filter) -- took about 100 minutes to complete, of which more than 90 minutes were spent copying the ~12k bdbje log files (with only one disk involved). Set the log level on org.archive.util.FileUtils to FINE to watch the bdbje log file copy.
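
Assuming log levels are set in heritrix.properties in the same java.util.logging style as the Checkpointer.level setting mentioned in Section 9.4.2, a line such as the following should do it:

  org.archive.util.FileUtils.level = FINE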

Since copying off bdbje log files can take hours, we've added an expert mode checkpoint that bypasses bdbje log copying. The upside is that your checkpoint completes promptly -- in minutes, even if the crawl is large -- but the downside is that recovery takes more work: to recover from a checkpoint, the bdbje log files need to be manually assembled in the checkpoint bdb-logs subdirectory. You'll know which bdbje log files make up the checkpoint because Heritrix writes the list of the checkpoint's bdbje logs to a file named bdbj-logs-manifest.txt in the checkpoint directory. To prevent bdbje from removing log files that might be needed to assemble a checkpoint made at some time in the past, when running expert mode checkpointing we configure bdbje not to delete logs when it is finished with them; instead, bdbje gives logs it is no longer using a .del suffix. Assembling a checkpoint will often require renaming files with the .del suffix so they have the .jdb suffix, in accordance with the bdbj-logs-manifest.txt list (see below for more on this).

Note

With this expert mode enabled, the crawler's crawl-order/state-path "state" directory will grow without bound; a process external to the crawler can be set up to prune the state directory of .del files referenced only by checkpoints that have since been superseded.

To enable this checkpoint mode that skips copying of bdbje log files, set the new expert mode setting checkpoint-copy-bdbje-logs to false.

To recover using a checkpoint that has everything but the bdbje log files present, you will need to copy all logs listed in bdbj-logs-manifest.txt into the bdbje-logs checkpoint subdirectory. In some cases this will necessitate renaming logs ending in .del so that they instead have the .jdb ending, as suggested above. One thing to watch for is copying too many logs into the bdbje logs subdirectory: the list of logs must match exactly what is in the manifest file, otherwise the recovery will fail (for example, see [1325961] resurrectOneQueueState has keys for items not in allqueues).
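
The sketch below shows one way this assembly might be scripted; it assumes bdbj-logs-manifest.txt lists one .jdb filename per line, and the checkpoint and state directory paths used here are hypothetical:

  CHECKPOINT=/Heritrix/jobs/myjob/checkpoints/checkpt-00001  # hypothetical checkpoint dir
  cd /Heritrix/jobs/myjob/state                              # hypothetical state dir
  while read -r log; do
    if [ -f "$log" ]; then
      cp "$log" "$CHECKPOINT/bdbje-logs/"
    else
      # the live log has been retired by bdbje; restore the .jdb name as it is copied
      cp "${log%.jdb}.del" "$CHECKPOINT/bdbje-logs/$log"
    fi
  done < "$CHECKPOINT/bdbj-logs-manifest.txt"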

On checkpoint recovery, Heritrix copies bdbje log files from the referenced checkpoint's bdb-logs subdirectory to the new crawl's crawl-order/state-path "state" directory. As noted above, this can take some time. Of note, if a bdbje log file already exists in the new crawl's crawl-order/state-path "state" directory, checkpoint recovery will not overwrite the existing bdbje log file. You can exploit this property and save on recovery time by using native unix cp to manually copy the bdbje log files from the checkpoint directory to the new crawl's crawl-order/state-path "state" directory before launching a recovery (or, at the extreme, though it will trash your checkpoint, set the checkpoint's bdb-logs subdirectory as the new crawl's crawl-order/state-path "state" directory).

9.4.2. Automated Checkpointing

To have Heritrix run checkpoints periodically, uncomment (or add) to heritrix.properties a line like:

org.archive.crawler.framework.Checkpointer.period = 2

This installs a timer thread that runs on an interval (units are hours). See heritrix_out.log for a log entry recording installation of the timer thread that runs the periodic checkpoint, and for a log entry every time it runs (assuming org.archive.crawler.framework.Checkpointer.level is set to INFO).
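
Putting this together with the logging level the paragraph above assumes, a heritrix.properties fragment for two-hour checkpoints might read:

  org.archive.crawler.framework.Checkpointer.period = 2
  org.archive.crawler.framework.Checkpointer.level = INFO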

9.5. Remote Monitoring and Control

As of release 1.4.0, Heritrix will start up the JVM's JMX Agent if deployed in a SUN 1.5.0 JVM. It password protects the JMX Agent using whatever was specified as the Heritrix admin password, so to log in you use 'monitorRole' or 'controlRole' as the login and the Heritrix admin password as the password. By default, the JMX Agent is started up on port 8849 (to change any of the JMX settings, set the JMX_OPTS environment variable).
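
As an illustrative sketch (the property shown is the standard SUN 1.5.0 remote-JMX system property; exactly which settings JMX_OPTS is consulted for is an assumption here), the agent could be moved to another port by exporting JMX_OPTS before launching Heritrix:

  % export JMX_OPTS=-Dcom.sun.management.jmxremote.port=9119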

On startup, Heritrix checks whether a JMX Agent is running in the current context and registers itself with the first JMX Agent found, publishing attributes and operations that can be invoked remotely. If running in a SUN 1.5.0 JVM where the JVM JMX Agent has been started, Heritrix will attach to the JVM JMX Agent (if running inside JBoss, Heritrix will register with the JBoss JMX Agent).

To see what Attributes and Operations are available via JMX, use the SUN 1.5.0 JDK jconsole application -- it is in $JAVA_HOME/bin -- or use Section 9.2.4, “cmdline-jmxclient”.

To learn more about SUN 1.5.0 JDK JMX management and jconsole, see Monitoring and Management Using JMX. This O'Reilly article is also a good place to get started: Monitoring Local and Remote Applications Using JMX 1.2 and JConsole.

9.6. Experimental FTP Support

As of release 1.10.0, Heritrix has experimental support for crawling FTP servers. To enable FTP support for your crawls, there is a configuration file change you will have to manually make.

Specifically, you will have to edit the $HERITRIX_HOME/conf/heritrix.properties file: remove ftp from the org.archive.net.UURIFactory.ignored-schemes property list, and add ftp to the org.archive.net.UURIFactory.schemes property list.
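
As an illustrative sketch of the edit (the placeholder values in angle brackets stand in for whatever the property lists already contain), the two lines end up looking like:

  # ftp removed from the ignored list ...
  org.archive.net.UURIFactory.ignored-schemes = <other ignored schemes, without ftp>
  # ... and added to the list of recognized schemes
  org.archive.net.UURIFactory.schemes = <existing schemes>,ftp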

After that change, you should be able to add the FetchFTP processor to your crawl using the Web UI. Just create a new job, click "Modules", and add FetchFTP under "Fetchers."

Note that FetchFTP is a little unusual in that it works both as a fetcher and as an extractor. If an FTP URI refers to a directory, and if FetchFTP's extract-from-dirs property is set to true, then FetchFTP will extract one link for every line of the directory listing. Similarly, if the extract-parent property is true, then FetchFTP will extract the parent directory from every FTP URI it encounters.

Also, remember that FetchFTP is experimental. As of 1.10, FetchFTP has the following known limitations:

  1. FetchFTP can only store directories if the FTP server supports the NLIST command. Some older systems may not support NLIST.
  2. FetchFTP uses passive mode transfers so that it can work from behind firewalls. Not all FTP servers support passive mode, however.
  3. Heritrix currently has no means of determining the mime type of a document unless an HTTP server explicitly mentions one. Since FTP has no equivalent mechanism, all documents retrieved using FetchFTP have a mime type of no-type.
  4. In the absence of a mime type, many of the postprocessors will not work. For instance, HTMLExtractor will not extract links from an HTML file fetched with FetchFTP.

Still, FetchFTP can be used to archive an FTP directory of tarballs, for instance. If you discover any additional problems using FetchFTP, please inform the mailing list.

9.7. Duplication Reduction Processors

Starting in release 1.12.0, a number of Processors can cooperate to carry forward URI content history information between crawls, reducing the amount of duplicate material downloaded or stored in later crawls. For more information, see the project wiki's notes on using the new duplication-reduction functionality.