While a great many things can be done via Heritrix's WUI, it is worth taking a look at some of what is not available in it.
In addition to the logs discussed above (see Section 8.2, “Logs”), the following files are generated. Some of the information in them is also available via the WUI.
Captures what is written to the program's standard output and standard error streams. Mostly this consists of low-level exceptions (usually indicative of bugs), plus some output from third-party modules that do their own logging.
This file is created in the same directory as the Heritrix JAR file. It is not associated with any one job, but contains output from all jobs run by the crawler.
A manifest of all files (excluding ARC and other data files) created while crawling a job.
An example of this file might be:
L+ /Heritrix/jobs/quickbroad-20040420191411593/disk/crawl.log
L+ /Heritrix/jobs/quickbroad-20040420191411593/disk/runtime-errors.log
L+ /Heritrix/jobs/quickbroad-20040420191411593/disk/local-errors.log
L+ /Heritrix/jobs/quickbroad-20040420191411593/disk/uri-errors.log
L+ /Heritrix/jobs/quickbroad-20040420191411593/disk/progress-statistics.log
L- /Heritrix/jobs/quickbroad-20040420191411593/disk/recover.gz
R+ /Heritrix/jobs/quickbroad-20040420191411593/disk/seeds-report.txt
R+ /Heritrix/jobs/quickbroad-20040420191411593/disk/hosts-report.txt
R+ /Heritrix/jobs/quickbroad-20040420191411593/disk/mimetype-report.txt
R+ /Heritrix/jobs/quickbroad-20040420191411593/disk/responsecode-report.txt
R+ /Heritrix/jobs/quickbroad-20040420191411593/disk/crawl-report.txt
R+ /Heritrix/jobs/quickbroad-20040420191411593/disk/processors-report.txt
C+ /Heritrix/jobs/quickbroad-20040420191411593/job-quickbroad.xml
C+ /Heritrix/jobs/quickbroad-20040420191411593/settings/org/settings.xml
C+ /Heritrix/jobs/quickbroad-20040420191411593/seeds-quickbroad.txt
The first character of each line indicates the type of file: L for logs, R for reports, and C for configuration files. The second character, a plus or minus sign, indicates whether the file should be included in a standard bundle of the job (see Section 9.2.1, “manifest_bundle.pl”). In the example above, recover.gz is marked for exclusion because it is generally only of interest if the job crashes and must be restarted; it has negligible value once the job is completed (see Section 9.3, “Recovery of Frontier State and recover.gz”). After this initial legend, the filename with full path follows.
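Because the legend is columnar, the manifest lends itself to simple scripted post-processing. The following sketch uses a shortened, made-up manifest to pull out the log files marked for bundling:

```shell
# Write a shortened, illustrative manifest (paths are made up).
cat > crawl-manifest.txt <<'EOF'
L+ /Heritrix/jobs/example/disk/crawl.log
L- /Heritrix/jobs/example/disk/recover.gz
R+ /Heritrix/jobs/example/disk/hosts-report.txt
C+ /Heritrix/jobs/example/job-example.xml
EOF

# Print every log file (type L) marked for inclusion in a bundle (+).
awk '$1 == "L+" { print $2 }' crawl-manifest.txt
```

The same pattern works for reports (R+) or configuration files (C+) by changing the matched legend.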
This file is generated at the very end of the crawl, in the directory indicated by the 'disk' attribute of the configuration.
Contains some useful metrics about the completed job. This report is created by the StatisticsTracker (see Section 6.1.4, “Statistics Tracking”) and is written at the very end of the crawl only. See crawl-manifest.txt for its location.
Contains an overview of what hosts were crawled and how many documents and bytes were downloaded from each. This report is created by the StatisticsTracker (see Section 6.1.4, “Statistics Tracking”) and is written at the very end of the crawl only. See crawl-manifest.txt for its location.
Contains an overview of the number of documents downloaded per mime type, as well as the amount of data downloaded per mime type. This report is created by the StatisticsTracker (see Section 6.1.4, “Statistics Tracking”) and is written at the very end of the crawl only. See crawl-manifest.txt for its location.
Contains the processors report (see Section 7.3.1.3, “Processors report”) generated at the very end of the crawl.
Contains an overview of the number of documents downloaded per status code (see Status codes). It covers successful codes only; failures are not tallied here (see crawl.log for that information). This report is created by the StatisticsTracker (see Section 6.1.4, “Statistics Tracking”) and is written at the very end of the crawl only. See crawl-manifest.txt for its location.
An overview of the crawling of each seed: whether it succeeded or not, and what status code was returned. This report is created by the StatisticsTracker (see Section 6.1.4, “Statistics Tracking”) and is written at the very end of the crawl only. See crawl-manifest.txt for its location.
Assuming that you are using the ARC writer that comes with Heritrix a number of ARC files will be generated containing the crawled pages.
The location of these files can be specified on the ARCWriter processor in the settings page. Unless it is set as an absolute path, the location is relative to the job directory.
ARC files are named as follows:
[prefix]-[12-digit-timestamp]-[series#-padded-to-5-digits]-[crawler-hostname].arc.gz
The prefix is set by the user when configuring the ARCWriter processor. By default it is IAH.
If you see an ARC file with an extra .open suffix, it means the ARC is currently in use, being written to by Heritrix (Heritrix usually has more than one ARC open at a time).
Files with a .invalid suffix are files Heritrix had trouble writing to (disk full, bad disk, etc.). On IOException, Heritrix closes the problematic ARC and gives it the .invalid suffix. These files need to be checked for coherence.
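A quick way to spot such leftovers after a crawl has ended is a find over the ARC output directory. The file names below are invented for illustration, following the naming convention above:

```shell
# Create a scratch directory mimicking a crawl's ARC output
# (file names are illustrative, not from a real crawl).
mkdir -p arcs
touch arcs/IAH-200404201914-00000-crawler.example.org.arc.gz
touch arcs/IAH-200404201914-00001-crawler.example.org.arc.gz.open
touch arcs/IAH-200404201914-00002-crawler.example.org.arc.gz.invalid

# Once the crawl has finished, any hits here deserve a closer look:
# .open files were never closed; .invalid files hit write errors.
find arcs -name '*.open' -o -name '*.invalid'
```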
For more on ARC files refer to the ARCwriter Javadoc and to the ARC Writer developer documentation.
Heritrix comes bundled with a few helpful scripts for Linux.
This script will bundle up all resources referenced by a crawl manifest file (Section 9.1.2, “crawl-manifest.txt”). The output bundle is an uncompressed or gzip-compressed tar ball. The directory structure created in the tar ball is as follows:
Top level directory (crawl name)
Three default subdirectories (configuration, logs and reports directories)
Any other arbitrary subdirectories
Usage:
manifest_bundle.pl crawl_name manifest_file [-f output_tar_file] [-z] [-flag directory]
  -f     output tar file. If omitted, output to stdout.
  -z     compress tar file with gzip.
  -flag  any upper case letter. Default values C, L, and R are set to
         configuration, logs and reports.
Example:
manifest_bundle.pl testcrawl crawl-manifest.txt -f \
    /0/testcrawl/manifest-bundle.tar.gz -z -F filters
Produced tar ball for this example: /0/testcrawl/manifest-bundle.tar.gz
Bundled directory structure for this example:
|-testcrawl
   |- configurations
   |- logs
   |- reports
   |- filters
This perl script, found in $HERITRIX_HOME/bin, recreates the hop path to the specified URL. The hop path is the path of links (URLs) that was followed to reach the specified URL.
Usage:
Usage: hoppath.pl crawl.log URI_PREFIX
  crawl.log   Full path to a Heritrix crawl.log instance.
  URI_PREFIX  URI we're querying about. Must begin 'http(s)://' or 'dns:'.
              Wrap this parameter in quotes to avoid shell interpretation
              of any '&' present in URI_PREFIX.
Example:
% hoppath.pl crawl.log 'http://www.house.gov/'
Result:
2004-02-25-02-36-06 - http://www.house.gov/house/MemberWWW_by_State.html
2004-02-25-02-36-06 L http://wwws.house.gov/search97cgi/s97_cgi
2004-02-25-03-30-38 L http://www.house.gov/
The L in the above example refers to the type of link followed (see Discovery path).
org.archive.crawler.util.RecoveryLogMapper is similar to Section 9.2.2, “hoppath.pl”. It was contributed by Mike Schwartz. RecoveryLogMapper parses a Heritrix recovery log file (see Section 9.3, “Recovery of Frontier State and recover.gz”) and builds maps that allow a caller to look up any seed URL and get back an Iterator of all URLs successfully crawled from that seed. It also allows lookup on any crawled URL to find the seed URL from which the crawler reached that URL (through one or more discovered URL hops, which are collapsed in this lookup).
This jar file is checked in as a script. It enables command-line control of Heritrix if Heritrix has been started up inside of a SUN 1.5.0 JDK. See the cmdline-jmxclient project to learn more about this script's capabilities and how to use it. See also Section 9.5, “Remote Monitoring and Control”.
During normal running, the Heritrix Frontier by default keeps a journal. The journal is kept in the job's logs directory and is named recover.gz. If a crawl crashes, the recover.gz journal can be used to recreate, approximately, the status of the crawler at the time of the crash. Recovery can take a long time in some cases, but is usually much quicker than repeating a crawl.
To run the recovery process, relaunch the crashed crawler. Create a new crawl order job based on the crawl that crashed. If you choose the "recover-log" link from the list of completed jobs in the 'Based on recovery' page, the new job will automatically be set up to use the original job's recovery journal to bootstrap its Frontier state (of completed and queued URIs). Further, if the recovered job attempts to reuse any already-full 'logs' or 'state' directories, new paths for these directories will be chosen with as many '-R' suffixes as are necessary to specify a new empty directory.
(If you simply base your new job on the old job without using the 'Recover' link, you must manually enter the full path of the original crawl's recovery journal into the recover-path setting, near the end of all settings. You must also adjust the 'logs' and 'state' directory settings if they were specified using absolute paths that would cause the new crawl to reuse the directories of the original job.)
After making any further adjustments to the crawl settings, submit the new job. The submission will hang for a long time as the recover.gz file is read in its entirety by the frontier. (This can take hours for a crawl that has run for a long time, and during this time the crawler control panel will appear idle, with no job pending or in progress, although the machine will be busy.) Eventually the submission and crawl job launch should complete, and the crawl should pick up from close to where the crash occurred. There is no marking in the logs that this crawl was started by reading a recover log (be sure to note in the crawl journal that this was done).
The recovery log is gzipped because it otherwise gets very large, and because of the repetition of terms it compresses very well. On abnormal termination of the crawl job, if you look at the recover.gz file with gzip, gzip will report "unexpected end of file" when you try to ungzip it: gzip is complaining that the file write was abnormally terminated. But the recover.gz file will still be of use in restoring the frontier, at least up to the point where the gzip file went bad (gzip compresses in 32k blocks; the worst loss would be the last 32k of gzipped data).
Java's gzip support (up through at least Java 1.5/5.0) can compress arbitrarily large input streams, but has problems decompressing any stream whose output is larger than 2GB. Attempting to recover a crawl whose recovery log uncompresses to more than 2GB triggers a FatalConfigurationException alert with the detail message "Recover.log problem: java.io.IOException: Corrupt GZIP trailer". Heritrix will accept either compressed or uncompressed recovery log files, so a workaround is to first uncompress the recovery log using a non-Java tool (such as the 'gunzip' available in Linux and Cygwin), then refer to this uncompressed recovery log when recovering. (Reportedly, Java 6.0 "Mustang" will fix this Java bug with un-gzipping large files.)
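The workaround might look like the following sketch. The journal content and file names here are tiny illustrative stand-ins, not real crawl data:

```shell
# Stand-in for a large recovery journal (the 'F+ uri' line is only
# illustrative of the journal's line-oriented format).
printf 'F+ http://example.com/\n' | gzip > recover.gz

# Uncompress outside Java; Heritrix accepts a plain-text journal too.
gzip -dc recover.gz > recover.log
# ...then point the new job's recover-path setting at recover.log
```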
See also below, the related recovery facility, Section 9.4, “Checkpointing”, for an alternate recovery mechanism.
When checkpointing [Checkpointing], the crawler writes a representation of its current state to a directory under checkpoints-path, named for the checkpoint.
Checkpointed state includes serialization of the main crawler objects, copies of the current set of bdbje log files, etc. The idea is that the checkpoint directory contains all that is required to recover a crawler. Checkpointing also rotates off the crawler logs, including the recover.gz log, if enabled. Log files are NOT copied to the checkpoint directory; they are left under the logs directory but are distinguished by a suffix. The suffix is the checkpoint name (e.g. crawl.log.000031, where 000031 is the checkpoint name).
Currently, only the BdbFrontier using the bdbje-based already-seen or the bloom filter already-seen is checkpointable.
To run a checkpoint, click the checkpoint button in the UI or invoke checkpoint via JMX. This launches a thread that runs through the following steps: if crawling, pause the crawl; run the actual checkpoint; and, if the crawler was crawling when the checkpoint was invoked, resume the crawl. Depending on the size of the crawl, checkpointing can take some time; often the step that takes longest is pausing the crawl, waiting for threads to get into a paused, checkpointable state. While checkpointing, the status will show as CHECKPOINTING. When the checkpoint has completed, the crawler will resume crawling (or, if it was in the PAUSED state when checkpointing was invoked, it will return to the PAUSED state).
Recovery from a checkpoint has much in common with the recovery of a crawl using the recover.log (see Section 9.3, “Recovery of Frontier State and recover.gz”). To recover, create a job. Then, before launching, set crawl-order/recover-path to point at the checkpoint directory you want to recover from. Alternatively, browse to the Jobs->Based on a recovery screen and select the checkpoint you want to recover from. After clicking, a new job will be created that takes the old job's (end-of-crawl) settings and autofills the recover-path with the right directory path (the renaming of the logs and crawl-order/state-path "state" dirs so they do not clash with the old, as described above in Section 9.3, “Recovery of Frontier State and recover.gz”, is also done). The first thing recovery does is copy into place the saved-off bdbje log files. Again, recovery can take time: an hour or more for a crawl of millions.
Checkpointing is currently experimental. The recover-log technique is tried-and-true. Once checkpointing is proven reliable, faster, and more comprehensive, it will become the preferred method of recovering a crawler.
The bulk of checkpointing time is taken up copying off the bdbje logs. For example, checkpointing a crawl that had downloaded 18 million items (it had discovered more than 130 million, using a bloom filter) took about 100 minutes to complete, of which 90-plus minutes were spent copying the ~12k bdbje log files (only one disk was involved). Set the log level on org.archive.util.FileUtils to FINE to watch the bdbje log file-copy.
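In heritrix.properties this might look like the line below; the .level suffix convention is an assumption based on the Checkpointer.level property used elsewhere in this chapter:

```
# Raise org.archive.util.FileUtils logging to FINE to watch the copy.
org.archive.util.FileUtils.level = FINE
```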
Since copying off bdbje log files can take hours, we have added an expert mode checkpoint that bypasses the bdbje log copying. The upside is that your checkpoint completes promptly, in minutes even if the crawl is large; the downside is that recovery takes more work: to recover from a checkpoint, the bdbje log files need to be manually assembled in the checkpoint's bdb-logs subdirectory. You will know which bdbje log files make up the checkpoint because Heritrix writes the checkpoint's list of bdbje logs into the checkpoint directory, to a file named bdbj-logs-manifest.txt. To prevent bdbje removing log files that might be needed to assemble a checkpoint made at some time in the past, when running expert mode checkpointing we configure bdbje not to delete logs when it is finished with them; instead, bdbje gives logs it is no longer using a .del suffix. Assembling a checkpoint will often require renaming files with the .del suffix so they have the .jdb suffix, in accordance with the bdbj-logs-manifest.txt list (see below for more on this).
With this expert mode enabled, the crawler's crawl-order/state-path "state" directory will grow without bound; a process external to the crawler can be set up to prune the state directory of .del files referenced only by checkpoints that have since been superseded.
To enable the no-files-copy checkpoint, set the new expert mode setting checkpoint-copy-bdbje-logs to false.
To recover using a checkpoint that has all but the bdbje log files present, you will need to copy all logs listed in bdbj-logs-manifest.txt to the bdb-logs checkpoint subdirectory. In some cases this will necessitate renaming logs with the .del suffix to instead have the .jdb ending, as suggested above. One thing to watch for is copying too many logs into the bdb-logs subdirectory. The list of logs must match exactly what is in the manifest file; otherwise the recovery will fail (for example, see [1325961] resurrectOneQueueState has keys for items not in allqueues).
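The assembly step might be scripted along these lines. All file names, and the assumption that each manifest entry maps onto a .del file in the state directory, are illustrative:

```shell
# Stand-ins: a state directory with retired bdbje logs, and a manifest
# listing the two logs that make up the checkpoint (names made up).
mkdir -p state bdb-logs
touch state/00000001.del state/00000002.del state/00000003.del
printf '00000001.jdb\n00000002.jdb\n' > bdbj-logs-manifest.txt

# Copy exactly the logs the manifest names, restoring the .jdb suffix.
# Copying any extra files would make the recovery fail.
while read jdb; do
  cp "state/${jdb%.jdb}.del" "bdb-logs/$jdb"
done < bdbj-logs-manifest.txt
```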
On checkpoint recovery, Heritrix copies bdbje log files from the referenced checkpoint's bdb-logs subdirectory to the new crawl's crawl-order/state-path "state" directory. As noted above, this can take some time. Of note, if a bdbje log file already exists in the new crawl's crawl-order/state-path "state" directory, checkpoint recovery will not overwrite the existing bdbje log file. Exploit this property, and save on recovery time, by using native unix cp to manually copy bdbje log files from the checkpoint directory to the new crawl's crawl-order/state-path "state" directory before launching a recovery (or, at the extreme, though it will trash your checkpoint, set the checkpoint's bdb-logs subdirectory as the new crawl's crawl-order/state-path "state" directory).
To have Heritrix run a checkpoint periodically, uncomment (or add) to heritrix.properties a line like:

org.archive.crawler.framework.Checkpointer.period = 2

This installs a timer thread that runs on an interval (units are hours). See heritrix_out.log for a record of the timer thread's installation, and for a log entry every time it runs (assuming org.archive.crawler.framework.Checkpointer.level is set to INFO).

As of release 1.4.0, Heritrix will start up the JVM's JMX Agent if deployed in a SUN 1.5.0 JVM. It password-protects the JMX Agent using whatever was specified as the Heritrix admin password, so to log in you use 'monitorRole' or 'controlRole' as the login and the Heritrix admin password as the password. By default, the JMX Agent is started on port 8849 (to change any of the JMX settings, set the JMX_OPTS environment variable).
On startup, Heritrix checks whether any JMX Agent is running in the current context and registers itself with the first JMX Agent found, publishing attributes and operations that can be invoked remotely. If running in a SUN 1.5.0 JVM where the JVM JMX Agent has been started, Heritrix will attach to the JVM JMX Agent (if running inside JBoss, Heritrix will register with the JBoss JMX Agent).
To see what attributes and operations are available via JMX, use the SUN 1.5.0 JDK jconsole application (it is in $JAVA_HOME/bin) or use Section 9.2.4, “cmdline-jmxclient”.
To learn more about SUN 1.5.0 JDK JMX management and jconsole, see Monitoring and Management Using JMX. This O'Reilly article is also a good place to get started: Monitoring Local and Remote Applications Using JMX 1.2 and JConsole.
As of release 1.10.0, Heritrix has experimental support for crawling FTP servers. To enable FTP support for your crawls, there is a configuration file change you will have to manually make.
Specifically, you will have to edit the $HERITRIX_HOME/conf/heritrix.properties file. Remove ftp from the org.archive.net.UURIFactory.ignored-schemes property list, and add ftp to the org.archive.net.UURIFactory.schemes property list.
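After the edit, the two property lines might look something like the following; the other scheme names shown are illustrative placeholders, not the shipped defaults:

```
# ftp removed from the ignored list...
org.archive.net.UURIFactory.ignored-schemes = mailto,clsid
# ...and added to the recognized schemes list.
org.archive.net.UURIFactory.schemes = http,https,dns,ftp
```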
After that change, you should be able to add the FetchFTP processor to your crawl using the Web UI. Just create a new job, click "Modules", and add FetchFTP under "Fetchers."
Note that FetchFTP is a little unusual in that it works both as a fetcher and as an extractor. If an FTP URI refers to a directory, and if FetchFTP's extract-from-dirs property is set to true, then FetchFTP will extract one link for every line of the directory listing. Similarly, if the extract-parent property is true, then FetchFTP will extract the parent directory from every FTP URI it encounters.
Also, remember that FetchFTP is experimental. As of 1.10, FetchFTP has the following known limitations:

Directory listings are fetched using the NLIST command. Some older systems may not support NLIST.

Documents fetched over FTP are recorded with a content type of no-type.
Still, FetchFTP can be used to archive an FTP directory of tarballs, for instance. If you discover any additional problems using FetchFTP, please inform the <archive-crawler@yahoogroups.com> mailing list.
Starting in release 1.12.0, a number of Processors can cooperate to carry forward URI content history information between crawls, reducing the amount of duplicate material downloaded or stored in later crawls. For more information, see the project wiki's notes on using the new duplication-reduction functionality.