Heritrix offers several facilities for examining the details of a crawl. The reports and logs are also available at run time.
In the Jobs tab (and page headers) is a listing of how many completed jobs there are along with a link to a page that lists them.
The following information / options are provided for each completed job:
Each job has a unique (generated) ID. This is actually a time stamp. It differentiates jobs with the same name from one another.
This ID is used (among other things) for creating the job's directory on disk.
The name that the user gave the job.
Status of the job. Indicates how it ended.
In addition, the following options are available for each job.
Opens the job's actual XML configuration file in a separate window. Generally only of interest to advanced users.
Takes the user to the job's Crawl report (Section 8.3.1, “Crawl report”).
Takes the user to the job's Seeds report (Section 8.3.2, “Seeds report”).
Displays the job's seeds file.
Takes the user to the job's logs (Section 8.2, “Logs”).
Takes the user to the Journal page for the job (Section 7.4.1, “Journal”). Users can still add entries to it.
Marks the job as deleted. This will remove it from the WUI but not from disk.
It is not possible to directly access the configuration of completed jobs in the same way as for new, pending, and running jobs. Instead, users can look at the actual XML configuration file or create a new job based on the old one. The new job (which need never be run) will perfectly mirror the settings of the old one.
Heritrix writes several logs as it crawls a job. Each crawl job has its own set of these logs.
The location where logs are written can be configured (an expert setting). Otherwise, refer to the crawl manifest (Section 9.1.2, “crawl-manifest.txt”) for the on-disk location of the logs.
Logs can be manually rotated. Pause the crawl and a Rotate Logs link will appear at the base of the screen. Clicking Rotate Logs moves aside all current crawl logs, appending a 14-digit GMT timestamp to the moved-aside files. New log files are opened for the crawler to use in subsequent crawling.
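As a rough illustration of the rotation suffix described above: a 14-digit GMT timestamp has the digits-only form yyyyMMddHHmmss. The sketch below (a plain Python illustration, not Heritrix code) produces a stamp of that shape:

```python
from datetime import datetime, timezone

def rotation_stamp(now=None):
    """Build a 14-digit GMT timestamp (yyyyMMddHHmmss) of the kind
    appended to rotated log file names."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("%Y%m%d%H%M%S")

print(rotation_stamp(datetime(2004, 7, 21, 23, 29, 40)))  # 20040721232940
```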
The WUI offers users four ways of viewing these logs:
View a section of a log starting at a given line number, plus the X lines that follow it. X is configurable and defaults to 50.
View a section of a log starting at a given time stamp, plus the X lines that follow it. X is configurable and defaults to 50. The format of the time stamp is the same as in the logs (YYYY-MM-DDTHH:MM:SS.SSS). It is not necessary to supply more detail than desired. For instance, the entry 2004-04-25T08 will match the first entry made after 8 am on the 25th of April, 2004.
Filter the log by a regular expression. Only matching lines (and, optionally, any indented lines that follow them, which usually means they are related to the preceding entry) are displayed.
This can be an expensive operation on very large logs, requiring a lot of time for the page to load.
View just the last X lines of the given log. X is configurable and defaults to 50.
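The regular-expression view's behavior of keeping matching lines plus their indented follow-on lines can be sketched as follows. This is an illustration of the described behavior only, assuming nothing about Heritrix's actual implementation; the sample log lines are invented:

```python
import re

def filter_log(lines, pattern, include_indented=True):
    """Yield lines matching pattern and, optionally, the indented
    lines that immediately follow a match (usually related detail)."""
    rx = re.compile(pattern)
    in_match = False
    for line in lines:
        if rx.search(line):
            in_match = True
            yield line
        elif include_indented and in_match and line[:1] in (" ", "\t"):
            yield line  # indented continuation of the matched entry
        else:
            in_match = False

log = [
    "2004-04-25T08:00:01.123Z SEVERE broken pipe",
    "    at org.example.Foo.bar()",
    "2004-04-25T08:00:02.456Z INFO all well",
]
print(list(filter_log(log, "SEVERE")))
```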
Each URI that Heritrix attempts to fetch gets an entry in the crawl.log, regardless of success or failure.
Below is a two-line extract from a crawl.log:
2004-07-21T23:29:40.438Z 200 310 http://127.0.0.1:9999/selftest/Charset/charsetselftest_end.html LLLL http://127.0.0.1:9999/selftest/Charset/shiftjis.jsp text/html #000 20040721232940401+10 M77KNTBZH2IU6V2SIG5EEG45EJICNQNM -
2004-07-21T23:29:40.502Z 200 225 http://127.0.0.1:9999/selftest/MaxLinkHops/5.html LLLLL http://127.0.0.1:9999/selftest/MaxLinkHops/4.html text/html #000 20040721232940481+12 M77KNTBZH2IU6V2SIG5EEG45EJICNQNM -
The 1st column is a timestamp in ISO8601 format, to millisecond resolution. The time is the instant of logging. The 2nd column is the fetch status code. Usually this is the HTTP status code but it can also be a negative number if URL processing was unexpectedly terminated. See Status codes for a listing of possible values.
The 3rd column is the size of the downloaded document in bytes. For HTTP, this is the size of the content only; it excludes the size of the HTTP response headers. For DNS, it is the total size of the DNS response. The 4th column is the URI of the document downloaded. The 5th column holds breadcrumb codes showing the trail of downloads that led to the current URI. See Discovery path for a description of the possible code values. The 6th column holds the URI that immediately referenced this URI (the 'referrer'). Both of the latter two fields -- the discovery path and the referrer URI -- will be empty for seed URIs and the like.
The 7th column holds the document MIME type, the 8th column the ID of the worker thread that downloaded the document, and the 9th column a timestamp (in RFC 2550/ARC condensed digits-only format) indicating when the network fetch began, followed, if appropriate, by the millisecond duration of the fetch, separated from the begin time by a '+' character.
The 10th column is a SHA1 digest of the content only (headers are not digested). The 11th column is the 'source tag' inherited by this URI, if that feature is enabled. Finally, the 12th column holds 'annotations', if any have been set. Possible annotations include: the number of times the URI was tried (this field is '-' if the download was never retried); the literal lenTrunc if the download was truncated because it exceeded configured limits; timeTrunc if the download was truncated because the download time exceeded configured limits; or midFetchTrunc if a midfetch filter determined the download should be truncated.
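Putting the column descriptions above together, a crawl.log entry can be split into named fields. The sketch below is illustrative only: the field names are descriptive choices, not Heritrix identifiers, and the sample line is the first line of the extract above.

```python
# Columns of a crawl.log entry, in order; names are descriptive only.
CRAWL_LOG_FIELDS = [
    "timestamp", "status", "size", "uri", "discovery_path", "referrer",
    "mime_type", "thread", "fetch_begin_and_duration", "digest",
    "source_tag", "annotations",
]

def parse_crawl_log_line(line):
    # Note: when the source-tag feature is disabled, a line has fewer
    # columns and this naive positional mapping may mislabel the
    # trailing fields; zip simply stops at the shorter sequence.
    return dict(zip(CRAWL_LOG_FIELDS, line.split()))

entry = parse_crawl_log_line(
    "2004-07-21T23:29:40.438Z 200 310 "
    "http://127.0.0.1:9999/selftest/Charset/charsetselftest_end.html LLLL "
    "http://127.0.0.1:9999/selftest/Charset/shiftjis.jsp text/html #000 "
    "20040721232940401+10 M77KNTBZH2IU6V2SIG5EEG45EJICNQNM -"
)
print(entry["status"], entry["mime_type"])  # 200 text/html
```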
Errors that occur while processing a URI and that can be handled by the processors (usually network-related problems fetching the document) are logged here.
Generally these can be safely ignored, but can provide insight to advanced users when other logs and/or reports have unusual data.
This log is written by the statistics tracking module (Section 6.1.4, “Statistics Tracking”).
At configurable intervals a line about the progress of the crawl is written to this file.
The legends are as follows:
Timestamp indicating when the line was written, in ISO8601 format.
Number of URIs discovered to date.
Number of URIs queued at the moment.
Number of URIs downloaded to date.
Number of documents downloaded per second since the last snapshot; the figure in parentheses is the rate since the crawl began.
Amount in kilobytes downloaded per second since the last snapshot; the figure in parentheses is the rate since the crawl began.
Number of URIs that Heritrix has failed to download to date.
Number of toe threads currently busy processing a URI.
Amount of memory currently assigned to the Java Virtual Machine.
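A line of this log can be read back into the fields described by the legend above. The sketch below assumes the columns appear in legend order and that each rate is written as current(average); the sample line is invented for illustration, not taken from a real crawl:

```python
def parse_progress_line(line):
    """Parse one progress-statistics line, assuming legend order."""
    (ts, discovered, queued, downloaded,
     doc_rate, kb_rate, failed, busy, mem_kb) = line.split()

    def split_rate(token):
        # "1.2(1.5)" -> (1.2, 1.5): rate since the last snapshot,
        # then the rate since the crawl began.
        current, average = token.rstrip(")").split("(")
        return float(current), float(average)

    return {
        "timestamp": ts,
        "discovered": int(discovered),
        "queued": int(queued),
        "downloaded": int(downloaded),
        "docs_per_sec": split_rate(doc_rate),
        "kb_per_sec": split_rate(kb_rate),
        "failed": int(failed),
        "busy_threads": int(busy),
        "memory_kb": int(mem_kb),
    }

sample = "2004-07-21T23:29:40Z 1543 210 1333 1.2(1.5) 23(30) 4 20 32600"
stats = parse_progress_line(sample)
print(stats["queued"], stats["docs_per_sec"])  # 210 (1.2, 1.5)
```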
This log captures unexpected exceptions and errors that occur during the crawl. Some may be due to hardware limitations (out of memory, although that error may occur without being written to this log), but most are probably caused by software bugs, either in Heritrix's core or, more likely, in one of the pluggable classes.
Contains errors in dealing with encountered URIs; usually these are caused by erroneous URIs. Generally only of interest to advanced users trying to explain unexpected crawl behavior.
The recover.gz file is a gzipped journal of Frontier events. It can be used to restore the Frontier after a crash to roughly the state it had before the crash. See Section 9.3, “Recovery of Frontier State and recover.gz” to learn more.
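Since the journal is gzipped text with one event per line, it can be inspected with standard tools. The sketch below simply tallies the leading token of each line; it makes no assumption about Heritrix's actual event codes, and the tag values in the test data are invented for illustration:

```python
import gzip
from collections import Counter

def summarize_recovery_journal(path):
    """Count the leading tag of each line in a gzipped event journal."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.strip():
                counts[line.split(None, 1)[0]] += 1
    return dict(counts)
```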
Heritrix's WUI offers a couple of reports on ongoing and completed crawl jobs.
Both are accessible via the Reports tab.
Although jobs are reloaded after restarts of the software, their statistics are not reloaded with them. This means these reports are only available as long as Heritrix has not been shut down. All of the information is, however, replicated in report files at the end of each crawl for permanent storage.
At the top of the crawl report some general statistics about the crawl are printed out. All of these replicate data from the Console so you should refer to Section 7.1, “Web Console” for more information on them.
Next are statistics about the number of URIs pending, discovered, currently queued, downloaded, etc. Question marks after most of the values provide pop-up descriptions of those metrics.
Following that is a breakdown of the distribution of status codes among URIs. It is sorted from most frequent to least. The number of URIs found for each status code is displayed. Only successful fetches are counted here.
A similar breakdown for file types (mime types) follows. In addition to the number of URIs per file type, the amount of data for that file type is also displayed.
Last, a breakdown per host is provided, with the number of URIs and amount of data for each. For ongoing crawls, the time that has elapsed since the last URI was finished for each host is also displayed. This value can provide valuable insight into which hosts are still being actively crawled. Note that this value is only available while the crawl is in progress, since it has no meaning afterwards. Also, any pauses made to the crawl may distort these values, at least in the short term following resumption of crawling; most noticeably, while paused, all of these values will continue to grow.
Especially in broad crawls, this list can grow very large.
This report lists all the seeds in the seeds file, along with any discovered seeds if that option is enabled (that is, if redirects from seeds are treated as new seeds). For each seed, the status code of the fetch attempt is presented in verbose form (that is, with a minimal textual description of its meaning). Following that is the seed's disposition: a quick indication of whether the seed was successfully crawled, not attempted, or failed to crawl.
Successfully crawled seeds are any that Heritrix had no internal errors crawling; the seed may nevertheless have generated a 404 (file not found) error.
Failure to crawl might be because of a bug in Heritrix or an invalid seed (commonly DNS lookup will have failed).
If the report is examined before the crawl is finished, there may still be seeds not yet attempted, especially if there is trouble getting their prerequisites or if the seed list is exceptionally large.