8. Analysis of jobs

Heritrix offers several facilities for examining the details of a crawl. The reports and logs are also available at run time.

8.1. Completed jobs

The Jobs tab (and the page headers) shows how many completed jobs there are, along with a link to a page that lists them.

The following information / options are provided for each completed job:

  • UID

    Each job has a unique (generated) ID. This is actually a time stamp. It differentiates jobs with the same name from one another.

    This ID is used (among other things) for creating the job's directory on disk.

  • Job name

    The name that the user gave the job.

  • Status

    Status of the job. Indicates how it ended.

  • Options

    In addition, the following options are available for each job.

    • Crawl order

      Opens the actual XML file of the job's configuration in a separate window. Generally only of interest to advanced users.

    • Crawl report

      Takes the user to the job's Crawl report (Section 8.3.1, “Crawl report”).

    • Seeds report

      Takes the user to the job's Seeds report (Section 8.3.2, “Seeds report”).

    • Seed file

      Displays the job's seed file.

    • Logs

      Takes the user to the job's logs (Section 8.2, “Logs”).

    • Journal

      Takes the user to the Journal page for the job (Section 7.4.1, “Journal”). Users can still add entries to it.

    • Delete

      Marks the job as deleted. This will remove it from the WUI but not from disk.

Note

It is not possible to directly access the configuration for completed jobs in the same way as for new, pending and running jobs. Instead, users can look at the actual XML configuration file or create a new job based on the old one. The new job (which need never be run) will perfectly mirror the settings of the old one.

8.2. Logs

Heritrix writes several logs as it crawls a job. Each crawl job has its own set of these logs.

The location where logs are written can be configured (an expert setting). Otherwise, refer to crawl-manifest.txt for the on-disk location of the logs (Section 9.1.2, “crawl-manifest.txt”).

Logs can be manually rotated. Pause the crawl, and a Rotate Logs link will appear at the base of the screen. Clicking Rotate Logs moves aside all current crawl logs, appending a 14-digit GMT timestamp to the moved-aside logs. New log files are opened for the crawler to use in subsequent crawling.
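
For illustration, the following minimal Java sketch (not part of Heritrix) produces such a 14-digit GMT timestamp; the rotated file name printed at the end is only an assumed example of the naming, not the documented convention.

  import java.text.SimpleDateFormat;
  import java.util.Date;
  import java.util.TimeZone;

  public class RotationStamp {
      public static void main(String[] args) {
          // 14 digits, year down to seconds, rendered in GMT.
          SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss");
          fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
          String stamp = fmt.format(new Date());
          // Hypothetical rotated name; the exact naming is an assumption here.
          System.out.println("crawl.log." + stamp);
      }
  }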

The WUI offers users four ways of viewing these logs:

  1. Line number

    View a section of a log that starts at a given line number and includes the X lines following it. X is configurable and is 50 by default.

  2. Time stamp

    View a section of a log that starts at a given time stamp and includes the X lines following it. X is configurable and is 50 by default. The format of the time stamp is the same as in the logs (YYYY-MM-DDTHH:MM:SS.SSS). It is not necessary to supply more detail than is desired. For instance, the entry 2004-04-25T08 will match the first entry made after 8 am on the 25th of April, 2004.

  3. Regular expression

    Filter the log based on a regular expression. Only lines matching it (and, optionally, indented lines following it, which usually means they are related to the preceding entry) are displayed. A stand-alone sketch of this matching behavior appears after this list.

    This can be an expensive operation on really big logs, requiring a lot of time for the page to load.

  4. Tail

    Allows users to look at just the last X lines of the given log. X is configurable and is 50 by default.
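
The regular expression view described above can also be approximated outside the WUI. The following minimal, stand-alone Java sketch (the file path and pattern are command-line placeholders, not Heritrix options) prints every line matching a pattern and, optionally, the indented lines that follow it:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.util.regex.Pattern;

  public class LogGrep {
      public static void main(String[] args) throws Exception {
          // args[0] is a log file path, args[1] is the regular expression.
          Pattern pattern = Pattern.compile(args[1]);
          boolean includeIndented = true;   // also print indented follow-up lines
          BufferedReader in = new BufferedReader(new FileReader(args[0]));
          boolean lastMatched = false;
          String line;
          while ((line = in.readLine()) != null) {
              boolean indented = line.startsWith(" ") || line.startsWith("\t");
              if (pattern.matcher(line).find()) {
                  System.out.println(line);
                  lastMatched = true;
              } else if (includeIndented && lastMatched && indented) {
                  // Indented lines are usually related to the preceding entry.
                  System.out.println(line);
              } else {
                  lastMatched = false;
              }
          }
          in.close();
      }
  }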

8.2.1. crawl.log

Each URI that Heritrix attempts gets an entry in the crawl.log, regardless of success or failure.

Below is a two line extract from a crawl.log:

2004-07-21T23:29:40.438Z   200        310 http://127.0.0.1:9999/selftest/Charset/charsetselftest_end.html LLLL http://127.0.0.1:9999/selftest/Charset/shiftjis.jsp text/html #000 20040721232940401+10 M77KNTBZH2IU6V2SIG5EEG45EJICNQNM -
2004-07-21T23:29:40.502Z   200        225 http://127.0.0.1:9999/selftest/MaxLinkHops/5.html LLLLL http://127.0.0.1:9999/selftest/MaxLinkHops/4.html text/html #000 20040721232940481+12 M77KNTBZH2IU6V2SIG5EEG45EJICNQNM -

The 1st column is a timestamp in ISO8601 format, to millisecond resolution. The time is the instant of logging. The 2nd column is the fetch status code. Usually this is the HTTP status code but it can also be a negative number if URL processing was unexpectedly terminated. See Status codes for a listing of possible values.

The 3rd column is the size of the downloaded document in bytes. For HTTP, this is the size of the content only; it excludes the size of the HTTP response headers. For DNS, it is the total size of the DNS response. The 4th column is the URI of the document downloaded. The 5th column holds breadcrumb codes showing the trail of downloads that led to the current URI. See Discovery path for a description of the possible code values. The 6th column holds the URI that immediately referenced this URI (the 'referrer'). Both of the latter two fields -- the discovery path and the referrer URL -- will be empty for cases such as the seed URIs.

The 7th column holds the document mime type, the 8th column the id of the worker thread that downloaded this document, and the 9th column a timestamp (in RFC2550/ARC condensed digits-only format) indicating when the network fetch was begun and, if appropriate, the millisecond duration of the fetch, separated from the begin-time by a '+' character.

The 10th column is a SHA1 digest of the content only (headers are not digested). The 11th column is the 'source tag' inherited by this URI, if that feature is enabled. Finally, the 12th column holds “annotations”, if any have been set. Possible annotations include: the number of times the URI was tried (this field is '-' if the download was never retried); the literal lenTrunc if the download was truncated because it exceeded configured limits; timeTrunc if the download was truncated because the download time exceeded configured limits; or midFetchTrunc if a midfetch filter determined the download should be truncated.
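
As a rough illustration of this layout, the following stand-alone Java sketch (not part of Heritrix) splits the second sample line above into its whitespace-separated columns. It assumes that the source-tag column is simply absent when that feature is disabled, so the annotations are taken from the last column.

  public class CrawlLogLine {
      public static void main(String[] args) {
          // Second sample line from the extract above.
          String line = "2004-07-21T23:29:40.502Z   200        225"
              + " http://127.0.0.1:9999/selftest/MaxLinkHops/5.html LLLLL"
              + " http://127.0.0.1:9999/selftest/MaxLinkHops/4.html text/html"
              + " #000 20040721232940481+12 M77KNTBZH2IU6V2SIG5EEG45EJICNQNM -";
          String[] f = line.trim().split("\\s+");    // columns are whitespace-separated
          String loggedAt   = f[0];                      // logging instant, ISO8601
          int    status     = Integer.parseInt(f[1]);    // fetch status code, may be negative
          long   sizeBytes  = Long.parseLong(f[2]);      // content size, headers excluded
          String uri        = f[3];                      // downloaded URI
          String hopPath    = f[4];                      // discovery path
          String referrer   = f[5];                      // immediately referring URI
          String mimeType   = f[6];
          String threadId   = f[7];                      // worker thread id
          String fetchField = f[8];                      // begin-time[+duration in ms]
          String digest     = f[9];                      // SHA1 of content only
          // Source tag only present when enabled (an assumption of this sketch).
          String sourceTag   = (f.length > 11) ? f[10] : null;
          String annotations = f[f.length - 1];          // '-' if none
          String fetchBegin  = fetchField.split("\\+")[0];
          String durationMs  = fetchField.contains("+")
                  ? fetchField.substring(fetchField.indexOf('+') + 1) : null;
          System.out.println(uri + " -> " + status + ", " + sizeBytes + " bytes of "
                  + mimeType + ", fetched at " + fetchBegin
                  + (durationMs == null ? "" : " in " + durationMs + " ms"));
          System.out.println("hops=" + hopPath + " via=" + referrer + " thread=" + threadId
                  + " digest=" + digest + " source=" + sourceTag
                  + " annotations=" + annotations + " logged=" + loggedAt);
      }
  }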

8.2.2. local-errors.log

Errors that occur while processing a URI but that can be handled by the processors (usually network-related problems when trying to fetch the document) are logged here.

Generally these can be safely ignored, but they can provide insight to advanced users when other logs and/or reports show unusual data.

8.2.3. progress-statistics.log

This log is written by the StatisticsTracker (Section 6.1.4, “Statistics Tracking”).

At configurable intervals a line about the progress of the crawl is written to this file.

The legends are as follows:

  • timestamp

    Timestamp indicating when the line was written, in ISO8601 format.

  • discovered

    Number of URIs discovered to date.

  • queued

    Number of URIs queued at the moment.

  • downloaded

    Number of URIs downloaded to date.

  • doc/s(avg)

    Number of documents downloaded per second since the last snapshot. The value in parentheses is the average since the crawl began.

  • KB/s(avg)

    Amount in kilobytes downloaded per second since the last snapshot. The value in parentheses is the average since the crawl began.

  • dl-failures

    Number of URIs that Heritrix has failed to download to date.

  • busy-thread

    Number of toe threads currently busy processing a URI.

  • mem-use-KB

    Amount of memory currently assigned to the Java Virtual Machine.
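
For illustration, the following stand-alone Java sketch (not part of Heritrix) parses one such line, assuming whitespace-separated columns in the order of the legends above; the example line itself is made up, and the rate columns are assumed to carry the current value with the crawl-wide average in parentheses, as described above.

  public class ProgressStatsLine {
      public static void main(String[] args) {
          // Hypothetical example line; the columns follow the legends listed above.
          String line = "2004-07-21T23:30:00Z 1547 1022 525 3.2(2.8) 145(120) 4 25 42336";
          String[] f = line.trim().split("\\s+");
          String timestamp = f[0];
          long discovered  = Long.parseLong(f[1]);
          long queued      = Long.parseLong(f[2]);
          long downloaded  = Long.parseLong(f[3]);
          // doc/s(avg) and KB/s(avg): current rate with the overall average
          // in parentheses, e.g. "3.2(2.8)".
          String[] docRate = f[4].replace(")", "").split("\\(");
          String[] kbRate  = f[5].replace(")", "").split("\\(");
          long dlFailures  = Long.parseLong(f[6]);
          int  busyThreads = Integer.parseInt(f[7]);
          long memUseKB    = Long.parseLong(f[8]);
          System.out.println(timestamp + ": " + downloaded + " of " + discovered
                  + " URIs downloaded, " + queued + " queued, "
                  + docRate[0] + " docs/s (avg " + docRate[1] + "), "
                  + kbRate[0] + " KB/s (avg " + kbRate[1] + "), "
                  + dlFailures + " failures, " + busyThreads + " busy threads, "
                  + memUseKB + " KB of memory in use");
      }
  }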

8.2.4. runtime-errors.log

This log captures unexpected exceptions and errors that occur during the crawl. Some may be due to hardware limitations (out of memory, although that error may occur without being written to this log), but most are probably caused by software bugs, either in Heritrix's core or, more likely, in one of the pluggable classes.

8.2.5. uri-errors.log

Contains errors in dealing with encountered URIs. Usually these are caused by erroneous URIs. Generally only of interest to advanced users trying to explain unexpected crawl behavior.

8.2.6. recover.gz

The recover.gz file is a gzipped journal of Frontier events. It can be used to restore the Frontier after a crash to roughly the state it had before the crash. See Section 9.3, “Recovery of Frontier State and recover.gz” to learn more.

8.3. Reports

Heritrix's WUI offers a couple of reports on ongoing and completed crawl jobs.

Both are accessible via the Reports tab.

Note

Although jobs are loaded after restarts of the software, their statistics are not reloaded with them. That means that these reports are only available as long as Heritrix is not shut down. All of the information is, however, replicated in report files at the end of each crawl for permanent storage.

8.3.1. Crawl report

At the top of the crawl report some general statistics about the crawl are printed out. All of these replicate data from the Console so you should refer to Section 7.1, “Web Console” for more information on them.

Next in line are statistics about the number of URIs pending, discovered, currently queued, downloaded, etc. Question marks after most of the values provide pop-up descriptions of those metrics.

Following that is a breakdown of the distribution of status codes among URIs. It is sorted from most frequent to least. The number of URIs found for each status code is displayed. Only successful fetches are counted here.

A similar breakdown for file types (mime types) follows. In addition to the number of URIs per file type, the amount of data for that file type is also displayed.

Last, a breakdown per host is provided, showing the number of URIs and the amount of data for each. For ongoing crawls, the time that has elapsed since the last URI was finished for each host is also displayed. This value can provide valuable data on which hosts are still being actively crawled. Note that this value is only available while the crawl is in progress, since it has no meaning afterwards. Also, any pauses made to the crawl may distort these values, at least in the short term following resumption of crawling. Most noticeably, while paused all of these values will continue to grow.

Especially in broad crawls, this list can grow very large.
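
Because the statistics behind this report are not kept across restarts, similar breakdowns can be recomputed offline from crawl.log. The following stand-alone Java sketch (not part of Heritrix; the crawl.log path is passed as an argument) tallies URIs per status code, mime type and host, using the column layout described in Section 8.2.1. Unlike the WUI report, this simple version counts every entry, including failed fetches.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.net.URI;
  import java.util.HashMap;
  import java.util.Map;

  public class CrawlLogBreakdown {
      public static void main(String[] args) throws Exception {
          Map<String, Integer> byStatus = new HashMap<String, Integer>();
          Map<String, Integer> byMime   = new HashMap<String, Integer>();
          Map<String, Integer> byHost   = new HashMap<String, Integer>();
          BufferedReader in = new BufferedReader(new FileReader(args[0]));  // path to a crawl.log
          String line;
          while ((line = in.readLine()) != null) {
              String[] f = line.trim().split("\\s+");
              if (f.length < 7) {
                  continue;   // not a regular crawl.log entry
              }
              bump(byStatus, f[1]);   // fetch status code
              bump(byMime, f[6]);     // mime type
              String host;
              try {
                  host = URI.create(f[3]).getHost();
              } catch (IllegalArgumentException e) {
                  host = null;        // malformed URI; fall back to the raw field
              }
              bump(byHost, host == null ? f[3] : host);
          }
          in.close();
          System.out.println("status codes: " + byStatus);
          System.out.println("mime types:   " + byMime);
          System.out.println("hosts:        " + byHost);
      }

      private static void bump(Map<String, Integer> map, String key) {
          Integer n = map.get(key);
          map.put(key, n == null ? 1 : n + 1);
      }
  }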

8.3.2. Seeds report

This report lists all the seeds in the seeds file, as well as any discovered seeds if that option is enabled (that is, treating redirects from seeds as new seeds). For each seed, the status code for the fetch attempt is presented in verbose form (that is, with a minimal textual description of its meaning). Following that is the seed's disposition, a quick look at whether the seed was successfully crawled, not attempted, or failed to crawl.

Successfully crawled seeds are any that Heritrix had no internal errors crawling; the seed may nevertheless have generated a 404 (file not found) error.

Failure to crawl might be due to a bug in Heritrix or an invalid seed (commonly the DNS lookup will have failed).

If the report is examined before the crawl has finished, there might still be seeds not yet attempted, especially if there is trouble getting their prerequisites or if the seed list is exceptionally large.