Glossary

Some definitions

Bytes, KB and statistics

Heritrix adheres to the following conventions for displaying byte and bit amounts:

  Legend Type
       B Bytes
      KB Kilobytes - 1 KB = 1024 B
      MB Megabytes - 1 MB = 1024 KB
      GB Gigabytes - 1 GB = 1024 MB
  
       b bits
      Kb Kilobits - 1 Kb = 1000 b
      Mb Megabits - 1 Mb = 1000 Kb
      Gb Gigabits - 1 Gb = 1000 Mb

This also applies to all logs.
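As an illustration of the two conventions, here is a minimal sketch of a formatter that scales byte amounts by 1024 and bit amounts by 1000. The function names and the two-decimal output format are assumptions made for this example; they are not taken from the Heritrix code.

```python
def _scale(amount, step, units):
    """Divide `amount` by `step` until it fits the largest suitable unit."""
    value = float(amount)
    for unit in units[:-1]:
        if abs(value) < step:
            return f"{value:.2f} {unit}"
        value /= step
    # Anything larger is reported in the biggest unit listed.
    return f"{value:.2f} {units[-1]}"


def format_bytes(num_bytes):
    """Byte amounts: 1 KB = 1024 B, 1 MB = 1024 KB, 1 GB = 1024 MB."""
    return _scale(num_bytes, 1024, ["B", "KB", "MB", "GB"])


def format_bits(num_bits):
    """Bit amounts: 1 Kb = 1000 b, 1 Mb = 1000 Kb, 1 Gb = 1000 Mb."""
    return _scale(num_bits, 1000, ["b", "Kb", "Mb", "Gb"])


print(format_bytes(1536))  # 1.50 KB
print(format_bits(1500))   # 1.50 Kb
```

Note that the same number yields different results depending on whether it is treated as bytes or bits, which is why the capitalization of the unit matters when reading logs.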

Checkpointing

Heritrix checkpointing has been heavily influenced by what Mercator provided. In one of the papers on Mercator it is described this way: “Checkpointing is an important part of any long-running process such as a web crawl. By checkpointing we mean writing a representation of the crawler's state to stable storage that, in the event of a failure, is sufficient to allow the crawler to recover its state by reading the checkpoint and to resume crawling from the exact state it was in at the time of the checkpoint. By this definition, in the event of a failure, any work performed after the most recent checkpoint is lost, but none of the work up to the most recent checkpoint. In Mercator, the frequency with which the background thread performs a checkpoint is user-configurable; we typically checkpoint anywhere from 1 to 4 times per day.”

See Section 9.4, “Checkpointing” for discussion of the Heritrix implementation.

CrawlURI

A URI and its associated data such as parent URI, number of links from seed etc.

Dates and times

All times in Heritrix are GMT, assuming the clock and time zone on the local system are correct.

This means that all dates and times in logs are GMT, all dates and times shown in the WUI are GMT, and any dates or times entered by the user must be in GMT.
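Because values entered by the user must already be in GMT, it can help to convert a local time before typing it in. The following is a minimal sketch of such a conversion; the function name and the output layout are assumptions for illustration only and do not reflect the exact format the WUI expects.

```python
from datetime import datetime, timedelta, timezone


def to_gmt_string(local_time):
    """Convert a timezone-aware datetime to a GMT (UTC) string."""
    if local_time.tzinfo is None:
        # A naive datetime carries no offset, so the conversion would be a guess.
        raise ValueError("local_time must carry timezone information")
    return local_time.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


# Hypothetical operator working five hours behind GMT.
operator_zone = timezone(timedelta(hours=-5))
print(to_gmt_string(datetime(2024, 1, 15, 9, 30, tzinfo=operator_zone)))
# 2024-01-15 14:30:00
```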

Discovered URIs

Any URI that has been confirmed to be within 'scope'. This includes those that have been processed, are being processed and have finished processing. It does not include URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely because the operator changed the scope definition).

Note

This only counts discovered URIs. Since the same URI can (at least in most frontiers) be fetched multiple times, this number may be somewhat lower than the combined total of queued, in-process and finished items, because duplicate URIs are queued and processed. This variance is likely to be especially high in Frontiers implementing 'revisit' strategies.

Discovery path

Each URI has a discovery path. The path contains one character for each link or embed followed from the seed.

The character legend is as follows.

  R - Redirect
  E - Embed
  X - Speculative embed (aggressive/Javascript link extraction)
  L - Link
  P - Prerequisite (as for DNS or robots.txt before another URI)

The discovery path of seeds is an empty string.
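The legend above can be applied mechanically to any discovery path. Here is a minimal sketch that expands a path into its hop types; the names `HOP_TYPES` and `describe_path` are hypothetical and chosen for this example.

```python
# One entry per character in the discovery path legend.
HOP_TYPES = {
    "R": "Redirect",
    "E": "Embed",
    "X": "Speculative embed",
    "L": "Link",
    "P": "Prerequisite",
}


def describe_path(path):
    """Return the hop types, in order from the seed, for a discovery path."""
    hops = []
    for char in path:
        if char not in HOP_TYPES:
            raise ValueError(f"unknown discovery path character: {char!r}")
        hops.append(HOP_TYPES[char])
    return hops


print(describe_path("LLE"))  # ['Link', 'Link', 'Embed']
print(describe_path(""))     # [] (a seed has an empty discovery path)
```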

Frontier

A Frontier is a pluggable module in Heritrix that maintains the internal state of the crawl. See Section 6.1.2, “Frontier”.

"Holding Jobs" vs. "Crawling Jobs"

The mode Crawling Jobs generally means that the crawler will start executing a job as soon as one is made available in the pending jobs queue (as long as there is not a job already running).

If the crawler is in the Holding Jobs mode, jobs added to the pending jobs queue will be held; they will not be started, even if there are no jobs currently being run.

Host

A host can serve multiple domains or a single domain can be served by multiple hosts. For our purposes so far, host == hostname in URI. DNS is not considered; it is volatile and may be unavailable. So when Heritrix gets the URIs...

  http://www.example.com
  http://search.example.com
  http://201.199.7.15
...even if they all point to the 201.199.7.15 IP, they are 3 different logical hosts (at the level of the URI/HTTP protocol).

Conformant HTTP proxies behave similarly, we think: even if they know www.example.com == 201.199.7.15, they will not consider the two interchangeable.

This is not ideal for politeness, where we would want the politeness rules to apply to the physical host rather than the logical one.
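To make the "host == hostname in URI" rule concrete, the following minimal sketch extracts the logical host with Python's standard `urllib.parse` module and performs no DNS lookup. The helper name `logical_host` is an assumption for this example and is not a Heritrix function.

```python
from urllib.parse import urlsplit


def logical_host(uri):
    """Return the hostname exactly as written in the URI (no DNS lookup)."""
    host = urlsplit(uri).hostname
    if host is None:
        raise ValueError(f"URI has no host component: {uri!r}")
    return host


uris = [
    "http://www.example.com",
    "http://search.example.com",
    "http://201.199.7.15",
]
# Three distinct logical hosts, whatever IP address they resolve to.
print(sorted({logical_host(u) for u in uris}))
```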

Link hop count

Number of links followed from the seed to the current URI. Seeds have a link hop count of 0.

This number is equal to the number of 'L's in a URI's discovery path.
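Stated as code, the rule is a one-line count. This is a minimal sketch; `link_hop_count` is a hypothetical name used only for this example.

```python
def link_hop_count(discovery_path):
    """Count the 'L' (link) hops in a discovery path; other hop types are ignored."""
    return discovery_path.count("L")


print(link_hop_count(""))      # 0 (a seed)
print(link_hop_count("LLEL"))  # 3 (the embed hop is not counted)
```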

Pending URIs

Number of URIs that are awaiting detailed processing.

Number of discovered URIs that have not been inspected for scope or duplicates. Depending on the implementation of the Frontier this might always be zero. It may also be an adjusted number that tries to account for duplicates by estimation.

Politeness

Politeness refers to attempts by the crawler software to limit load on a site. Without politeness restrictions the crawler might otherwise overwhelm smaller sites and even cause moderately sized sites to slow down significantly.

Unless you have express permission to crawl a site aggressively, you should apply strict politeness rules to any crawl.

Queued URIs

Number of URIs queued up and waiting for processing.

This includes any URIs that failed but will be retried. Basically, this is any discovered URI that has neither been processed nor is currently being processed.

Regular expressions

All regular expressions used by Heritrix are Java regular expressions.

Java regular expressions differ from those used in Perl, for example, in several ways. For detailed info on Java regular expressions see the Java API for java.util.regex.Pattern on Sun's home page (java.sun.com).

For the API of Java SE v1.4.2, see http://java.sun.com/j2se/1.4.2/docs/api/index.html. It is recommended that you look up the API for the version of Java that is being used to run Heritrix.

Server

A server is a service on a Host. There might be more than one service on a host differentiated by port number.
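One way to picture the distinction is to identify a server by a (host, port) pair. The sketch below is only an illustration: the name `server_key` is hypothetical, and filling in ports 80 and 443 when the URI omits a port reflects the usual HTTP and HTTPS defaults rather than anything stated in this manual.

```python
from urllib.parse import urlsplit

# Usual default ports, applied when the URI does not name a port.
DEFAULT_PORTS = {"http": 80, "https": 443}


def server_key(uri):
    """Return a (host, port) pair identifying the service a URI addresses."""
    parts = urlsplit(uri)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return (parts.hostname, port)


print(server_key("http://www.example.com/"))       # ('www.example.com', 80)
print(server_key("http://www.example.com:8080/"))  # ('www.example.com', 8080)
```

Both URIs share one host but address two different servers.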

Status codes

Each crawled URI gets a status code. This code (or number) is an indication of what happened when Heritrix tried to fetch the URI.

Codes ranging from 200 to 599 are standard HTTP response codes; information about their meanings is available on the World Wide Web Consortium's web site.

Other status codes used by Heritrix (From org.archive.crawler.datamodel.FetchStatusCodes):

   Code Meaning
      1 Successful DNS lookup
      0 Fetch never tried (perhaps protocol unsupported or illegal URI)
     -1 DNS lookup failed
     -2 HTTP connect failed
     -3 HTTP connect broken
     -4 HTTP timeout (before any meaningful response received)
     -5 Unexpected runtime exception; see runtime-errors.log
     -6 Prerequisite domain-lookup failed, precluding fetch attempt
     -7 URI recognized as unsupported or illegal
     -8 Multiple retries all failed, retry limit reached
    -50 Temporary status assigned URIs awaiting preconditions; appearance in
        logs may be a bug
    -60 Failure status assigned URIs which could not be queued by the 
        Frontier (and may in fact be unfetchable)
    -61 Prerequisite robots.txt-fetch failed, precluding a fetch attempt
    -62 Some other prerequisite failed, precluding a fetch attempt
    -63 A prerequisite (of any type) could not be scheduled, precluding a 
        fetch attempt
  -3000 Severe Java 'Error' conditions (OutOfMemoryError, StackOverflowError,
        etc.) during URI processing.
  -4000 'chaff' detection of traps/content of negligible value applied
  -4001 Too many link hops away from seed
  -4002 Too many embed/transitive hops away from last URI in scope
  -5000 Out of scope upon reexamination (only happens if scope changes during 
        crawl)
  -5001 Blocked from fetch by user setting
  -5002 Blocked by a custom processor
  -5003 Blocked due to exceeding an established quota
  -5004 Blocked due to exceeding an established runtime
  -6000 Deleted from Frontier by user
  -7000 Processing thread was killed by the operator (perhaps because of a
        hung condition)
  -9998 Robots.txt rules precluded fetch

Note

Codes and explanations are also available under the Help link in the web UI.

Please note that status codes defined by Heritrix may change between versions. In particular, new codes may be added to cover a wider array of situations.
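When reading crawl logs it can be handy to turn a status code back into its meaning. The following is a minimal sketch of such a lookup. The names `HERITRIX_CODES` and `describe_status` are hypothetical, and the dictionary holds only a small subset of the table above, so it would need extending (and checking against the Heritrix version in use) before real use.

```python
# A subset of the Heritrix-specific codes listed in the table above.
HERITRIX_CODES = {
    1: "Successful DNS lookup",
    0: "Fetch never tried (perhaps protocol unsupported or illegal URI)",
    -1: "DNS lookup failed",
    -2: "HTTP connect failed",
    -4: "HTTP timeout (before any meaningful response received)",
    -8: "Multiple retries all failed, retry limit reached",
    -4001: "Too many link hops away from seed",
    -9998: "Robots.txt rules precluded fetch",
}


def describe_status(code):
    """Explain a status code taken from a crawl log line."""
    if 200 <= code <= 599:
        return f"HTTP response code {code}"
    return HERITRIX_CODES.get(code, "Unknown status code")


print(describe_status(404))    # HTTP response code 404
print(describe_status(-9998))  # Robots.txt rules precluded fetch
```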

SURT

SURT stands for Sort-friendly URI Reordering Transform, and is a transformation applied to URIs which makes their left-to-right representation better match the natural hierarchy of domain names.

A URI <scheme://domain.tld/path?query> has SURT form <scheme://(tld,domain,)/path?query>.

Conversion to SURT form also involves making all characters lowercase, and changing the 'https' scheme to 'http'. Further, the '/' after a URI authority component -- for example, the third slash in a regular HTTP URI -- will only appear in the SURT form if it appeared in the plain URI form. (This convention proves important when using real URIs as a shorthand for SURT prefixes, as described below.)

SURT form URIs are typically not used to specify exact URIs for fetching. Rather, SURT form is useful when comparing or sorting URIs. For example, URIs in SURT format sort into natural groups -- all 'archive.org' URIs will be adjacent, regardless of what subdomains like 'books.archive.org' or 'movies.archive.org' are used.

Most importantly, a SURT form URI, or a truncated version of a SURT form URI, can be used as a SURT prefix. A SURT prefix will often correspond to all URIs within a common 'area' of interest for crawling. For example, the prefix <http://(is,> will be shared by all URIs in the '.is' top-level domain.
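The transformation described in this entry can be sketched in a few lines. This is a simplified illustration, not the Heritrix implementation: the name `to_surt` is hypothetical, and the sketch assumes a plain `scheme://host/path?query` URI without user info or port, and does not treat IP addresses specially.

```python
def to_surt(uri):
    """Convert a simple URI to SURT form (simplified sketch)."""
    uri = uri.lower()                      # SURT form is all lowercase
    scheme, sep, rest = uri.partition("://")
    if not sep:
        raise ValueError(f"not a scheme://authority URI: {uri!r}")
    if scheme == "https":
        scheme = "http"                    # 'https' is changed to 'http'
    # The authority runs up to the first '/', '?' or '#', if any.
    end = len(rest)
    for index, char in enumerate(rest):
        if char in "/?#":
            end = index
            break
    authority, tail = rest[:end], rest[end:]
    # Reverse the dot-separated labels and join them with trailing commas.
    reversed_host = ",".join(reversed(authority.split("."))) + ","
    # `tail` is kept as-is, so a '/' appears only if the plain URI had one.
    return f"{scheme}://({reversed_host}){tail}"


print(to_surt("http://www.archive.org/movies"))  # http://(org,archive,www,)/movies
print(to_surt("https://WWW.Archive.org"))        # http://(org,archive,www,)
```

Sorting a list of such strings places all 'archive.org' URIs next to each other, because every one of them begins with `http://(org,archive,`.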

SURT prefix

A URI in SURT form, especially if truncated, may be of use as a "SURT prefix", a shared prefix string of all SURT form URIs in the same 'area' of interest for web crawling.

For example, the prefix <http://(is,> will be shared by all SURT form URIs in the '.is' top-level domain. The prefix <http://(org,archive,www,)/movies> (which is also a valid full SURT form URI) will be shared by all URIs at www.archive.org with a path beginning '/movies'.

A collection of sorted SURT prefixes is an efficient way to specify a desired crawl scope: any URI whose SURT form starts with any of the prefixes should be included.

A small set of conventions can also be used to calculate an "implied SURT prefix" from a regular URI, such as a URI supplied as a crawl seed. These conventions are:

  1. Convert the URI to its SURT form.

  2. If there are at least 3 slashes ('/') in the SURT form, remove everything after the last slash. As examples, <http://(org,example,www,)/main/subsection/> is unchanged; <http://(org,example,www,)/main/subsection> is truncated to <http://(org,example,www,)/main/>; <http://(org,example,www,)/> is unchanged; and <http://(org,example,www,)> is unchanged.

  3. If the resulting form ends in a closing parenthesis (')'), remove that parenthesis. So each of the above examples except for the last is unchanged, while the last <http://(org,example,www,)> becomes <http://(org,example,www,>.

This allows many seed URIs, in their usual form, to imply the most useful SURT prefixes for crawling related URIs -- with the presence or absence of a trailing '/' on URIs without further path-info being a subtle indicator as to whether subdomains of the supplied domain should be included.

For example, seed <http://www.archive.org/> will become SURT form and implied SURT prefix <http://(org,archive,www,)/>, and is the prefix of all SURT form URIs on www.archive.org. However, any subdomain URI like <http://homepages.www.archive.org/directory> would be ruled out, because its SURT form <http://(org,archive,www,homepages,)/directory> does not begin with the full SURT prefix, including the ')/', deduced from the seed.

In contrast, seed <http://www.archive.org> (note the lack of trailing slash) will become SURT form <http://(org,archive,www,)>, and implied SURT prefix <http://(org,archive,www,> (note the lack of trailing ')'). This will be the prefix of all URIs on www.archive.org, as well as any subdomain URIs like <http://homepages.www.archive.org/directory>, because the full SURT prefix appears in subdomain URI SURT forms.
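Steps 2 and 3 of the conventions, together with prefix-based scoping, can be sketched as follows. The input is assumed to be already in SURT form (step 1), and the names `implied_surt_prefix` and `in_scope` are hypothetical. The sketch simply counts '/' characters and so does not handle slashes inside a query string.

```python
def implied_surt_prefix(surt):
    """Derive the implied SURT prefix from a URI already in SURT form."""
    # Step 2: with at least 3 slashes, drop everything after the last one.
    if surt.count("/") >= 3:
        surt = surt[: surt.rindex("/") + 1]
    # Step 3: drop a trailing ')' so that subdomains also match.
    if surt.endswith(")"):
        surt = surt[:-1]
    return surt


def in_scope(surt, prefixes):
    """A URI is in scope if its SURT form starts with any of the prefixes."""
    return any(surt.startswith(prefix) for prefix in prefixes)


subdomain = "http://(org,archive,www,homepages,)/directory"

with_slash = implied_surt_prefix("http://(org,archive,www,)/")
without_slash = implied_surt_prefix("http://(org,archive,www,)")

print(with_slash, in_scope(subdomain, [with_slash]))        # ...www,)/ False
print(without_slash, in_scope(subdomain, [without_slash]))  # ...www, True
```

The linear scan in `in_scope` is chosen for brevity; with a large sorted prefix list, a binary search would be the more efficient option.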

Toe Threads

When crawling, Heritrix employs a configurable number of Toe Threads to process each URI.

Each of these threads will request a URI from the Frontier (Section 6.1.2, “Frontier”), apply each of the set Processors (Section 6.1.3, “Processing Chains”) to it and finally report it as completed back to the Frontier.
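The request, process, report cycle can be pictured with a small worker-pool sketch. Everything here is a stand-in rather than Heritrix code: a `queue.Queue` plays the role of the Frontier, plain functions play the role of Processors, and the names `run_toe_threads` and `toe_thread` are hypothetical.

```python
import queue
import threading


def run_toe_threads(uris, processors, num_threads=3):
    """Process every URI with a fixed pool of worker ('toe') threads."""
    frontier = queue.Queue()          # stand-in for the Frontier
    for uri in uris:
        frontier.put(uri)
    finished = []                     # results reported back as completed
    lock = threading.Lock()

    def toe_thread():
        while True:
            try:
                item = frontier.get_nowait()   # request a URI
            except queue.Empty:
                return                         # nothing left to crawl
            for processor in processors:       # apply each processor in turn
                item = processor(item)
            with lock:                         # report completion
                finished.append(item)

    threads = [threading.Thread(target=toe_thread) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return finished


results = run_toe_threads(["a", "b", "c"], [str.upper])
print(sorted(results))  # ['A', 'B', 'C']
```

The order in which results arrive depends on thread scheduling, which is why the example sorts them before printing.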