Heritrix - Glossary of Heritrix Terms

Checkpointing

Heritrix currently has no checkpointing facility. However, our conception of what we need is heavily influenced by what Mercator provided. In one of the papers on Mercator it is described this way:

Checkpointing is an important part of any long-running process such as
a web crawl. By checkpointing we mean writing a representation of the crawler's
state to stable storage that, in the event of a failure, is sufficient to allow
the crawler to recover its state by reading the checkpoint and to resume
crawling from the exact state it was in at the time of the checkpoint. By this
definition, in the event of a failure, any work performed after the most recent
checkpoint is lost, but none of the work up to the most recent checkpoint. In
Mercator, the frequency with which the background thread performs a checkpoint
is user-configurable; we typically checkpoint anywhere from 1 to 4 times per
day.
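
As a rough illustration of the definition quoted above, the following Java sketch writes a toy crawl state to disk and reads it back. The names CheckpointSketch, CrawlState, checkpoint, and recover are hypothetical and exist only for this example; they are not Heritrix or Mercator APIs, and Java serialization is merely an assumed storage format.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not a Heritrix or Mercator class.
final class CheckpointSketch {

    // Minimal stand-in for crawler state: the queue of URIs still to fetch.
    static final class CrawlState implements Serializable {
        private static final long serialVersionUID = 1L;
        final Deque<String> pending = new ArrayDeque<>();
    }

    // Write the state to a temporary file first, then move it into place,
    // so an interrupted write does not clobber the previous checkpoint.
    static void checkpoint(CrawlState state, Path target) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(tmp))) {
            out.writeObject(state);
        }
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Read the most recent checkpoint back into memory.
    static CrawlState recover(Path source) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(source))) {
            return (CrawlState) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("crawl", ".ckpt");
        CrawlState state = new CrawlState();
        state.pending.add("http://www.example.com/a");
        checkpoint(state, file);

        // Work done after the checkpoint is lost if the crawler fails here.
        state.pending.add("http://www.example.com/b");

        CrawlState restored = recover(file);
        System.out.println(restored.pending);
    }
}
```

The restored state contains only the URI that was queued before the checkpoint, which matches the quoted definition: work after the most recent checkpoint is lost, work before it is not.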

Frontier

A Frontier is a pluggable module in Heritrix that maintains the internal state of the crawl. See the URIFrontier interface Javadoc comments for more on what a frontier does.

Host

A host can serve multiple domains, or a single domain can be served by multiple hosts. For our purposes so far, host == hostname in the URI. DNS is not considered; it is volatile and may be unavailable. So when we get the URIs...

    http://www.example.com
    http://search.example.com
    http://201.199.7.15

...even if they all point to the 201.199.7.15 IP, they are three different logical hosts (at the level of the URI/HTTP protocol). We think conformant HTTP proxies behave similarly: even if they know www.example.com == 201.199.7.15, they will not consider the two interchangeable.

Treating hosts this way is not ideal for politeness, where we would want the rules to apply to the physical host rather than the logical one.
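
The logical-host rule described above can be sketched in a few lines of Java using the standard java.net.URI class. LogicalHostSketch and its methods are hypothetical names for this example and are not part of Heritrix; the sketch simply takes the hostname as written in the URI and performs no DNS lookup.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not a Heritrix class.
final class LogicalHostSketch {

    // The logical host is the hostname exactly as it appears in the URI.
    static String logicalHost(String uri) {
        return URI.create(uri).getHost();
    }

    // Collect the distinct logical hosts, preserving first-seen order.
    static Set<String> distinctHosts(List<String> uris) {
        Set<String> hosts = new LinkedHashSet<>();
        for (String uri : uris) {
            hosts.add(logicalHost(uri));
        }
        return hosts;
    }

    public static void main(String[] args) {
        List<String> uris = List.of(
                "http://www.example.com",
                "http://search.example.com",
                "http://201.199.7.15");
        // Three entries, even if all three names resolve to the same IP.
        System.out.println(distinctHosts(uris));
    }
}
```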

CrawlHost is used to represent our notion of a host, whereas CrawlServer is used to represent a server on that host. For example, there would be one CrawlHost for yahoo.com, one CrawlServer instance for yahoo.com:80, and another for yahoo.com:443.
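
To make the host/server distinction concrete, here is a minimal sketch of how a host key and a server key might be derived from a URI. HostServerKeySketch, hostKey, and serverKey are hypothetical names; this is not how the actual CrawlHost and CrawlServer classes are implemented, and the default ports (80 for http, 443 for https) are an assumption of the example.

```java
import java.net.URI;

// Hypothetical illustration only; not the Heritrix CrawlHost/CrawlServer classes.
final class HostServerKeySketch {

    // One key per logical host, regardless of port.
    static String hostKey(String uri) {
        return URI.create(uri).getHost();
    }

    // One key per host:port pair; fall back to the scheme's usual port
    // when the URI does not name one explicitly (assumed defaults).
    static String serverKey(String uri) {
        URI u = URI.create(uri);
        int port = u.getPort();
        if (port == -1) {
            port = "https".equalsIgnoreCase(u.getScheme()) ? 443 : 80;
        }
        return u.getHost() + ":" + port;
    }

    public static void main(String[] args) {
        String plain = "http://yahoo.com/";
        String secure = "https://yahoo.com/";
        System.out.println(hostKey(plain) + " / " + serverKey(plain));
        System.out.println(hostKey(secure) + " / " + serverKey(secure));
    }
}
```

Both URIs share the host key yahoo.com but yield different server keys, mirroring the one-CrawlHost, two-CrawlServer example above.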