Glossary

Checkpointing
Heritrix currently has no checkpointing facility. However, our conception of what we need is heavily influenced by what Mercator provided. One of the papers on Mercator describes it this way:

"Checkpointing is an important part of any long-running process such as a web crawl. By checkpointing we mean writing a representation of the crawler's state to stable storage that, in the event of a failure, is sufficient to allow the crawler to recover its state by reading the checkpoint and to resume crawling from the exact state it was in at the time of the checkpoint. By this definition, in the event of a failure, any work performed after the most recent checkpoint is lost, but none of the work up to the most recent checkpoint. In Mercator, the frequency with which the background thread performs a checkpoint is user-configurable; we typically checkpoint anywhere from 1 to 4 times per day."

Frontier
A Frontier is a pluggable module in Heritrix that maintains the internal state of the crawl. See the URIFrontier Interface javadoc comments for more on what a frontier does.

Host
A host can serve multiple domains, or a single domain can be served by multiple hosts. For our purposes so far, host == hostname in URI. DNS is not considered; it is volatile and may be unavailable. So when we get the URIs...

http://www.example.com
http://search.example.com
http://201.199.7.15

...we treat them as three separate hosts, even if they happen to be served by the same machine. This is not ideal for politeness, where we'd want politeness rules to apply to the physical host rather than the logical one.
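The "host == hostname in URI" rule can be illustrated with a minimal sketch. This is not Heritrix code: the class HostKeys and its methods hostKey and distinctHosts are hypothetical names invented for this example, and lowercasing the hostname is an assumption of the sketch. It uses only java.net.URI from the standard library and performs no DNS lookup, so the three example URIs come out as three different hosts.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of Heritrix.
final class HostKeys {

    // Returns the hostname exactly as it appears in the URI (lowercased).
    // No DNS resolution is attempted, matching "host == hostname in URI".
    // Lowercasing is an assumption of this sketch.
    public static String hostKey(String uri) {
        String host = URI.create(uri).getHost();
        if (host == null) {
            throw new IllegalArgumentException("no host in URI: " + uri);
        }
        return host.toLowerCase(Locale.ROOT);
    }

    // Collects the distinct logical hosts, in first-seen order.
    public static Set<String> distinctHosts(List<String> uris) {
        Set<String> hosts = new LinkedHashSet<>();
        for (String uri : uris) {
            hosts.add(hostKey(uri));
        }
        return hosts;
    }

    public static void main(String[] args) {
        List<String> uris = Arrays.asList(
                "http://www.example.com",
                "http://search.example.com",
                "http://201.199.7.15");
        // Three logical hosts, even if one physical machine serves them all.
        System.out.println(distinctHosts(uris));
    }
}
```

Because the key is purely textual, any politeness delay keyed on it applies per logical host; two names served by one machine would each get their own budget, which is the shortcoming noted above.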
CrawlHost is used to represent our notion of a Host (whereas CrawlServer is used to represent a Server on that host: e.g. there would be a