Package org.archive.crawler

Introduction to Heritrix.


Class Summary
CommandLineParser - Print Heritrix command-line usage message.
Heritrix - Main class for Heritrix crawler.
SimpleHttpServer - Wrapper for embedded Jetty server.
WebappLifecycle - Calls start and stop of Heritrix when Heritrix is bundled as a webapp.
 

Package org.archive.crawler Description

Introduction to Heritrix.

Heritrix is designed to be easily extensible via third-party modules.

Architecture

The software is divided into several packages of varying importance. The relationships between them are covered in greater depth after their introductions.

The root package (this one) contains the executable class Heritrix. That class loads the crawler and parses the command-line arguments. If a WUI (web UI) is to be launched, it launches it. It can also start jobs, with or without the WUI, that are specified in the command-line options.
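As an illustration only, the same entry point can be invoked programmatically when embedding the crawler; the sketch below simply forwards arguments to the executable class, and the option shown is a hypothetical placeholder (CommandLineParser documents the actual options).

    // Illustrative sketch: launch Heritrix by handing arguments to the executable
    // class, exactly as the command line would. The "--help" option is only a
    // placeholder assumption; CommandLineParser documents the real options.
    public class LaunchHeritrixExample {
        public static void main(String[] args) throws Exception {
            String[] crawlerArgs = (args.length > 0)
                ? args                           // pass through whatever was supplied
                : new String[] { "--help" };     // hypothetical: print usage and exit
            org.archive.crawler.Heritrix.main(crawlerArgs);
        }
    }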

framework

org.archive.crawler.framework

The framework package contains most of the core classes for running a crawl. It also contains a number of interfaces for extensible items, whose implementations can be found in other classes.

Heritrix is in effect divided into two types of classes.

  1. Core classes - these can often be configured but not replaced.
  2. Pluggable classes - these must implement a given interface or extend a specific class, but third parties can introduce their own implementations.
The framework thus contains a selection of the core classes, as well as a number of the interfaces and base classes for the pluggable classes.

datamodel

org.archive.crawler.datamodel

Contains various classes that make up the crawler's data structures, including such essentials as the CandidateURI and CrawlURI classes, which wrap the discovered URIs for processing.

admin

org.archive.crawler.admin

The admin package contains classes that are used by the Web UI. This includes some core classes and a specific implementation of the Statistics Tracking interface found in the framework package, designed to provide the UI with information about ongoing crawls.

Pluggable modules

The following is a listing of the types of pluggable modules found in Heritrix, with brief explanations of each and links to their respective API documentation.

Frontier

A Frontier maintains the internal state of a crawl while it is in progress: what URIs have been discovered, which should be crawled next, and so on.

Needless to say, this is one of the most important modules in any crawl, and the provided implementation should generally be appropriate unless a very different strategy for ordering URIs for crawling is desired.

Frontier is the interface that all Frontiers must implement.
The org.archive.crawler.frontier package contains the provided implementation of a Frontier along with its supporting classes.
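Conceptually, a worker thread drives the Frontier in a simple loop: hand it discovered URIs, ask for the next one to crawl, and report each back when done. The sketch below illustrates that contract only; the method names schedule(), isEmpty(), next() and finished() are assumptions made for illustration, not a substitute for the Frontier interface's actual documentation.

    import org.archive.crawler.datamodel.CandidateURI;
    import org.archive.crawler.datamodel.CrawlURI;
    import org.archive.crawler.framework.Frontier;

    // Conceptual sketch of the Frontier contract as seen from a worker thread.
    // Method names and signatures are assumptions for illustration; consult the
    // Frontier interface for the real API.
    public class FrontierLoopSketch {
        void drive(Frontier frontier, CandidateURI seed) throws Exception {
            frontier.schedule(seed);              // hand a discovered URI to the Frontier
            while (!frontier.isEmpty()) {         // assumed emptiness check
                CrawlURI curi = frontier.next();  // the Frontier decides crawl order
                // ... run the processing chains on curi ...
                frontier.finished(curi);          // report the disposition back
            }
        }
    }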

Processor

Processing Steps

When a URI is crawled, a ToeThread will execute a series of processors on it.

The processors are split into 5 distinct chains that are executed in sequence:

  1. Pre-fetch processing chain
  2. Fetch processing chain
  3. Extractor processing chain
  4. Write/Index processing chain
  5. Post-processing chain
Each of these chains contains any number of processors. The processors all inherit from a generic Processor. While the processors are divided into the five categories above, that is strictly a high-level configuration; any processor can be placed in any chain (although doing link extraction before fetching a document is clearly of no use).
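Adding a third-party processing step therefore amounts to subclassing the generic Processor and overriding its processing hook. The sketch below assumes the hook is a protected innerProcess(CrawlURI) method and that the constructor takes a name and a description; check the Processor API for the exact signatures.

    import org.archive.crawler.datamodel.CrawlURI;
    import org.archive.crawler.framework.Processor;

    // Minimal sketch of a third-party processor. The constructor arguments and
    // the innerProcess(CrawlURI) hook are assumptions for illustration; the real
    // signatures are defined by the Processor base class in the framework package.
    public class NoteUriProcessor extends Processor {

        public NoteUriProcessor(String name) {
            super(name, "Logs each URI as it passes through the chain.");
        }

        protected void innerProcess(CrawlURI curi) throws InterruptedException {
            // Runs once per URI when this processor's place in its chain is reached.
            System.out.println("Processing: " + curi);
        }
    }

Such a processor would then be added to one of the chains above through the crawl configuration.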

Numerous processors are provided with Heritrix in the following packages:
The org.archive.crawler.prefetch package contains processors run before the URI is fetched from the Internet.
The org.archive.crawler.fetcher package contains processors that fetch URIs from the Internet. Typically each processor handles a different protocol.
The org.archive.crawler.extractor package contains processors that perform link extraction on various document types.
The org.archive.crawler.writer package contains a processor that writes an ARC file with the fetched document.
The org.archive.crawler.postprocessor package contains processors that do wrap-up on the processing, reporting links back to the Frontier, etc.

Filter

Scope

Scopes are special filters that are applied to the crawl as a whole to define its scope. Any given crawl will employ exactly one scope object to define which URIs are considered 'within scope'.

Several implementations covering the most commonly desired scopes are provided (broad, domain, host, etc.). However, custom implementations can be created to define any arbitrary scope. It should be noted, though, that most limitations on the scope of a crawl can be achieved more easily by using one of the existing scopes and modifying it with appropriate filters.

CrawlScope is the base class for scopes.
The org.archive.crawler.scope package contains the provided scopes.
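As a sketch of what a custom scope might look like, the class below extends CrawlScope and accepts only URIs whose string form contains a fixed substring. The innerAccepts(Object) hook and the one-argument constructor are assumptions for illustration; in practice, combining an existing scope with filters is usually the simpler route.

    import org.archive.crawler.framework.CrawlScope;

    // Illustrative custom scope: accepts only URIs whose string form contains a
    // fixed substring. The innerAccepts(Object) hook and constructor used here
    // are assumptions; see CrawlScope and the provided scopes for the actual
    // extension points.
    public class SubstringScope extends CrawlScope {

        private static final String REQUIRED = "example.org"; // hypothetical constraint

        public SubstringScope(String name) {
            super(name);
        }

        protected boolean innerAccepts(Object o) {
            return o != null && o.toString().indexOf(REQUIRED) >= 0;
        }
    }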

Statistics Tracking

Any number of statistics tracking modules can be added to a crawl to gather run-time information about its progress.

These modules can interrogate the Frontier for whatever sparse data it exposes, but they can also subscribe to crawled URI disposition events to monitor the completion of each URI that is processed.

An interface for statistics tracking is provided as well as a partial implementation (AbstractTracker) that does much of the work common to most statistics tracking modules.

Furthermore, the admin package implements a statistics tracking module (StatisticsTracker) that generates a log of the crawler's progress as well as providing information that the UI uses. It also compiles end-of-crawl reports that contain all of the information it has gathered in the course of the crawl.
It is highly recommended that it always be used when running crawls via the UI.
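
For instance, a minimal statistics module could subscribe to the crawled URI disposition events mentioned above and simply count outcomes, as sketched below. The listener interface name and its per-disposition callbacks are assumptions for illustration; the actual event and registration API is part of the framework, and the provided AbstractTracker and StatisticsTracker remain the normal starting points.

    import org.archive.crawler.datamodel.CrawlURI;
    import org.archive.crawler.event.CrawlURIDispositionListener;

    // Sketch of a statistics module that counts URI dispositions. The listener
    // interface and its callback methods are assumptions for illustration; the
    // real event API is documented in the framework and admin packages.
    public class DispositionCounter implements CrawlURIDispositionListener {

        private long successes, failures, retries, disregards;

        public void crawledURISuccessful(CrawlURI curi) { successes++; }
        public void crawledURIFailure(CrawlURI curi)    { failures++; }
        public void crawledURINeedRetry(CrawlURI curi)  { retries++; }
        public void crawledURIDisregard(CrawlURI curi)  { disregards++; }

        public String report() {
            return successes + " ok, " + failures + " failed, "
                 + retries + " to retry, " + disregards + " disregarded";
        }
    }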



Copyright © 2003-2011 Internet Archive. All Rights Reserved.