Heritrix developer documentation

Internet Archive

Edited by

John Erik Halse

Gordon Mohr

Kristinn Sigurðsson

Michael Stack

Paul Jack

Table of Contents

1. Introduction
2. Obtaining and building Heritrix
2.1. Obtaining Heritrix
2.2. Building Heritrix
2.3. Running Heritrix
2.4. Eclipse
2.5. Integration self test
3. Coding conventions
3.1. Tightenings on the SUN conventions
3.2. Long versus int
3.3. Unit tests code in same package
3.4. Log Message Format
4. Overview of the crawler
4.1. The CrawlController
4.2. The Frontier
4.3. ToeThreads
4.4. Processors
5. Settings
5.1. Settings hierarchy
5.2. ComplexType hierarchy
6. Common needs for all configurable modules
6.1. Definition of a module
6.2. Accessing attributes
6.3. Putting together a simple module
7. Some notes on the URI classes
7.1. Supported Schemes (UnsupportedUriSchemeException)
7.2. The CrawlURI's Attribute list
7.3. The recorder streams
8. Writing a Frontier
9. Writing a Filter
10. Writing a Scope
11. Writing a Processor
11.1. Accessing and updating the CrawlURI
11.2. The HttpRecorder
11.3. An example processor
11.4. Things to keep in mind when writing a processor
12. Writing a Statistics Tracker
12.1. AbstractTracker
12.2. Provided StatisticsTracker
13. Internet Archive ARC files
13.1. ARC File Naming
13.2. Reading arc files
13.3. Writing arc files
13.4. Searching ARCS
A. Future changes in the API
1. The org.archive.util.HttpRecorder class
2. The Frontier's handling of dispositions
B. Version and Release Numbering
C. Making a Heritrix Release
D. Settings XML Schema
E. Profiling Heritrix