Heritrix developer documentation

Internet Archive

Edited by

John Erik Halse

Gordon Mohr

Kristinn Sigurðsson

Michael Stack

Paul Jack

Table of Contents

1. Introduction

2. Obtaining and building Heritrix

2.1. Obtaining Heritrix
2.2. Building Heritrix
2.3. Running Heritrix
2.4. Eclipse
2.5. Integration self test

3. Coding conventions

3.1. Tightenings on the SUN conventions
3.2. Long versus int
3.3. Unit tests code in same package
3.4. Log Message Format

4. Overview of the crawler

4.1. The CrawlController
4.2. The Frontier
4.3. ToeThreads
4.4. Processors

5. Settings

5.1. Settings hierarchy
5.2. ComplexType hierarchy

6. Common needs for all configurable modules

6.1. Definition of a module
6.2. Accessing attributes
6.3. Putting together a simple module

7. Some notes on the URI classes

7.1. Supported Schemes (UnsupportedUriSchemeException)
7.2. The CrawlURI's Attribute list
7.3. The recorder streams

8. Writing a Frontier

9. Writing a Filter

10. Writing a Scope

11. Writing a Processor

11.1. Accessing and updating the CrawlURI
11.2. The HttpRecorder
11.3. An example processor
11.4. Things to keep in mind when writing a processor

12. Writing a Statistics Tracker

12.1. AbstractTracker
12.2. Provided StatisticsTracker

13. Internet Archive ARC files

13.1. ARC File Naming
13.2. Reading arc files
13.3. Writing arc files
13.4. Searching ARCS

A. Future changes in the API

1. The org.archive.util.HttpRecorder class
2. The Frontier's handling of dispositions

B. Version and Release Numbering

C. Making a Heritrix Release

D. Settings XML Schema

E. Profiling Heritrix

Bibliography
