7. Some notes on the URI classes

URIs[1] in Heritrix are represented by several classes. The basic building block is org.archive.datamodel.UURI, which subclasses org.apache.commons.httpclient.URI. "UURI" is an abbreviation for "Usable URI." This class always normalizes and derelativizes URIs, so if a UURI instance is successfully created, the represented URI is, at least on its face, "usable": neither illegal nor carrying superficial variations that complicate later processing. It also provides methods for accessing the different parts of the URI.
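To make "normalizes and derelativizes" concrete, here is a minimal sketch of the idea. It uses the JDK's java.net.URI purely as a dependency-free stand-in; it is not how UURI is implemented, and the class name DerelativizeSketch and the example URLs are made up for illustration.

```java
import java.net.URI;

class DerelativizeSketch {
    /** Resolves a possibly relative reference against a base URI and removes dot segments. */
    static String derelativize(String base, String reference) {
        URI baseUri = URI.create(base);
        // resolve() makes the reference absolute; normalize() drops "." and ".." segments.
        return baseUri.resolve(reference).normalize().toString();
    }

    public static void main(String[] args) {
        // A relative link found on a page becomes a single absolute, tidy URI.
        System.out.println(derelativize("http://example.com/a/b/index.html", "../c/./d.html"));
    }
}
```

After this step, later processing only ever sees one absolute form of the link, which is the property the UURI class is meant to guarantee.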

We used to base everything on java.net.URI, but because of its bugs and its strict RFC 2396 conformance in the face of a real world that behaves otherwise, its role was taken over by UURI.
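As a small illustration of that strictness, the sketch below checks whether java.net.URI accepts a string as-is. The class name StrictParseSketch and the sample URLs are hypothetical; the point is only that a URL containing a raw space, which turns up regularly in links on real pages, is rejected outright.

```java
import java.net.URI;
import java.net.URISyntaxException;

class StrictParseSketch {
    /** Returns true if java.net.URI parses the string without complaint. */
    static boolean strictlyParses(String candidate) {
        try {
            new URI(candidate);
            return true;
        } catch (URISyntaxException e) {
            // Strict RFC 2396 parsing refuses characters such as an unescaped space.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(strictlyParses("http://example.com/index.html"));
        System.out.println(strictlyParses("http://example.com/my page.html"));
    }
}
```

A crawler that gave up on every such link would miss content that browsers reach without difficulty, hence the need for a more forgiving class.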

Two classes wrap the UURI in Heritrix:

CandidateURI

A URI, discovered or passed in, that may be scheduled (and thus become a CrawlURI). Contains just the fields necessary to perform quick in-scope analysis. This class wraps a UURI instance.

CrawlURI

Represents a candidate URI and the associated state it collects as it is crawled. CrawlURI is a subclass of CandidateURI. Instances of this class are what is fed to the processors.
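The relationship between the two wrappers can be sketched as follows. These are simplified, hypothetical stand-ins, not the Heritrix classes: java.net.URI takes the place of UURI, and the via and fetchStatus fields are assumptions chosen only to show where scoping data and crawl state would live.

```java
import java.net.URI;

/** Hypothetical stand-in for CandidateURI: the URI plus just enough to decide scope. */
class SketchCandidateUri {
    private final URI uri; // stands in for the wrapped UURI
    private final URI via; // hypothetical field: where this URI was discovered (may be null)

    SketchCandidateUri(URI uri, URI via) {
        this.uri = uri;
        this.via = via;
    }

    URI getUri() { return uri; }
    URI getVia() { return via; }
}

/** Hypothetical stand-in for CrawlURI: a candidate plus state gathered while crawling. */
class SketchCrawlUri extends SketchCandidateUri {
    private int fetchStatus; // hypothetical field; 0 means "not fetched yet"

    /** Promotes a scheduled candidate, carrying over what it already holds. */
    SketchCrawlUri(SketchCandidateUri candidate) {
        super(candidate.getUri(), candidate.getVia());
    }

    int getFetchStatus() { return fetchStatus; }
    void setFetchStatus(int fetchStatus) { this.fetchStatus = fetchStatus; }
}
```

Keeping the candidate small means many discovered URIs can be held and scoped cheaply; only those that are actually scheduled pay for the heavier crawl state.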

7.1. Supported Schemes (UnsupportedUriSchemeException)

A property in heritrix.properties named org.archive.crawler.datamodel.UURIFactory.schemes lists the supported schemes. Any scheme not listed there causes an UnsupportedUriSchemeException, which is reported in uri-errors.log with an 'Unsupported scheme' prefix. If you add a fetcher for a new scheme, you'll need to add that scheme to this list. (Later, Heritrix may be able to ask its fetchers which schemes they support.)
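The check described above might look roughly like the sketch below. Several things here are assumptions rather than Heritrix behavior: the property value is taken to be a comma-separated list, the values http, https and dns are only examples, and SchemeCheckSketch with its unchecked exception is a hypothetical stand-in for the real factory and for UnsupportedUriSchemeException.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

class SchemeCheckSketch {
    static final String SCHEMES_KEY = "org.archive.crawler.datamodel.UURIFactory.schemes";

    /** Hypothetical stand-in for UnsupportedUriSchemeException (unchecked here for brevity). */
    static class SketchUnsupportedSchemeException extends RuntimeException {
        SketchUnsupportedSchemeException(String message) { super(message); }
    }

    /** Reads the scheme list from properties text; a comma-separated value is assumed. */
    static Set<String> loadSchemes(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Set<String> schemes = new HashSet<>();
        for (String s : props.getProperty(SCHEMES_KEY, "").split(",")) {
            String trimmed = s.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) {
                schemes.add(trimmed);
            }
        }
        return schemes;
    }

    /** Returns true if the URI's scheme (the text before the first colon) is in the set. */
    static boolean isSupported(String uri, Set<String> supported) {
        int colon = uri.indexOf(':');
        String scheme = colon > 0 ? uri.substring(0, colon).toLowerCase(Locale.ROOT) : "";
        return supported.contains(scheme);
    }

    /** Throws for schemes outside the configured list, mirroring the logged prefix. */
    static void checkScheme(String uri, Set<String> supported) {
        if (!isSupported(uri, supported)) {
            throw new SketchUnsupportedSchemeException("Unsupported scheme: " + uri);
        }
    }
}
```

With a list of http, https and dns, a call such as checkScheme("ftp://example.com/", schemes) would throw until ftp is added to the property.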

7.2. The CrawlURI's Attribute list

The CrawlURI offers a flexible attribute list that is used to keep arbitrary information about the URI while crawling. If you are going to write a processor, you will almost certainly use the attribute list. The attribute list is a list of key/value pairs accessed through typed getters and setters. By convention the keys are picked from the CoreAttributeConstants interface, which all processors implement. If you use keys other than those listed in this interface, you must also add handling of those attributes to some other module.
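A key/value list with typed accessors can be sketched as below. This is a hypothetical stand-in, not the CrawlURI API: the class name, the method names, and the two key constants are assumptions made for illustration, and real keys would come from CoreAttributeConstants.

```java
import java.util.HashMap;
import java.util.Map;

class AttributeListSketch {
    // Hypothetical keys; a real processor would take these from CoreAttributeConstants.
    static final String A_EXAMPLE_NOTE = "example-note";
    static final String A_EXAMPLE_START_TIME = "example-start-time";

    private final Map<String, Object> attributes = new HashMap<>();

    void putString(String key, String value) { attributes.put(key, value); }

    /** Returns the stored string, or null if the key is absent. */
    String getString(String key) { return (String) attributes.get(key); }

    void putLong(String key, long value) { attributes.put(key, Long.valueOf(value)); }

    /** Returns the stored long; fails loudly if the key is absent or holds another type. */
    long getLong(String key) {
        Object value = attributes.get(key);
        if (!(value instanceof Long)) {
            throw new IllegalStateException("No long stored under key: " + key);
        }
        return (Long) value;
    }

    boolean containsKey(String key) { return attributes.containsKey(key); }
}
```

One processor might call putLong(A_EXAMPLE_START_TIME, System.currentTimeMillis()) and a later one read it back with getLong. Sharing the key constants through a common interface is what lets processors written independently agree on where the data lives.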

7.3. The recorder streams

The crawler wouldn't be of much use if it did not give access to the HTTP request and response. For this purpose the CrawlURI has the getHttpRecorder() [2] method. The recorder is referenced by the CrawlURI and is available to all the processors. See Section 11, “Writing a Processor” for an explanation of how to use the recorder.



[1] URIs (Uniform Resource Identifiers), defined by RFC 2396, are a way of identifying resources. The relationship between URI, URL and URN is described in the RFC: ["A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URI that identify resources via a representation of their primary access mechanism (e.g., their network "location"), rather than identifying the resource by name or by some other attribute(s) of that resource. The term "Uniform Resource Name" (URN) refers to the subset of URI that are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable."] Although Heritrix uses URIs, only URLs are supported at the moment. For more information on naming and addressing of resources, see: Naming and Addressing: URIs, URLs, ... on w3.org's website.

[2] This method will most likely change its name; see Section 1, “The org.archive.util.HTTPRecorder class” for details.