7. Release 1.12.0 - 2007-03-16

Abstract

Release 1.12.0 is the first of several planned releases enhancing Heritrix with "smart crawler" functionality. The theme of this release is new options for reducing the amount of duplicate content crawled and stored when recrawling sites at regular intervals. A number of other enhancements and bug fixes are also included.

7.1. Contributors

Aside from the usual suspects, the following contributed to this release:

  • Oskar Grenholm

  • Doug Judd

7.2. Notes

With this release, Heritrix project issue tracking has moved from SourceForge to a JIRA-based system at http://webteam.archive.org/jira/browse/HER .

Those using Heritrix in a Hadoop environment may be interested in Doug Judd's HDFSWriterProcessor, for storing crawled content directly into HDFS, the Hadoop Distributed FileSystem.

7.3. Known Limitations/Issues

7.3.1. java.io.IOException: No locks available

See Section 11.1.1, “java.io.IOException: No locks available” in 1.8.0 Release Notes.

7.3.2. Older Checkpoints

Checkpoints from earlier versions are generally not supported for resume in later versions.

7.3.3. Older configurations (order.xml, etc.)

Crawler configuration files from jobs in previous versions may work in 1.12.0, though missing new settings will be set to their default values, and obsolete old settings will generate log warnings. Re-creating configurations from defaults or hand-editing them to match newer files is recommended.

7.4. Changes

7.4.1. Duplication reduction features

A collection of Processors, including the FetchHistoryProcessor, PersistProcessor, and its subclasses, may be used together with new options on the FetchHTTP and writer processors to carry information forward between crawls and collect less duplicate content on later recrawls. The project wiki features notes on using the new duplication-reduction functionality.
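
The general idea of carrying information forward between crawls can be illustrated with a minimal, self-contained sketch. It is not Heritrix code and does not use the Heritrix API: the class DuplicateReductionSketch, its in-memory history map, and the shouldStore method are all hypothetical names chosen for illustration. The sketch assumes a content digest (SHA-1 here) is the piece of history being compared; the actual processors may rely on other information as well.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Heritrix.
public class DuplicateReductionSketch {
    // Stand-in for history persisted between crawls: URI -> digest of the
    // content seen on the previous visit.
    private final Map<String, String> history = new HashMap<>();

    // Hex-encoded SHA-1 digest of the fetched content.
    public static String sha1Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Records the new digest and returns true when the content is new or has
    // changed since the last visit, i.e. when it is worth storing again.
    public boolean shouldStore(String uri, byte[] content) {
        String digest = sha1Hex(content);
        String previous = history.put(uri, digest);
        return !digest.equals(previous);
    }

    public static void main(String[] args) {
        DuplicateReductionSketch sketch = new DuplicateReductionSketch();
        byte[] page = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        // First crawl: nothing in history, so the content is stored.
        System.out.println(sketch.shouldStore("http://example.com/", page)); // true
        // Recrawl with identical content: detected as a duplicate.
        System.out.println(sketch.shouldStore("http://example.com/", page)); // false
    }
}
```

In a real crawl the history would have to survive between jobs rather than live in memory; the map above only stands in for that persisted state.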

7.4.2. DecideRules have replaced Filters on Processors

All Processors that used internal Filters to act selectively on URIs now use DecideRules instead. Where a DecideRule replacement for a Filter is not yet available, a legacy Filter can be wrapped in a FilterDecideRule to preserve prior functionality. In a future release, all Filters will be removed in favor of equivalent DecideRules.
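
The wrapping approach described above is an instance of the adapter pattern. The following is a rough sketch of that pattern only, under stated assumptions: it models a legacy filter as a plain boolean predicate over a URI string and a rule as something that returns one of three decisions. The names FilterAdapterSketch, Decision, Rule, and wrap are hypothetical and are not the Heritrix classes; how the real FilterDecideRule maps a filter result to a decision is not specified here, so the sketch leaves both outcomes configurable.

```java
import java.util.function.Predicate;

// Hypothetical illustration of the adapter pattern; not the Heritrix API.
public class FilterAdapterSketch {
    // Simplified three-way outcome of a rule.
    public enum Decision { ACCEPT, REJECT, PASS }

    // Simplified rule interface: decide what to do with a URI.
    public interface Rule {
        Decision decide(String uri);
    }

    // Wraps a boolean legacy filter so it can be used wherever a Rule is
    // expected. The caller chooses the decision for a match and a non-match.
    public static Rule wrap(Predicate<String> legacyFilter,
                            Decision onMatch, Decision onMiss) {
        return uri -> legacyFilter.test(uri) ? onMatch : onMiss;
    }

    public static void main(String[] args) {
        // A legacy-style filter that matches PDF URIs.
        Predicate<String> pdfFilter = uri -> uri.endsWith(".pdf");
        // Reject matches; express no opinion on everything else.
        Rule rule = wrap(pdfFilter, Decision.REJECT, Decision.PASS);
        System.out.println(rule.decide("http://example.com/report.pdf")); // REJECT
        System.out.println(rule.decide("http://example.com/index.html")); // PASS
    }
}
```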

7.4.3. WARC

ExperimentalWARCWriter has been updated to match proposed WARC version "WARC/0.12" (revision H1.12-RC1). The implementation as of Heritrix 1.10.x remains for reference as org.archive.io.warc.v10.ExperimentalV10WARCWriterProcessor. The WARC format remains under discussion.

7.4.4. Kw3WriterProcessor

Oskar Grenholm of the Swedish National Library has contributed a module that writes the results of successful fetches to files on disk. These are MIME files of the type used by the Swedish National Library's Kulturarw3 web harvesting.

7.4.5. All tracked changes

A dynamic list of all tracked changes marked as fixed in 1.12.0 is available at: Issues with 'Fix Version' 1.12.0.