Release 1.12.0 is the first of several planned releases enhancing Heritrix with "smart crawler" functionality. The theme of this release is new options that reduce the amount of duplicate content crawled and stored when sites are recrawled at regular intervals. A number of other enhancements and bug fixes are also included.
Aside from the usual suspects, the following contributed to this release:
With this release, Heritrix project issue-tracking has moved from SourceForge to a JIRA-based system at http://webteam.archive.org/jira/browse/HER .
See Section 11.1.1, “java.io.IOException: No locks available” in 1.8.0 Release Notes.
Checkpoints from earlier versions are generally not supported for resume in later versions.
Crawler configuration files from jobs in previous versions may work in 1.12.0, though missing new settings will be set to their default values, and obsolete settings will generate log warnings. Re-creating configurations from defaults or hand-editing them to match newer files is recommended.
A collection of Processors, including the FetchHistoryProcessor, PersistProcessor, and its subclasses, may be used together with new options on the FetchHTTP and writer processors to carry information forward between crawls and collect less duplicate content on later recrawls. The project wiki features notes on using the new duplication-reduction functionality.
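The general idea of carrying information forward between crawls can be illustrated with a minimal, self-contained sketch. It is not the Heritrix implementation: the class RecrawlDedupSketch and its methods are hypothetical names invented here, and the sketch assumes that a content digest per URI is the only history kept. It remembers the digest seen for each URI and reports whether a newly fetched body differs from the one recorded earlier.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: none of these names are Heritrix classes. */
class RecrawlDedupSketch {

    /** Hypothetical per-URI history: the digest recorded at the previous fetch. */
    private final Map<String, String> lastDigestByUri = new HashMap<>();

    /** Returns the hex-encoded SHA-1 digest of the given bytes. */
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Records the digest of this fetch and reports whether the content is new
     * or changed relative to the previously recorded fetch of the same URI.
     */
    boolean shouldStore(String uri, byte[] content) {
        String digest = sha1Hex(content);
        String previous = lastDigestByUri.put(uri, digest);
        return !digest.equals(previous);
    }

    public static void main(String[] args) {
        RecrawlDedupSketch history = new RecrawlDedupSketch();
        byte[] v1 = "<html>v1</html>".getBytes(StandardCharsets.UTF_8);
        byte[] v2 = "<html>v2</html>".getBytes(StandardCharsets.UTF_8);
        String uri = "http://example.org/";

        System.out.println(history.shouldStore(uri, v1)); // first crawl: true
        System.out.println(history.shouldStore(uri, v1)); // unchanged recrawl: false
        System.out.println(history.shouldStore(uri, v2)); // changed recrawl: true
    }
}
```

A real crawler would persist this history between jobs rather than hold it in memory, and could also use it to avoid the download itself; the sketch only shows the store-or-skip decision.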
All Processors which used internal Filters for differentially acting on URIs now use DecideRules instead. In those cases where a DecideRule replacement for a Filter is not yet available, a legacy Filter can be wrapped in a FilterDecideRule to preserve prior functionality. In a future release, all Filters will be removed in favor of equivalent DecideRules.
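Wrapping a legacy yes/no filter so that it can be used where a rule is expected is an instance of the adapter pattern. The sketch below shows that pattern in isolation. The types LegacyFilter, Rule, Decision, and FilterBackedRule are hypothetical stand-ins written for this example, not the actual Heritrix Filter, DecideRule, or FilterDecideRule classes, and the mapping of non-matching URIs to PASS is an assumption made for illustration.

```java
/** Illustrative only: simplified stand-ins, not the Heritrix classes. */
class FilterAdapterSketch {

    /** Three-valued outcome: PASS means "no opinion, leave the decision to others". */
    enum Decision { ACCEPT, REJECT, PASS }

    /** Hypothetical legacy-style filter: a plain yes/no test on a URI. */
    interface LegacyFilter {
        boolean accepts(String uri);
    }

    /** Hypothetical rule interface that callers now expect. */
    interface Rule {
        Decision decisionFor(String uri);
    }

    /** Adapter that lets an existing filter be used wherever a Rule is required. */
    static final class FilterBackedRule implements Rule {
        private final LegacyFilter filter;
        private final Decision onMatch;

        FilterBackedRule(LegacyFilter filter, Decision onMatch) {
            this.filter = filter;
            this.onMatch = onMatch;
        }

        @Override
        public Decision decisionFor(String uri) {
            // Assumption: a URI the filter does not accept yields PASS.
            return filter.accepts(uri) ? onMatch : Decision.PASS;
        }
    }

    public static void main(String[] args) {
        LegacyFilter pdfFilter = uri -> uri.endsWith(".pdf");
        Rule rule = new FilterBackedRule(pdfFilter, Decision.REJECT);

        System.out.println(rule.decisionFor("http://example.org/a.pdf"));  // REJECT
        System.out.println(rule.decisionFor("http://example.org/a.html")); // PASS
    }
}
```

The appeal of this approach is that the existing filter logic stays untouched; only the thin wrapper needs to know about both interfaces.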
ExperimentalWARCWriter has been updated to match proposed WARC version "WARC/0.12" (revision H1.12-RC1). The implementation as of Heritrix 1.10.x remains for reference as org.archive.io.warc.v10.ExperimentalV10WARCWriterProcessor. The WARC format remains under discussion.
Oskar Grenholm of the Swedish National Library has contributed a module that writes the results of successful fetches to files on disk. These files are MIME-files of the type used by the Swedish National Library's Kulturarw3 web harvesting.
A dynamic list of all tracked changes marked as fixed in 1.12.0 is available at: Issues with 'Fix Version' 1.12.0.