Abstract
This is primarily a bug-fix release with a couple of new features. It precedes a number of significant changes to the Heritrix project that will require adjustments by developers and crawl operators. After 1.10.2, Heritrix source code control, issue tracking, and the build process will migrate to new systems. In addition, updates to core classes, especially the settings architecture, will break backward compatibility with crawler settings files and formats from 1.10.2 and earlier.
Olaf Freyer has contributed an HTML extractor named JerichoExtractorHTML, based on the Jericho HTML Parser. The following quote from the JerichoExtractorHTML class comment describes how the new extractor differs from ExtractorHTML, along with its advantages and downsides:
“This extractor extends ExtractorHTML and mimics its workflow - but has some substantial differences when it comes to internal implementation. Instead of heavily relying upon java regular expressions it uses a real html parser library - namely Jericho HTML Parser (http://jerichohtml.sourceforge.net). Using this parser it can better handle broken html (i.e. missing quotes) and also offer improved extraction of HTML form URLs (not only extract the action of a form, but also its default values). Unfortunately this parser also has one major drawback - it has to read the whole document into memory for parsing, thus has an inherent OOME risk. This OOME risk can be reduced/eliminated by limiting the size of documents to be parsed (i.e. using NotExceedsDocumentLengthTresholdDecideRule). Also note that this extractor seems to have a lower overall memory consumption compared to ExtractorHTML. (still to be confirmed on a larger scale crawl)”
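The size limit mentioned in the class comment can be illustrated with a minimal sketch. It does not use the Heritrix or Jericho APIs: the class `DocumentSizeGuard` and its method `accepts` are hypothetical names introduced only for illustration. In a real crawl the limit would be set through NotExceedsDocumentLengthTresholdDecideRule in the crawl settings. The sketch assumes that a document of unknown length (reported here as a negative value) should not be handed to a parser that buffers the whole document.

```java
/**
 * Hypothetical helper, not part of Heritrix or Jericho. It decides whether a
 * document is small enough to pass to a parser that reads it fully into memory.
 */
final class DocumentSizeGuard {
    // Largest document size, in bytes, that will be accepted for parsing.
    private final long maxBytes;

    DocumentSizeGuard(long maxBytes) {
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must be non-negative");
        }
        this.maxBytes = maxBytes;
    }

    /**
     * Returns true only when the length is known (non-negative) and does not
     * exceed the configured threshold. Unknown lengths are rejected, which is
     * an assumption of this sketch rather than documented Heritrix behavior.
     */
    boolean accepts(long contentLength) {
        return contentLength >= 0 && contentLength <= maxBytes;
    }
}
```

A caller would construct the guard once, for example `new DocumentSizeGuard(1024L * 1024L)` for an arbitrary 1 MiB limit, and skip in-memory parsing for any document where `accepts` returns false.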
Table 1. All Tracked Changes
ID | Type | Summary | Open Date | By | Filer |
---|---|---|---|---|---|
913002 | Add | Make ExtractorHTML aggressiveness configurable | 2004-03-09 | gojomo | gojomo |
1573708 | Add | [Contrib] JerichoExtractorHTML | 2006-10-09 | nobody | pandae |
1633458 | Add | [arcreader] Support for s3 and streaming improvements | 2007-01-11 | stack | stack |
1629242 | Fix | filehandle leak: ReplayInputStream/BufferedSeekInputStream | 2007-01-05 | karl-ia | gojomo |
1218961 | Fix | "failed get of replay" in ExtractorHTML... usu: UTF-16BE | 2005-06-11 | karl-ia | gojomo |
996161 | Fix | Fix DNSJava issues (memory) | 2004-07-22 | karl-ia | gojomo |
1477371 | Fix | ExtractorDOC wants whole doc in memory | 2006-04-26 | paul_jack | gojomo |
1618928 | Fix | Do not allow http:/ and https:/ urls | 2006-12-19 | stack-sf | stack-sf |
1596176 | Fix | NotMatchesListRegExpDecideRule extends wrong class | 2006-11-14 | nobody | pandae |
1593540 | Fix | NPE in quotaEnforcer.checkQuotas | 2006-11-09 | nobody | svc |
1587413 | Fix | [PATCH] Webapp doesn't find profiles and ignores jobsdir | 2006-10-30 | nobody | nobody |
1572391 | Fix | SURTs for IP-address URIs unhelpful | 2006-10-06 | gojomo | gojomo |
1501810 | Fix | NPE in FetchHTTP.saveCookies | 2006-06-06 | gojomo | stack-sf |
1633117 | Fix | Useragent compare because of case in RobotsExclusionPolicy | 2007-01-11 | stack-sf | stack-sf |