8. Release 1.10.2 - 2007-01-15

Abstract

This is primarily a bug-fix release, with a couple of new features, provided before a number of significant changes to the Heritrix project that will require developer and crawl operator adjustments. Post-1.10.2, Heritrix source code control, issue tracking, and build process will migrate to new systems. Also, updates to core classes, especially with regard to the settings architecture, will noticeably break backward compatibility with 1.10.2 and prior crawler settings files and formats.

8.1. Contributors

Olaf Freyer
Max Schöfmann

8.2. Changes

8.2.1. Jericho HTML Extractor

Olaf Freyer has contributed an HTML extractor named JerichoExtractorHTML, based on the Jericho HTML Parser. The following quote from the JerichoExtractorHTML class comment describes how the new extractor differs from ExtractorHTML, along with its advantages and downsides:

“This extractor extends ExtractorHTML and mimics its workflow - but has some substantial differences when it comes to internal implementation. Instead of heavily relying upon java regular expressions it uses a real html parser library - namely Jericho HTML Parser (http://jerichohtml.sourceforge.net). Using this parser it can better handle broken html (i.e. missing quotes) and also offer improved extraction of HTML form URLs (not only extract the action of a form, but also its default values). Unfortunately this parser also has one major drawback - it has to read the whole document into memory for parsing, thus has an inherent OOME risk. This OOME risk can be reduced/eliminated by limiting the size of documents to be parsed (i.e. using NotExceedsDocumentLengthTresholdDecideRule). Also note that this extractor seems to have a lower overall memory consumption compared to ExtractorHTML. (still to be confirmed on a larger scale crawl)”
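The idea behind limiting document size before whole-document parsing can be sketched in plain Java. This is only an illustration of the general technique, not Heritrix or Jericho code: the class BoundedDocumentReader, the method readIfWithinLimit, and the maxBytes parameter are hypothetical names introduced here. In a real crawl the limit would be applied through crawler configuration (the quote names NotExceedsDocumentLengthTresholdDecideRule) rather than hand-written code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Optional;

/**
 * Hypothetical helper (not part of Heritrix): buffers a document in memory
 * only if it fits within a caller-supplied byte limit.
 */
public class BoundedDocumentReader {

    /**
     * Reads the stream fully, or gives up as soon as more than maxBytes
     * have been seen. An empty result means "too large, skip parsing".
     */
    public static Optional<byte[]> readIfWithinLimit(InputStream in, int maxBytes)
            throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        long total = 0; // long, so the running count cannot overflow
        int n;
        while ((n = in.read(chunk)) != -1) {
            total += n;
            if (total > maxBytes) {
                // Stop early instead of holding an oversized document in memory.
                return Optional.empty();
            }
            buffer.write(chunk, 0, n);
        }
        return Optional.of(buffer.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        byte[] document = new byte[100];
        // Within the limit: the document is buffered and could be handed to a parser.
        System.out.println(readIfWithinLimit(new ByteArrayInputStream(document), 100).isPresent());
        // Over the limit: the document is skipped.
        System.out.println(readIfWithinLimit(new ByteArrayInputStream(document), 99).isPresent());
    }
}
```

A caller would pass the returned bytes to the parser only when a value is present, which bounds peak memory use per document at roughly the chosen limit.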

Table 1. All Tracked Changes

ID	Type	Summary	Open Date	By	Filer
913002	Add	Make ExtractorHTML aggressiveness configurable	2004-03-09	gojomo	gojomo
1573708	Add	[Contrib] JerichoExtractorHTML	2006-10-09	nobody	pandae
1633458	Add	[arcreader] Support for s3 and streaming improvements	2007-01-11	stack	stack
1629242	Fix	filehandle leak: ReplayInputStream/BufferedSeekInputStream	2007-01-05	karl-ia	gojomo
1218961	Fix	"failed get of replay" in ExtractorHTML... usu: UTF-16BE	2005-06-11	karl-ia	gojomo
996161	Fix	Fix DNSJava issues (memory)	2004-07-22	karl-ia	gojomo
1477371	Fix	ExtractorDOC wants whole doc in memory	2006-04-26	paul_jack	gojomo
1618928	Fix	Do not allow http:/ and https:/ urls	2006-12-19	stack-sf	stack-sf
1596176	Fix	NotMatchesListRegExpDecideRule extends wrong class	2006-11-14	nobody	pandae
1593540	Fix	NPE in quotaEnforcer.checkQuotas	2006-11-09	nobody	svc
1587413	Fix	[PATCH] Webapp doesn't find profiles and ignores jobsdir	2006-10-30	nobody	nobody
1572391	Fix	SURTs for IP-address URIs unhelpful	2006-10-06	gojomo	gojomo
1501810	Fix	NPE in FetchHTTP.saveCookies	2006-06-06	gojomo	stack-sf
1633117	Fix	Useragent compare because of case in RobotsExclusionPolicy	2007-01-11	stack-sf	stack-sf
