12. Release 1.6.0 - 2005-12-01

Abstract

Release 1.6.0 offers improved remote control and monitoring via JMX, a crawl-checkpointing facility, and experimental support for Bloom filter already-included testing, partitioning a crawl across multiple independent crawlers, and per-host/domain/queue-grouping collection quotas. Performance and stability in large crawls are also improved. Among tracked issues, it includes 39 requested enhancements and fixes 96 reported bugs.

12.1. Known Limitations/Issues

12.1.1. java.io.IOException: No locks available

BDB will complain 'No locks available' when the crawler is built or run on an NFS mount. The workaround is to avoid building or running on an NFS-mounted volume.

12.1.2. OutOfMemoryError in 64bit JVMs

BDB 2.0.90 can overgrow its intended cache size because it misestimates instance sizes under 64bit Java VMs, which may be a major contributor to early Heritrix OutOfMemoryError problems on 64bit systems. A workaround is to cut the assigned cache percentage by 1/3 to 1/2. For example, change the 'bdb-cache-percent' setting to '40' or '30' (instead of the default of 60%, which applies when no value is set here).
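The arithmetic behind the suggested values can be sketched as follows. This is only an illustration of the calculation: the class and method names are hypothetical and are not part of Heritrix or BDB. The only values taken from the text above are the default of 60 and the 1/3 and 1/2 reductions.

```java
// Illustrative sketch only: BdbCachePercent and reduceBy are hypothetical
// names, not Heritrix or BDB APIs.
final class BdbCachePercent {
    /** Default applied when 'bdb-cache-percent' is not set, per the note above. */
    static final int DEFAULT_PERCENT = 60;

    /**
     * Returns the percentage after cutting it by the fraction num/den.
     * The amount removed is truncated to a whole percent.
     */
    static int reduceBy(int percent, int num, int den) {
        return percent - (percent * num) / den;
    }

    public static void main(String[] args) {
        // Cutting the default by 1/3 and by 1/2 gives the two suggested settings.
        System.out.println("cut by 1/3: " + reduceBy(DEFAULT_PERCENT, 1, 3));
        System.out.println("cut by 1/2: " + reduceBy(DEFAULT_PERCENT, 1, 2));
    }
}
```

Starting from the default of 60, the two reductions give the '40' and '30' values mentioned above. A crawl that already uses a non-default percentage would apply the same fractions to its own starting value.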

12.2. Changes

12.2.1. Postselector

The Postselector has been refactored out of existence. Its responsibilities have been parcelled out to two new Processors: LinksScoper and FrontierScheduler. LinksScoper is responsible for scope checking of extracted links. FrontierScheduler does the scheduling of URIs with the Frontier.

This change was made to allow processors to be introduced between the scope-checking and Frontier-scheduling steps.

Because of this change, order files from Heritrix 1.4.0 or earlier must be updated before they can be used with Heritrix 1.6.0: Postselector references must be replaced by LinksScoper and FrontierScheduler references. (Referencing a non-existent Postselector in an order file usually shows up as a -50 fetch status in crawl.log.)
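Before reusing an older order file, it may help to check whether it still mentions Postselector. The following is a minimal sketch of such a check, not a Heritrix tool. The class name, method name, and command-line usage are hypothetical. It only reports matching lines and leaves the replacement to the operator, since the exact layout of the processor entries is not described here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: PostselectorCheck is a hypothetical helper,
// not part of Heritrix.
final class PostselectorCheck {
    /** Returns the 1-based numbers of lines that mention "Postselector". */
    static List<Integer> findPostselectorLines(String orderFileText) {
        List<Integer> hits = new ArrayList<>();
        String[] lines = orderFileText.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].contains("Postselector")) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PostselectorCheck <order-file>");
            return;
        }
        // Assumes the order file is readable as UTF-8 text.
        String text = new String(
                Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        List<Integer> hits = findPostselectorLines(text);
        if (hits.isEmpty()) {
            System.out.println("No Postselector references found.");
        } else {
            System.out.println("Postselector referenced on lines " + hits
                    + "; replace with LinksScoper and FrontierScheduler.");
        }
    }
}
```

A plain substring match is used deliberately: it will also flag comments or descriptions that mention Postselector, which is acceptable for a pre-flight check that a person reviews.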

12.2.2. Web Console

The layout and terminology of the web Console and header have been changed, and new readouts added. Most notably, "Crawler Status" and "Job Status" information has been moved to separate boxes, with the controls for each at the top of their respective boxes, near the current status information. Also, the "Crawling"/"Stopped" distinction in the crawler (whether available pending jobs are started when possible) has been renamed "Crawling Jobs"/"Holding Jobs" for clarity.

Table 5. All Tracked Changes

ID | Type | Summary | Open Date | By | Filer
806831 | Add | XMLExtractor (XML/RSS) | 2003-09-15 | gojomo | gojomo
983051 | Add | annotate what robots.txt would have precluded | 2004-06-30 | karl-ia | gojomo
1069331 | Add | hold paused crawl at 'end', allowing all in-progress ops | 2004-11-19 | karl-ia | gojomo
1081774 | Add | need way to delete overrides | 2004-12-08 | karl-ia | gojomo
1104696 | Add | Confusion: CrawlController and CrawlJob States | 2005-01-18 | nobody | stack-sf
1108006 | Add | alerts should show current processor | 2005-01-23 | gojomo | gojomo
1108520 | Add | SURT needs facelift | 2005-01-24 | gojomo | stack-sf
1119616 | Add | Decompose Postselector to Scoping and Scheduling components | 2005-02-09 | stack-sf | gojomo
1122692 | Add | [contribution] New fixed number of queues policy | 2005-02-14 | stack-sf | stack-sf
1173597 | Add | jmx api additions | 2005-03-30 | stack-sf | stack-sf
1176934 | Add | [contrib] Generalize/Refactor BDB Frontier | 2005-04-05 | stack-sf | ck-heritrix
1180630 | Add | [contrib] UI stacktrace dump (Depends on JDK150) | 2005-04-11 | stack-sf | ck-heritrix
1183376 | Add | Post 1.4 Deprecate filter scope and remove post 1.6. | 2005-04-14 | stack-sf | stack-sf
1190974 | Add | Quick resume without real recovery / Checkpointing | 2005-04-27 | karl-ia | ck-heritrix
1196602 | Add | [contrib] Show estimated remaining time | 2005-05-06 | stack-sf | ck-heritrix
1200205 | Add | add 'exhausted' queue count to frontier report | 2005-05-11 | gojomo | gojomo
1204644 | Add | add 'memory used' to progress-statistics.log | 2005-05-18 | kristinn_sig | gojomo
1205583 | Add | add CandidateURI parameter to UriUniqFilter.forget() | 2005-05-20 | stack-sf | ck-heritrix
1207866 | Add | [contrib] ThreadLocal-version of TextUtil.getMatcher | 2005-05-24 | gojomo | ck-heritrix
1207898 | Add | [contrib] WorkQueueFrontier: Store allQueues in RAM if poss. | 2005-05-24 | stack-sf | ck-heritrix
1208293 | Add | List based URIRegExprFilter | 2005-05-25 | kristinn_sig | kristinn_sig
1208510 | Add | [rfe-contrib] Add Stacktrace dump to ToeThread.report() | 2005-05-25 | stack-sf | ck-heritrix
1208747 | Add | CrawlURI serialization bloated; should be slimmed | 2005-05-25 | gojomo | gojomo
1208757 | Add | Cookies are thread traffic jam and memory hog | 2005-05-25 | gojomo | stack-sf
1208770 | Add | garbage hot spot: SerialBinding & FastOutputStream.bump() | 2005-05-25 | gojomo | gojomo
1211217 | Add | [contrib] Add debugging aid for BDB RuntimeExceptionWrapper | 2005-05-30 | stack-sf | ck-heritrix
1217854 | Add | seed report of redirect should show where to | 2005-06-09 | karl-ia | gojomo
1222764 | Add | Rotation of crawl logs | 2005-06-17 | karl-ia | stack-sf
1223840 | Add | BdbWorkQueue origins should be based on full classKey | 2005-06-19 | gojomo | gojomo
1225597 | Add | Expose the bdb je 2.0 jmx interface | 2005-06-22 | karl-ia | stack-sf
1225729 | Add | new alreadyIncluded option: Bloom filter based | 2005-06-22 | gojomo | gojomo
1254560 | Add | Add to queue-assignment-policy without compile | 2005-08-08 | karl-ia | stack-sf
1260360 | Add | More than one Heritrix instance in a JVM instance | 2005-08-15 | karl-ia | stack-sf
1261506 | Add | Multimachine Crawl Splitter Processor | 2005-08-16 | gojomo | stack-sf
1262665 | Add | add dummy items at heads of BDB queues for performance | 2005-08-17 | gojomo | gojomo
1302182 | Add | IP geolocation based scoping | 2005-09-23 | karl-ia | gojomo
1302208 | Add | per-crawler 'load' summary numbers | 2005-09-23 | karl-ia | gojomo
1325123 | Add | Checkpointing fixes/improvements | 2005-10-12 | stack-sf | stack-sf
1329725 | Add | [uuri] When 'generous' mode, don't encode curly-brackets | 2005-10-18 | stack-sf | stack-sf
965622 | Fix | Serious error during crawling did not produce an alert | 2004-06-03 | gojomo | kristinn_sig
1000338 | Fix | [UURI] escaped absolute path not valid | 2004-07-29 | nobody | stack-sf
1002356 | Fix | timing issue on crawl-start & "run time" stat | 2004-08-02 | gojomo | gojomo
1059126 | Fix | Other than default seeds path is not used | 2004-11-02 | nobody | stack-sf
1060517 | Fix | hard-coding of job dir in state.job makes moves awkward | 2004-11-04 | nobody | stack-sf
1062727 | Fix | angle-brackets in URIs thwart frontier report | 2004-11-08 | kristinn_sig | gojomo
1065413 | Fix | [uuri] '$' in path gets scheduled, spawns queueing error | 2004-11-12 | gojomo | stack-sf
1080926 | Fix | reducing max-toe-threads has no effect | 2004-12-07 | gojomo | gojomo
1083427 | Fix | Text incorrect in WUI when create new profile | 2004-12-11 | nobody | zhousp
1090564 | Fix | max-trans-hops=0 generates -63 in crawl.log | 2004-12-23 | gojomo | stack-sf
1090916 | Fix | RIS still open; ThreadLocalConnectionManager async close | 2004-12-24 | karl-ia | stack-sf
1113410 | Fix | NPE readalert_jsp._jspService | 2005-01-31 | gojomo | stack-sf
1116456 | Fix | ARCWriter length is wrong (But coherent gzip record) | 2005-02-04 | karl-ia | stack-sf
1119644 | Fix | frontier report ConcurrentModificationException | 2005-02-09 | gojomo | frodobay
1122836 | Fix | Localize StackOverflowError in Extractors | 2005-02-14 | nobody | gojomo
1181892 | Fix | Aggressive extraction of `for' attributes | 2005-04-12 | karl-ia | ia_igor
1187973 | Fix | NPE in FetchDNS, caused by UURI | 2005-04-22 | nobody | ck-heritrix
1192029 | Fix | OOME guard against pages of thousands of links | 2005-04-28 | gojomo | stack-sf
1195312 | Fix | WaitEvaluators accidentally removed from Processors.options | 2005-05-04 | nobody | stack-sf
1196594 | Fix | minor CSS typo, wrong text font. | 2005-05-06 | nobody | ck-heritrix
1196630 | Fix | Build w/ 150 jdk won't run under 14x. | 2005-05-06 | stack-sf | stack-sf
1200957 | Fix | Web UI recovery mangles paths | 2005-05-12 | karl-ia | karl-ia
1203235 | Fix | can't change cost policy mid-crawl | 2005-05-16 | gojomo | gojomo
1203588 | Fix | BdbFrontier, serious exception - LatchNotHeldException | 2005-05-17 | karl-ia | kristinn_sig
1203958 | Fix | JMX remote command 'stop' leaves zombie crawler | 2005-05-17 | stack-sf | karl-ia
1204643 | Fix | uri-errors.log: old timestamps, too many errors | 2005-05-18 | gojomo | gojomo
1204667 | Fix | 'custom' robots policy doesn't work | 2005-05-18 | gojomo | gojomo
1204931 | Fix | NPE when viewing crawl report | 2005-05-19 | karl-ia | kristinn_sig
1207320 | Fix | [arcwriter] Record w/ empty body on OOME | 2005-05-23 | stack-sf | stack-sf
1207378 | Fix | seeds listed without scheme, but with path, being ignored | 2005-05-23 | karl-ia | ia_igor
1208804 | Fix | CachedBdbMap NPE killing off threads. | 2005-05-25 | karl-ia | stack-sf
1209046 | Fix | Failed URIs should be 'free' (no cost against queue budget) | 2005-05-26 | gojomo | kristinn_sig
1209665 | Fix | ReplayCharSequenceFactory: Unexpected response body offset | 2005-05-27 | karl-ia | ck-heritrix
1212377 | Fix | URIException in deserialization, post CrawlURI slimming | 2005-05-31 | gojomo | gojomo
1213095 | Fix | UURI handling of inconsistent escaping makes broken instance | 2005-06-01 | karl-ia | gojomo
1214478 | Fix | ThreadLocalHttpConnectionManager starts a non-daemon Thread | 2005-06-03 | nobody | ck-heritrix
1216633 | Fix | recovery fills heritrix_out with "Relative URI but no base.. | 2005-06-07 | gojomo | gojomo
1217290 | Fix | Non-canonical seed URLs need better reporting | 2005-06-08 | karl-ia | karl-ia
1218019 | Fix | GzippedInputStream class is not thread-safe | 2005-06-10 | nobody | ck-heritrix
1218037 | Fix | CookieSpec interface modification breaks IgnoreCookiesSpec | 2005-06-10 | gojomo | ck-heritrix
1218283 | Fix | ARCReader: Bad URI escaping for tab character | 2005-06-10 | nobody | ck-heritrix
1218958 | Fix | odd preencoded URIs generate error on deserialization | 2005-06-11 | nobody | gojomo
1219259 | Fix | broad crawls slow; most threads stuck retrying missing sites | 2005-06-12 | gojomo | gojomo
1219262 | Fix | 'treat seed redirects as new seeds' not working | 2005-06-12 | karl-ia | gojomo
1219486 | Fix | no rule for decidingscope to always crawl seeds | 2005-06-12 | gojomo | gojomo
1219715 | Fix | [patch] Signature change broke BucketQueueAssignmentPolicy | 2005-06-13 | nobody | ck-heritrix
1219854 | Fix | NPE je-2.0 entryToObject(SerialBinding.java:82) | 2005-06-13 | karl-ia | stack-sf
1220714 | Fix | ExtractorHTML excessive temp strings / OOM | 2005-06-14 | karl-ia | gojomo
1221570 | Fix | reports (web ui and to disk) don't scale | 2005-06-15 | gojomo | gojomo
1222229 | Fix | unicode/idn domain names fail (seeds and more?)- punycode | 2005-06-16 | karl-ia | gojomo
1222360 | Fix | strip-www canon. rule causes failed crawl of netarkivet.dk | 2005-06-16 | karl-ia | stack-sf
1224531 | Fix | '@' in URI path confuses SURT (bad queues, scope probs, etc) | 2005-06-20 | karl-ia | gojomo
1226365 | Fix | DomainSensitiveFrontier broken by uri-included-structure | 2005-06-23 | karl-ia | stack-sf
1226387 | Fix | BdbUriUniqFilter URI fingerprint somewhat collision prone | 2005-06-23 | karl-ia | gojomo
1226707 | Fix | CandidateURI serialization 'decodes' UURI | 2005-06-23 | karl-ia | stack-sf
1230180 | Fix | nonstandard port URIs fail with '-50': Surt queue policy bug | 2005-06-30 | karl-ia | gojomo
1230188 | Fix | DNS prereq problems (-50 fails/repeats): calculateInsertKey | 2005-06-30 | karl-ia | gojomo
1231123 | Fix | IdentityCachingMapTest FAILED: cache not cleared appropriate | 2005-07-01 | gojomo | stack-sf
1232402 | Fix | IdentityCachingMapTest fails on fedora | 2005-07-04 | nobody | stack-sf
1232974 | Fix | WorkQueueFrontier.kickUpdate ClassCastException / unretiring | 2005-07-05 | karl-ia | gojomo
1236094 | Fix | Possible deadlock situation with ARFrontier | 2005-07-11 | kristinn_sig | kristinn_sig
1236334 | Fix | Cannot set cachePercentage in bdbje JMX bean | 2005-07-11 | stack-sf | stack-sf
1236635 | Fix | Link ClassCastException | 2005-07-12 | kristinn_sig | kristinn_sig
1239155 | Fix | [arcreader] Fails on records that only have headers | 2005-07-15 | karl-ia | stack-sf
1241851 | Fix | Connection reset error with WebSTAR/3.0.2 web serve | 2005-07-20 | karl-ia | ia_igor
1242747 | Fix | over-escaping (of '%', etc) compared to browsers | 2005-07-21 | karl-ia | gojomo
1248942 | Fix | custom robots.txt NPE in 1.4.0 | 2005-07-31 | gojomo | efc
1249828 | Fix | -5000 out-of-scope preconditions; -50 failure | 2005-08-01 | karl-ia | gojomo
1250437 | Fix | host with trailing '.' in seeds ruins implied SURT | 2005-08-02 | gojomo | gojomo
1255137 | Fix | Deferred URI should be 'free' (no cost against queue budget) | 2005-08-09 | karl-ia | gojomo
1257157 | Fix | create job: empty seeds box, broken settings tab | 2005-08-11 | gojomo | gojomo
1266713 | Fix | already-seen test not working (1.4.0); canonicalization | 2005-08-22 | nobody | stack-sf
1276044 | Fix | Prune classes deprecated in 1.4.0 | 2005-08-29 | nobody | stack-sf
1276201 | Fix | Notification AFTER reports have been created | 2005-08-29 | stack-sf | stack-sf
1283492 | Fix | mimetype-reports squashes together url-count and byte-count | 2005-09-06 | gojomo | stack-sf
1284353 | Fix | Help links to user and dev manuals 404 | 2005-09-07 | stack-sf | stack-sf
1291274 | Fix | StatisticsTracker writes 'not set'. Crawl.log says 'no-type' | 2005-09-14 | karl-ia | stack-sf
1291305 | Fix | mime type report not consistent with crawl.log | 2005-09-14 | ia_igor | ia_igor
1306421 | Fix | ExtractorUniversal skips URI with '_' | 2005-09-27 | gojomo | gojomo
1323281 | Fix | StringIndexOutOfBounds in SurtAuthorityQueueAssignmentPolicy | 2005-10-10 | gojomo | gojomo
1323287 | Fix | BDB ArrayIndexOutOfBoundsException; corrupt queue | 2005-10-10 | karl-ia | gojomo
1323323 | Fix | IDNAException: String too long / fixupDomainLabel in stdout | 2005-10-10 | karl-ia | gojomo
1323373 | Fix | NPE in CrawlJobHandler.startNextJobInternal | 2005-10-10 | karl-ia | gojomo
1324245 | Fix | implied-HTTP seeds with port numbers get -5000 errs | 2005-10-11 | karl-ia | gojomo
1325304 | Fix | 'expert' view toggle does inadvertent submit | 2005-10-12 | karl-ia | gojomo
1332722 | Fix | time-based averages wrong after checkpoint | 2005-10-19 | karl-ia | gojomo
1333669 | Fix | delete function in "View or Edit Frontier URIs" Unsupported | 2005-10-20 | karl-ia | stack-sf
1344265 | Fix | Laxuri code encodes tildes. | 2005-10-31 | karl-ia | stack-sf
1351818 | Fix | Investigate spate of recent OOMEs | 2005-11-08 | gojomo | stack-sf
1357528 | Fix | Double JMX registration problem on remote creation | 2005-11-15 | stack-sf | stack-sf
1358542 | Fix | DevUtils.extraInfo shows 2 legend lines, no content | 2005-11-16 | gojomo | gojomo
1358567 | Fix | transient JSP NPE after starting job | 2005-11-16 | stack-sf | gojomo
1369177 | Fix | NPE in quotaEnforcer, if hostname for URI can't be resolved | 2005-11-29 | nobody | svc
1369619 | Fix | NPE in QuotaEnforcer | 2005-11-29 | gojomo | gojomo
1370743 | Fix | properties-based supported/ignored scheme setting broken | 2005-12-01 | gojomo | gojomo
1370761 | Fix | ArithmeticException: / by 0 WorkQueueFrontier.averageDepth | 2005-12-01 | gojomo | gojomo
1325230 | Fix | jmx import of seeds not working | 2005-10-12 14:29 | karl-ia | stack-sf
1324989 | Fix | Queue counts wrong after checkpointing | 2005-10-12 09:10 | karl-ia | stack-sf
1123230 | Fix | Out Of Memory after creating mutliple jobs | 2005-02-15 07:51 | karl-ia | nobody
1322280 | Fix | "Failed getPath" alerts (from RobotsExclusionPolicy) | 2005-10-10 02:42 | karl-ia | gojomo
1322264 | Fix | NumberFormatException in FetchHTTP.innerProcess | 2005-10-10 02:34 | karl-ia | gojomo