Abstract
Release 1.6.0 offers improved remote control and monitoring via JMX, a crawl-checkpointing facility, and experimental support for bloom filter already-included testing, partitioning a crawl across multiple independent crawlers, and per-host/domain/queue-grouping collection quotas. Performance and stability in large crawls are also improved. Among tracked issues, it includes 39 requested enhancements and fixes 96 reported bugs.
BDB will complain 'No locks available' when the crawler is built or run on an NFS mount. The workaround is to avoid building or running on an NFS-mounted volume.
BDB 2.0.90 can overgrow its intended cache size due to a misestimation of instance sizes under 64bit Java VMs, which may be a major contributor to early Heritrix OutOfMemoryError problems on 64bit systems. A workaround is to cut the assigned percentage by 1/3 to 1/2. For example, change the 'bdb-cache-percent' setting to '40' or '30' (instead of the default 60% when no value is set here).
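The suggested values follow from simple arithmetic on the 60% default. A minimal sketch of that calculation is shown here; the function name `reduced_cache_percent` is hypothetical and is not part of Heritrix or BDB.

```python
from fractions import Fraction

# Default stated above for when 'bdb-cache-percent' is not set.
DEFAULT_BDB_CACHE_PERCENT = 60


def reduced_cache_percent(cut, default=DEFAULT_BDB_CACHE_PERCENT):
    """Return the cache percentage left after cutting `default` by the fraction `cut`."""
    return int(default * (1 - Fraction(cut)))


# The two ends of the suggested range: cut by one third and by one half.
for cut in (Fraction(1, 3), Fraction(1, 2)):
    print(f"cut by {cut}: bdb-cache-percent = {reduced_cache_percent(cut)}")
```

Any value between the two results stays within the suggested range; which one works best for a given 64-bit heap is something to confirm by observation.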
The Postselector has been refactored out of existence. Its responsibilities have been parcelled out to two new Processors: LinksScoper and FrontierScheduler. LinksScoper is responsible for scope checking of extracted links. FrontierScheduler does the scheduling of URIs with the Frontier.
This change was made so that processors can be introduced between the scope-checking and Frontier-scheduling steps.
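The benefit of the split can be illustrated with a toy model. The sketch below is not Heritrix code and does not use the Heritrix Processor API; the `Link` class, the function names, and the `tagger` step are all hypothetical stand-ins showing how a separate scoping stage and scheduling stage leave room for a step in between.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    """Toy stand-in for an extracted link (not a Heritrix class)."""
    uri: str
    in_scope: bool = False
    notes: list = field(default_factory=list)


def links_scoper(links, allowed_prefix):
    """Scope check: keep only links that start with the allowed prefix."""
    kept = []
    for link in links:
        link.in_scope = link.uri.startswith(allowed_prefix)
        if link.in_scope:
            kept.append(link)
    return kept


def tagger(links):
    """Hypothetical extra step slotted in between scoping and scheduling."""
    for link in links:
        link.notes.append("seen-by-tagger")
    return links


def frontier_scheduler(links, frontier):
    """Scheduling: hand the remaining links to the (toy) frontier list."""
    frontier.extend(link.uri for link in links)
    return links


frontier = []
extracted = [Link("http://example.com/a"), Link("http://other.example/b")]
scoped = links_scoper(extracted, "http://example.com/")
frontier_scheduler(tagger(scoped), frontier)
print(frontier)
```

With a single combined step, there would be no point at which `tagger` could see links that had passed scoping but had not yet been scheduled.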
Because of this change, order files from Heritrix 1.4.0 or earlier must be updated before they can be used with Heritrix 1.6.0: replace Postselector references with LinksScoper and FrontierScheduler references. Referencing a non-existent Postselector in an order file usually shows up as a -50 fetch status in crawl.log.
The layout and terminology of the web Console and header have been changed, and new readouts have been added. Most notably, the "Crawler Status" and "Job Status" information has been moved to separate boxes, with the controls for each at the top of its box, near the current status information. Also, the "Crawling"/"Stopped" distinction in the crawler, which determines whether available pending jobs are started when possible, has been renamed "Crawling Jobs"/"Holding Jobs" for clarity.
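Before reusing an older order file, it may help to check it for leftover references. The following is a rough sketch rather than a Heritrix tool: the function names are hypothetical, the sample fragments are made up, and the crawl.log check assumes the fetch status is the second whitespace-separated field on each line.

```python
import re

# Matches the bare class name, with or without a package prefix in front of it.
STALE_PROCESSOR = re.compile(r"\bPostselector\b")


def find_stale_references(order_text):
    """Return (line_number, line) pairs in an order file that mention Postselector."""
    return [
        (number, line.strip())
        for number, line in enumerate(order_text.splitlines(), start=1)
        if STALE_PROCESSOR.search(line)
    ]


def count_minus_50(crawl_log_text):
    """Count log lines whose second field is -50.

    Assumption: the fetch status is the second whitespace-separated field.
    """
    count = 0
    for line in crawl_log_text.splitlines():
        fields = line.split()
        if len(fields) > 1 and fields[1] == "-50":
            count += 1
    return count


# Made-up fragments for illustration only, not taken from a real job.
sample_order = """\
<newObject name="Postselector" class="example.Postselector">
<newObject name="Archiver" class="example.ARCWriterProcessor">
"""
sample_log = """\
2005-12-01T00:00:00.000Z -50 - http://example.com/ - -
2005-12-01T00:00:01.000Z 200 1234 http://example.org/ - -
"""

print(find_stale_references(sample_order))
print(count_minus_50(sample_log))
```

A non-empty result from the first check means the order file still needs editing. A non-zero count from the second is only a hint, since -50 can have other causes.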
Table 5. All Tracked Changes
ID | Type | Summary | Open Date | By | Filer |
---|---|---|---|---|---|
806831 | Add | XMLExtractor (XML/RSS) | 2003-09-15 | gojomo | gojomo |
983051 | Add | annotate what robots.txt would have precluded | 2004-06-30 | karl-ia | gojomo |
1069331 | Add | hold paused crawl at 'end', allowing all in-progress ops | 2004-11-19 | karl-ia | gojomo |
1081774 | Add | need way to delete overrides | 2004-12-08 | karl-ia | gojomo |
1104696 | Add | Confusion: CrawlController and CrawlJob States | 2005-01-18 | nobody | stack-sf |
1108006 | Add | alerts should show current processor | 2005-01-23 | gojomo | gojomo |
1108520 | Add | SURT needs facelift | 2005-01-24 | gojomo | stack-sf |
1119616 | Add | Decompose Postselector to Scoping and Scheduling components | 2005-02-09 | stack-sf | gojomo |
1122692 | Add | [contribution] New fixed number of queues policy | 2005-02-14 | stack-sf | stack-sf |
1173597 | Add | jmx api additions | 2005-03-30 | stack-sf | stack-sf |
1176934 | Add | [contrib] Generalize/Refactor BDB Frontier | 2005-04-05 | stack-sf | ck-heritrix |
1180630 | Add | [contrib] UI stacktrace dump (Depends on JDK150) | 2005-04-11 | stack-sf | ck-heritrix |
1183376 | Add | Post 1.4 Deprecate filter scope and remove post 1.6. | 2005-04-14 | stack-sf | stack-sf |
1190974 | Add | Quick resume without real recovery / Checkpointing | 2005-04-27 | karl-ia | ck-heritrix |
1196602 | Add | [contrib] Show estimated remaining time | 2005-05-06 | stack-sf | ck-heritrix |
1200205 | Add | add 'exhausted' queue count to frontier report | 2005-05-11 | gojomo | gojomo |
1204644 | Add | add 'memory used' to progress-statistics.log | 2005-05-18 | kristinn_sig | gojomo |
1205583 | Add | add CandidateURI parameter to UriUniqFilter.forget() | 2005-05-20 | stack-sf | ck-heritrix |
1207866 | Add | [contrib] ThreadLocal-version of TextUtil.getMatcher | 2005-05-24 | gojomo | ck-heritrix |
1207898 | Add | [contrib] WorkQueueFrontier: Store allQueues in RAM if poss. | 2005-05-24 | stack-sf | ck-heritrix |
1208293 | Add | List based URIRegExprFilter | 2005-05-25 | kristinn_sig | kristinn_sig |
1208510 | Add | [rfe-contrib] Add Stacktrace dump to ToeThread.report() | 2005-05-25 | stack-sf | ck-heritrix |
1208747 | Add | CrawlURI serialization bloated; should be slimmed | 2005-05-25 | gojomo | gojomo |
1208757 | Add | Cookies are thread traffic jam and memory hog | 2005-05-25 | gojomo | stack-sf |
1208770 | Add | garbage hot spot: SerialBinding & FastOutputStream.bump() | 2005-05-25 | gojomo | gojomo |
1211217 | Add | [contrib] Add debugging aid for BDB RuntimeExceptionWrapper | 2005-05-30 | stack-sf | ck-heritrix |
1217854 | Add | seed report of redirect should show where to | 2005-06-09 | karl-ia | gojomo |
1222764 | Add | Rotation of crawl logs | 2005-06-17 | karl-ia | stack-sf |
1223840 | Add | BdbWorkQueue origins should be based on full classKey | 2005-06-19 | gojomo | gojomo |
1225597 | Add | Expose the bdb je 2.0 jmx interface | 2005-06-22 | karl-ia | stack-sf |
1225729 | Add | new alreadyIncluded option: Bloom filter based | 2005-06-22 | gojomo | gojomo |
1254560 | Add | Add to queue-assignment-policy without compile | 2005-08-08 | karl-ia | stack-sf |
1260360 | Add | More than one Heritrix instance in a JVM instance | 2005-08-15 | karl-ia | stack-sf |
1261506 | Add | Multimachine Crawl Splitter Processor | 2005-08-16 | gojomo | stack-sf |
1262665 | Add | add dummy items at heads of BDB queues for performance | 2005-08-17 | gojomo | gojomo |
1302182 | Add | IP geolocation based scoping | 2005-09-23 | karl-ia | gojomo |
1302208 | Add | per-crawler 'load' summary numbers | 2005-09-23 | karl-ia | gojomo |
1325123 | Add | Checkpointing fixes/improvements | 2005-10-12 | stack-sf | stack-sf |
1329725 | Add | [uuri] When 'generous' mode, don't encode curly-brackets | 2005-10-18 | stack-sf | stack-sf |
965622 | Fix | Serious error during crawling did not produce an alert | 2004-06-03 | gojomo | kristinn_sig |
1000338 | Fix | [UURI] escaped absolute path not valid | 2004-07-29 | nobody | stack-sf |
1002356 | Fix | timing issue on crawl-start & "run time" stat | 2004-08-02 | gojomo | gojomo |
1059126 | Fix | Other than default seeds path is not used | 2004-11-02 | nobody | stack-sf |
1060517 | Fix | hard-coding of job dir in state.job makes moves awkward | 2004-11-04 | nobody | stack-sf |
1062727 | Fix | angle-brackets in URIs thwart frontier report | 2004-11-08 | kristinn_sig | gojomo |
1065413 | Fix | [uuri] '$' in path gets scheduled, spawns queueing error | 2004-11-12 | gojomo | stack-sf |
1080926 | Fix | reducing max-toe-threads has no effect | 2004-12-07 | gojomo | gojomo |
1083427 | Fix | Text incorrect in WUI when create new profile | 2004-12-11 | nobody | zhousp |
1090564 | Fix | max-trans-hops=0 generates -63 in crawl.log | 2004-12-23 | gojomo | stack-sf |
1090916 | Fix | RIS still open; ThreadLocalConnectionManager async close | 2004-12-24 | karl-ia | stack-sf |
1113410 | Fix | NPE readalert_jsp._jspService | 2005-01-31 | gojomo | stack-sf |
1116456 | Fix | ARCWriter length is wrong (But coherent gzip record) | 2005-02-04 | karl-ia | stack-sf |
1119644 | Fix | frontier report ConcurrentModificationException | 2005-02-09 | gojomo | frodobay |
1122836 | Fix | Localize StackOverflowError in Extractors | 2005-02-14 | nobody | gojomo |
1181892 | Fix | Aggressive extraction of `for' attributes | 2005-04-12 | karl-ia | ia_igor |
1187973 | Fix | NPE in FetchDNS, caused by UURI | 2005-04-22 | nobody | ck-heritrix |
1192029 | Fix | OOME guard against pages of thousands of links | 2005-04-28 | gojomo | stack-sf |
1195312 | Fix | WaitEvaluators accidentally removed from Processors.options | 2005-05-04 | nobody | stack-sf |
1196594 | Fix | minor CSS typo, wrong text font. | 2005-05-06 | nobody | ck-heritrix |
1196630 | Fix | Build w/ 150 jdk won't run under 14x. | 2005-05-06 | stack-sf | stack-sf |
1200957 | Fix | Web UI recovery mangles paths | 2005-05-12 | karl-ia | karl-ia |
1203235 | Fix | can't change cost policy mid-crawl | 2005-05-16 | gojomo | gojomo |
1203588 | Fix | BdbFrontier, serious exception - LatchNotHeldException | 2005-05-17 | karl-ia | kristinn_sig |
1203958 | Fix | JMX remote command 'stop' leaves zombie crawler | 2005-05-17 | stack-sf | karl-ia |
1204643 | Fix | uri-errors.log: old timestamps, too many errors | 2005-05-18 | gojomo | gojomo |
1204667 | Fix | 'custom' robots policy doesn't work | 2005-05-18 | gojomo | gojomo |
1204931 | Fix | NPE when viewing crawl report | 2005-05-19 | karl-ia | kristinn_sig |
1207320 | Fix | [arcwriter] Record w/ empty body on OOME | 2005-05-23 | stack-sf | stack-sf |
1207378 | Fix | seeds listed without scheme, but with path, being ignored | 2005-05-23 | karl-ia | ia_igor |
1208804 | Fix | CachedBdbMap NPE killing off threads. | 2005-05-25 | karl-ia | stack-sf |
1209046 | Fix | Failed URIs should be 'free' (no cost against queue budget) | 2005-05-26 | gojomo | kristinn_sig |
1209665 | Fix | ReplayCharSequenceFactory: Unexpected response body offset | 2005-05-27 | karl-ia | ck-heritrix |
1212377 | Fix | URIException in deserialization, post CrawlURI slimming | 2005-05-31 | gojomo | gojomo |
1213095 | Fix | UURI handling of inconsistent escaping makes broken instance | 2005-06-01 | karl-ia | gojomo |
1214478 | Fix | ThreadLocalHttpConnectionManager starts a non-daemon Thread | 2005-06-03 | nobody | ck-heritrix |
1216633 | Fix | recovery fills heritrix_out with "Relative URI but no base.. | 2005-06-07 | gojomo | gojomo |
1217290 | Fix | Non-canonical seed URLs need better reporting | 2005-06-08 | karl-ia | karl-ia |
1218019 | Fix | GzippedInputStream class is not thread-safe | 2005-06-10 | nobody | ck-heritrix |
1218037 | Fix | CookieSpec interface modification breaks IgnoreCookiesSpec | 2005-06-10 | gojomo | ck-heritrix |
1218283 | Fix | ARCReader: Bad URI escaping for tab character | 2005-06-10 | nobody | ck-heritrix |
1218958 | Fix | odd preencoded URIs generate error on deserialization | 2005-06-11 | nobody | gojomo |
1219259 | Fix | broad crawls slow; most threads stuck retrying missing sites | 2005-06-12 | gojomo | gojomo |
1219262 | Fix | 'treat seed redirects as new seeds' not working | 2005-06-12 | karl-ia | gojomo |
1219486 | Fix | no rule for decidingscope to always crawl seeds | 2005-06-12 | gojomo | gojomo |
1219715 | Fix | [patch] Signature change broke BucketQueueAssignmentPolicy | 2005-06-13 | nobody | ck-heritrix |
1219854 | Fix | NPE je-2.0 entryToObject(SerialBinding.java:82) | 2005-06-13 | karl-ia | stack-sf |
1220714 | Fix | ExtractorHTML excessive temp strings / OOM | 2005-06-14 | karl-ia | gojomo |
1221570 | Fix | reports (web ui and to disk) don't scale | 2005-06-15 | gojomo | gojomo |
1222229 | Fix | unicode/idn domain names fail (seeds and more?)- punycode | 2005-06-16 | karl-ia | gojomo |
1222360 | Fix | strip-www canon. rule causes failed crawl of netarkivet.dk | 2005-06-16 | karl-ia | stack-sf |
1224531 | Fix | '@' in URI path confuses SURT (bad queues, scope probs, etc) | 2005-06-20 | karl-ia | gojomo |
1226365 | Fix | DomainSensitiveFrontier broken by uri-included-structure | 2005-06-23 | karl-ia | stack-sf |
1226387 | Fix | BdbUriUniqFilter URI fingerprint somewhat collision prone | 2005-06-23 | karl-ia | gojomo |
1226707 | Fix | CandidateURI serialization 'decodes' UURI | 2005-06-23 | karl-ia | stack-sf |
1230180 | Fix | nonstandard port URIs fail with '-50': Surt queue policy bug | 2005-06-30 | karl-ia | gojomo |
1230188 | Fix | DNS prereq problems (-50 fails/repeats): calculateInsertKey | 2005-06-30 | karl-ia | gojomo |
1231123 | Fix | IdentityCachingMapTest FAILED: cache not cleared appropriate | 2005-07-01 | gojomo | stack-sf |
1232402 | Fix | IdentityCachingMapTest fails on fedora | 2005-07-04 | nobody | stack-sf |
1232974 | Fix | WorkQueueFrontier.kickUpdate ClassCastException / unretiring | 2005-07-05 | karl-ia | gojomo |
1236094 | Fix | Possible deadlock situation with ARFrontier | 2005-07-11 | kristinn_sig | kristinn_sig |
1236334 | Fix | Cannot set cachePercentage in bdbje JMX bean | 2005-07-11 | stack-sf | stack-sf |
1236635 | Fix | Link ClassCastException | 2005-07-12 | kristinn_sig | kristinn_sig |
1239155 | Fix | [arcreader] Fails on records that only have headers | 2005-07-15 | karl-ia | stack-sf |
1241851 | Fix | Connection reset error with WebSTAR/3.0.2 web serve | 2005-07-20 | karl-ia | ia_igor |
1242747 | Fix | over-escaping (of '%', etc) compared to browsers | 2005-07-21 | karl-ia | gojomo |
1248942 | Fix | custom robots.txt NPE in 1.4.0 | 2005-07-31 | gojomo | efc |
1249828 | Fix | -5000 out-of-scope preconditions; -50 failure | 2005-08-01 | karl-ia | gojomo |
1250437 | Fix | host with trailing '.' in seeds ruins implied SURT | 2005-08-02 | gojomo | gojomo |
1255137 | Fix | Deferred URI should be 'free' (no cost against queue budget) | 2005-08-09 | karl-ia | gojomo |
1257157 | Fix | create job: empty seeds box, broken settings tab | 2005-08-11 | gojomo | gojomo |
1266713 | Fix | already-seen test not working (1.4.0); canonicalization | 2005-08-22 | nobody | stack-sf |
1276044 | Fix | Prune classes deprecated in 1.4.0 | 2005-08-29 | nobody | stack-sf |
1276201 | Fix | Notification AFTER reports have been created | 2005-08-29 | stack-sf | stack-sf |
1283492 | Fix | mimetype-reports squashes together url-count and byte-count | 2005-09-06 | gojomo | stack-sf |
1284353 | Fix | Help links to user and dev manuals 404 | 2005-09-07 | stack-sf | stack-sf |
1291274 | Fix | StatisticsTracker writes 'not set'. Crawl.log says 'no-type' | 2005-09-14 | karl-ia | stack-sf |
1291305 | Fix | mime type report not consistent with crawl.log | 2005-09-14 | ia_igor | ia_igor |
1306421 | Fix | ExtractorUniversal skips URI with '_' | 2005-09-27 | gojomo | gojomo |
1323281 | Fix | StringIndexOutOfBounds in SurtAuthorityQueueAssignmentPolicy | 2005-10-10 | gojomo | gojomo |
1323287 | Fix | BDB ArrayIndexOutOfBoundsException; corrupt queue | 2005-10-10 | karl-ia | gojomo |
1323323 | Fix | IDNAException: String too long / fixupDomainLabel in stdout | 2005-10-10 | karl-ia | gojomo |
1323373 | Fix | NPE in CrawlJobHandler.startNextJobInternal | 2005-10-10 | karl-ia | gojomo |
1324245 | Fix | implied-HTTP seeds with port numbers get -5000 errs | 2005-10-11 | karl-ia | gojomo |
1325304 | Fix | 'expert' view toggle does inadvertent submit | 2005-10-12 | karl-ia | gojomo |
1332722 | Fix | time-based averages wrong after checkpoint | 2005-10-19 | karl-ia | gojomo |
1333669 | Fix | delete function in "View or Edit Frontier URIs" Unsupported | 2005-10-20 | karl-ia | stack-sf |
1344265 | Fix | Laxuri code encodes tildes. | 2005-10-31 | karl-ia | stack-sf |
1351818 | Fix | Investigate spate of recent OOMEs | 2005-11-08 | gojomo | stack-sf |
1357528 | Fix | Double JMX registration problem on remote creation | 2005-11-15 | stack-sf | stack-sf |
1358542 | Fix | DevUtils.extraInfo shows 2 legend lines, no content | 2005-11-16 | gojomo | gojomo |
1358567 | Fix | transient JSP NPE after starting job | 2005-11-16 | stack-sf | gojomo |
1369177 | Fix | NPE in quotaEnforcer, if hostname for URI can't be resolved | 2005-11-29 | nobody | svc |
1369619 | Fix | NPE in QuotaEnforcer | 2005-11-29 | gojomo | gojomo |
1370743 | Fix | properties-based supported/ignored scheme setting broken | 2005-12-01 | gojomo | gojomo |
1370761 | Fix | ArithmeticException: / by 0 WorkQueueFrontier.averageDepth | 2005-12-01 | gojomo | gojomo |
1325230 | Fix | jmx import of seeds not working | 2005-10-12 14:29 | karl-ia | stack-sf |
1324989 | Fix | Queue counts wrong after checkpointing | 2005-10-12 09:10 | karl-ia | stack-sf |
1123230 | Fix | Out Of Memory after creating multiple jobs | 2005-02-15 07:51 | karl-ia | nobody |
1322280 | Fix | "Failed getPath" alerts (from RobotsExclusionPolicy) | 2005-10-10 02:42 | karl-ia | gojomo |
1322264 | Fix | NumberFormatException in FetchHTTP.innerProcess | 2005-10-10 02:34 | karl-ia | gojomo |