Abstract
Release 1.8.0 adds a number of minor improvements and fixes. Most notably, checkpointing can now be done with a single command (the requisite pause and resume happen automatically), and every fetched URI can be tagged with the original seed URI from which it was discovered. (This source URI information appears both in crawl.log and in a new 'source-report.txt' report available among the disk file reports.)
We expect release 1.8.0 to be the last release officially supported on JDK 1.4.x ("Java 2"); future releases will require JDK 1.5.x ("Java 5") facilities.
BDB-JE will complain 'No locks available' when the crawler is built or run on an NFS mount. The workaround is to locate the 'state' directory on a non-NFS-mounted volume.
The format of the progress statistics' state-change log messages has been modified. State-change messages now have a tail that adds some context explaining why we're pausing, etc. Note that after 1.8.0 we will be adding the originator of the status-change event to the progress statistics log (i.e. whether the event came from JMX or via the UI), so be prepared for more progress log changes.
When you ask to checkpoint a running crawl, the crawler now manages the pause, checkpoint, and resume cycle for you. If the crawl is already paused when the checkpoint is invoked, the crawler is returned to the paused state when the checkpoint completes.
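The cycle described above can be sketched as follows. This is a minimal illustration only, not Heritrix code: `SketchCrawler`, `CheckpointCycleSketch`, and all of their methods are hypothetical names invented for this example, and the sketch assumes a checkpoint may only be written while the crawl is paused.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a crawl controller; NOT the Heritrix API. */
class SketchCrawler {
    private boolean paused;
    /** Records the order of operations so the cycle can be inspected. */
    final List<String> events = new ArrayList<>();

    SketchCrawler(boolean initiallyPaused) {
        this.paused = initiallyPaused;
    }

    boolean isPaused() {
        return paused;
    }

    void pause() {
        paused = true;
        events.add("pause");
    }

    void resume() {
        paused = false;
        events.add("resume");
    }

    /** Assumption: a checkpoint can only be taken while paused. */
    void writeCheckpoint() {
        if (!paused) {
            throw new IllegalStateException("crawl must be paused to checkpoint");
        }
        events.add("checkpoint");
    }
}

class CheckpointCycleSketch {
    /**
     * Pauses if needed, checkpoints, then restores the state the crawl
     * was in before the call (running stays running, paused stays paused).
     */
    static void checkpoint(SketchCrawler crawler) {
        boolean wasPaused = crawler.isPaused();
        if (!wasPaused) {
            crawler.pause();
        }
        try {
            crawler.writeCheckpoint();
        } finally {
            // Only resume a crawl that this method paused itself.
            if (!wasPaused) {
                crawler.resume();
            }
        }
    }

    public static void main(String[] args) {
        SketchCrawler running = new SketchCrawler(false);
        checkpoint(running);
        System.out.println("was running: " + running.events
                + " paused=" + running.isPaused());

        SketchCrawler alreadyPaused = new SketchCrawler(true);
        checkpoint(alreadyPaused);
        System.out.println("was paused:  " + alreadyPaused.events
                + " paused=" + alreadyPaused.isPaused());
    }
}
```

The point of the sketch is that the prior state is remembered before pausing, so a crawl paused by the operator is never resumed as a side effect of checkpointing.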
Checkpoints made with 1.6.0 software cannot be recovered with 1.8.0 software. Core classes such as CrawlController have changed, so their serialized representation within a checkpoint has also changed. (We have not done the work to deserialize earlier versions of core classes serialized as part of a checkpoint.)
Table 4. All Tracked Changes
ID | Type | Summary | Open Date | By | Filer |
---|---|---|---|---|---|
1440656 | Fix | upping total budget doesn't update/unretire queues | 2006-02-28 | karl-ia | gojomo |
1482761 | Fix | BDB Adler32 gc-lock OOME risk | 2006-05-05 | stack-sf | gojomo |
1371195 | Fix | [jmx] Make downloaded data count have constant units | 2005-12-01 | stack-sf | stack-sf |
1371326 | Fix | refactor/compact QuotaEnforcer code | 2005-12-01 | stack-sf | gojomo |
1379208 | Fix | crawl report/hosts-report stats leave out robots.txt | 2005-12-12 | gojomo | gojomo |
1415940 | Fix | Failed deregistration of container with jndi | 2006-01-26 | stack-sf | stack-sf |
1415942 | Fix | When multiple instances, there's always a runt in the litter | 2006-01-26 | stack-sf | stack-sf |
1417062 | Fix | JMX get alert by index broken. | 2006-01-27 | stack-sf | stack-sf |
1419272 | Fix | Corrupt job.state files obstruct crawl resumption | 2006-01-30 | stack-sf | stack-sf |
1442207 | Fix | stop alerts 'line in seed file ignored' for mixed seed/surt | 2006-03-02 | gojomo | ia_igor |
1462407 | Fix | IllegalArgumentException adding to source host report | 2006-03-31 | stack-sf | stack-sf |
1465369 | Fix | make_reports.pl outdated, broken | 2006-04-05 | gojomo | gojomo |
1475730 | Fix | OnHostsDecideRule/OnDomainsDecideRule not adding seed SURTs | 2006-04-24 | gojomo | gojomo |
1475638 | Fix | Robots.txt ignored if 206/203 Status Code | 2006-04-24 | gojomo | stack-sf |
1395637 | Fix | crawl.log entries do not reflect 'no space left' error | 2006-01-02 | karl-ia | ia_igor |
1400646 | Fix | ExtractorHTML/ExtractorJS 'hang' on many-backslash input | 2006-01-09 | karl-ia | gojomo |
1404316 | Fix | ExtractorCSS does not resolve relative URIs against BASE | 2006-01-12 | karl-ia | ia_igor |
1392104 | Fix | ExtractorJS NPE doing speculative extraction | 2005-12-28 | karl-ia | stack-sf |
1387423 | Add | [arcreader] Fetch records and iterate remote ARCs | 2005-12-21 | stack-sf | stack-sf |
1371178 | Add | [jmx] Add name of heritrix 'host' as att | 2005-12-01 | stack-sf | stack-sf |
1233079 | Add | replace util.concurrent with java.util.concurrent | 2005-07-05 | gojomo | gojomo |
1371202 | Add | [jmx] Regularize crawl end state messages | 2005-12-01 | stack-sf | stack-sf |
1365804 | Add | JmxUtils.getOpenType() must handle Doubles | 2005-11-24 | stack-sf | nobody |
1374947 | Add | [jmx] progress statistics as notification | 2005-12-06 | stack-sf | stack-sf |
1388275 | Add | [contrib] Preselector ATTR_ALLOW_BY_REGEXP | 2005-12-22 | stack-sf | stack-sf |
1393254 | Add | 'total' bytes/fetches quota options in QuotaEnforcer | 2005-12-29 | gojomo | gojomo |
1119608 | Add | Carry forward (& log) 'originating URL/seed' for all URLs | 2005-02-09 | gojomo | gojomo |
1358617 | Add | Add destroy to JMX API | 2005-11-16 | karl-ia | stack-sf |
1445970 | Add | New "seed source report" of # of URLs per host per source | 2006-03-08 | karl-ia | stack-sf |
1436290 | Add | improve surt docs; esp 'surts-source-file' syntax | 2006-02-21 | nobody | gojomo |
1302207 | Add | unattended checkpointing | 2005-09-23 | karl-ia | gojomo |