11. Release 1.8.0 - 2006-05-05

Abstract

Release 1.8.0 adds a number of minor improvements and fixes. Most notably, checkpointing can now be achieved with a single command (with the requisite pause/resume done automatically), and all URIs fetched may be tagged with the original seed URI from which they were discovered. (This source URI information appears both in the crawl.log and in a new 'source-report.txt' report available among the disk file reports.)

We expect release 1.8.0 to be the last release officially supported on JDK 1.4.x ("Java 2"); future releases will require JDK 1.5.x ("Java 5") facilities.

11.1. Known Limitations/Issues

11.1.1. java.io.IOException: No locks available

BDB-JE will complain 'No locks available' when the crawler is built or run on an NFS mount. The workaround is to locate the 'state' directory on a non-NFS-mounted volume.

11.1.2. "Channel closed, may be due to thread interrupt"

An error with this message has been observed intermittently when running on the Sun Java 6 ("mustang") beta JVM ("-beta2-b81"). A forthcoming fix from Sleepycat for BDB-JE may be necessary to resolve this issue.

11.2. Changes

11.2.1. Progress Statistics Log

The format of the progress statistics log's state-change messages has been modified. State-change messages now have a tail that adds some context explaining why the crawler is pausing, etc. Note that after 1.8.0 we will be adding the originator of the state-change event to the progress statistics log (that is, whether the event came via JMX or via the UI), so be prepared for more progress log changes.

11.2.2. Checkpoints

Now when you ask to checkpoint a running crawl, the crawler manages the pause, checkpoint, and resume cycle for you. (If the crawl is already paused when the checkpoint is invoked, the crawler is set back into a paused state upon checkpoint completion.)
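The cycle described above can be sketched as follows. This is only an illustration of the behavior, not Heritrix code: the class SketchCrawler, its State enum, and all method names are hypothetical and do not correspond to the actual CrawlController API.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    PAUSED = "paused"


class SketchCrawler:
    """Hypothetical stand-in for a crawl controller; not the Heritrix API."""

    def __init__(self, state=State.RUNNING):
        self.state = state
        self.events = []  # ordered record of what happened, for inspection

    def pause(self):
        self.state = State.PAUSED
        self.events.append("pause")

    def resume(self):
        self.state = State.RUNNING
        self.events.append("resume")

    def _write_checkpoint(self):
        # In this sketch a checkpoint may only be written while paused.
        if self.state is not State.PAUSED:
            raise RuntimeError("checkpoint requires a paused crawl")
        self.events.append("checkpoint")

    def checkpoint(self):
        """Single command: pause if needed, checkpoint, restore prior state."""
        was_running = self.state is State.RUNNING
        if was_running:
            self.pause()
        try:
            self._write_checkpoint()
        finally:
            # Resume only if the crawl was running when we were invoked;
            # a crawl that was already paused stays paused.
            if was_running:
                self.resume()
```

Calling checkpoint() on a running SketchCrawler records pause, checkpoint, resume and leaves it running; calling it on a paused one records only checkpoint and leaves it paused.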

Checkpoints made with 1.6.0 software cannot be recovered with 1.8.0 software. Core classes such as CrawlController have changed, so their serialized representation as part of a checkpoint has also changed. (We have not done the work to deserialize earlier versions of core classes serialized as part of a checkpoint.)

Table 4. All Tracked Changes

ID | Type | Summary | Open Date | By | Filer
1440656 | Fix | upping total budget doesn't update/unretire queues | 2006-02-28 | karl-ia | gojomo
1482761 | Fix | BDB Adler32 gc-lock OOME risk | 2006-05-05 | stack-sf | gojomo
1371195 | Fix | [jmx] Make downloaded data count have constant units | 2005-12-01 | stack-sf | stack-sf
1371326 | Fix | refactor/compact QuotaEnforcer code | 2005-12-01 | stack-sf | gojomo
1379208 | Fix | crawl report/hosts-report stats leave out robots.txt | 2005-12-12 | gojomo | gojomo
1415940 | Fix | Failed deregistration of container with jndi | 2006-01-26 | stack-sf | stack-sf
1415942 | Fix | When multiple instances, there's always a runt in the litter | 2006-01-26 | stack-sf | stack-sf
1417062 | Fix | JMX get alert by index broken. | 2006-01-27 | stack-sf | stack-sf
1419272 | Fix | Corrupt job.state files obstruct crawl resumption | 2006-01-30 | stack-sf | stack-sf
1442207 | Fix | stop alerts 'line in seed file ignored' for mixed seed/surt | 2006-03-02 | gojomo | ia_igor
1462407 | Fix | IllegalArgumentException adding to source host report | 2006-03-31 | stack-sf | stack-sf
1465369 | Fix | make_reports.pl outdated, broken | 2006-04-05 | gojomo | gojomo
1475730 | Fix | OnHostsDecideRule/OnDomainsDecideRule not adding seed SURTs | 2006-04-24 | gojomo | gojomo
1475638 | Fix | Robots.txt ignored if 206/203 Status Code | 2006-04-24 | gojomo | stack-sf
1395637 | Fix | crawl.log entires do not reflect 'no space left' error | 2006-01-02 | karl-ia | ia_igor
1400646 | Fix | ExtractorHTML/ExtractorJS 'hang' on many-backslash input | 2006-01-09 | karl-ia | gojomo
1404316 | Fix | ExtractorCSS does not resolve relative URIs against BASE | 2006-01-12 | karl-ia | ia_igor
1392104 | Fix | ExtractorJS NPE doing speculative extraction | 2005-12-28 | karl-ia | stack-sf
1387423 | Add | [arcreader] Fetch records and iterate remote ARCs | 2005-12-21 | stack-sf | stack-sf
1371178 | Add | [jmx] Add name of heritrix 'host' as att | 2005-12-01 | stack-sf | stack-sf
1233079 | Add | replace util.concurrent with java.util.concurrent | 2005-07-05 | gojomo | gojomo
1371202 | Add | [jmx] Regularize crawl end state messages | 2005-12-01 | stack-sf | stack-sf
1365804 | Add | JmxUtils.getOpenType() must handle Doubles | 2005-11-24 | stack-sf | nobody
1374947 | Add | [jmx] progress statistics as notification | 2005-12-06 | stack-sf | stack-sf
1388275 | Add | [contrib] Preselector ATTR_ALLOW_BY_REGEXP | 2005-12-22 | stack-sf | stack-sf
1393254 | Add | 'total' bytes/fetches quota options in QuotaEnforcer | 2005-12-29 | gojomo | gojomo
1119608 | Add | Carry forward (& log) 'originating URL/seed' for all URLs | 2005-02-09 | gojomo | gojomo
1358617 | Add | Add destroy to JMX API | 2005-11-16 | karl-ia | stack-sf
1445970 | Add | New "seed source report" of # of URLs per host per source | 2006-03-08 | karl-ia | stack-sf
1436290 | Add | improve surt docs; esp 'surts-source-file' syntax | 2006-02-21 | nobody | gojomo
1302207 | Add | unattended checkpointing | 2005-09-23 | karl-ia | gojomo