14. Release 1.2.0 - 2004-11-16

Abstract

Added IP-based politeness, configurable URI canonicalization, and mid-fetch abort, plus many bug fixes.

14.1. Known Limitations

14.1.1. IBM JVM

The IBM JVM generally performs better than Sun JVMs and emits more detailed heap dumps. That said, new Heritrix 1.2.0 features may not work on the IBM JVM.

14.1.1.1. HTTPS

Heritrix 1.2.0 uses the new HttpClient 3.0.x library, which allows socket read timeouts to be set. Connections to HTTPS sites fail when Heritrix runs on the IBM JVM.

IBM JVM 1.4.1 (cxia321411-20030930) throws a NullPointerException when HttpClient sets TCP_NODELAY on the socket:

java.lang.NullPointerException
   at com.ibm.jsse.bf.setTcpNoDelay(Unknown Source)
   at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:683)
   at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1328)

With IBM JVM 1.4.2, the SSL connection is reported as not open when we go to use its input streams:

java.net.SocketException: Socket is not connected
   at java.net.Socket.getInputStream(Socket.java:726)
   at com.ibm.jsse.bs.getInputStream(Unknown Source)
   at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:715)
   at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1328)

Newer versions of the HttpClient library may address this (the current version is alpha2).
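Both stack traces above fail inside ordinary socket calls: setting TCP_NODELAY and obtaining the input stream. As a minimal sketch of that call sequence, the following uses only plain java.net sockets on the loopback interface. It does not involve HttpClient or JSSE and does not reproduce the IBM JVM failures; the class and method names are hypothetical, and the syntax assumes a modern Java rather than the 1.4 JVMs discussed here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical sketch: the socket calls named in the stack traces, on a plain socket. */
class SocketOptionsSketch {

    /**
     * Connects to a local server that never sends data, applies TCP_NODELAY
     * and a read timeout, and reports whether the read timed out.
     * A timeout of 0 means "wait forever", so pass a positive value.
     */
    static boolean readTimesOut(int timeoutMillis) {
        // Port 0 asks the OS for any free port on the loopback interface.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            client.setTcpNoDelay(true);          // the call that NPEs in the first trace
            client.setSoTimeout(timeoutMillis);  // socket read timeout (SO_TIMEOUT)
            InputStream in = client.getInputStream(); // the call that fails in the second trace
            try {
                in.read();                       // the server never writes, so this blocks
                return false;
            } catch (SocketTimeoutException expected) {
                return true;                     // the read timeout fired
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

Running this on a given JVM only confirms that the plain-socket versions of these calls behave as expected; the failures reported above occur in the IBM JSSE socket implementation.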

14.1.2. Jobs don't show in UI when a bunch are run back-to-back

If more than one job is waiting in the queue of pending jobs, the second job often won't show in the UI: the UI says it is running, but no status bar appears for the running job. See [ 1024120 ] Lost crawl job after terminate running job with jobs pending. For now, the workaround is to monitor the running job by viewing the crawl job logs on disk. (Oddly, the third queued job will show in the UI again.)
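One way to follow a job from its logs on disk is to print the last few lines of a log file. The sketch below is a hypothetical helper, not part of Heritrix: the class name, the default of 20 lines, and the choice of which log to read are all assumptions, and the path to the log is supplied by the operator on the command line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper: prints the tail of a crawl job log. */
class CrawlLogTail {

    /** Returns the last n lines from the source, holding at most n lines in memory. */
    static List<String> lastLines(Reader source, int n) {
        Deque<String> tail = new ArrayDeque<>();
        if (n <= 0) {
            return new ArrayList<>(tail);
        }
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (tail.size() == n) {
                    tail.removeFirst();   // drop the oldest line to keep the window at n
                }
                tail.addLast(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ArrayList<>(tail);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java CrawlLogTail.java <path-to-log>");
            return;
        }
        // ISO-8859-1 maps every byte to a character, so unusual bytes in
        // logged URIs cannot abort the read with a decoding error.
        try (Reader file = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.ISO_8859_1)) {
            for (String line : lastLines(file, 20)) {
                System.out.println(line);
            }
        }
    }
}
```

The bounded deque still scans the whole file, but it avoids loading a large crawl log into memory all at once.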

14.1.3. Running more than one job in series throws OOME

OutOfMemoryErrors are frequent when jobs are run in series. See [ 1055592 ] terminated crawl still hogging memory, causing OOM. For now, restart Heritrix between jobs.

14.2. Changes

Table 7. Changes

| ID | Type | Summary | Open Date | By |
|---|---|---|---|---|
| 1067095 | Add | Hang in http fetcher when mid-fetch aborts | 2004-11-15 | stack-sf |
| 1066804 | Add | Allow specification of heritrix_out.log filename | 2004-11-15 | stack-sf |
| 903845 | Add | IP-based politeness | 2004-10-28 | gojomo |
| 1054849 | Add | Recover from crawl initialized with a recovery log | 2004-10-26 | stack-sf |
| 1054851 | Add | Import gzipped or non-gzipped recovery log | 2004-10-26 | stack-sf |
| 1050378 | Add | Add bdb alreadyseen option to hostsqueuesfrontier | 2004-10-19 | stack-sf |
| 973881 | Add | Force generation of report files | 2004-06-16 | stack-sf |
| 1010883 | Add | Scripts to generate end-of-job reports | 2004-08-17 | danavery |
| 988277 | Add | [Need feedback] "Done with ARC file" event | 2004-07-09 | stack-sf |
| 1044977 | Add | Logging of scope-rejected URIs | 2004-10-11 | stack-sf |
| 902970 | Add | HTTPClient should use supplied IP / avoid DNS lookup | 2004-02-23 | stack-sf |
| 903093 | Add | Setting of Integer.MAX_VALUE is ugly | 2004-02-23 | stack-sf |
| 900004 | Add | canonicalization of URIs for alreadyIncluded testing | 2004-02-18 | stack-sf |
| 941072 | Add | Allow operator-configured mid-HTTP-fetch filters | 2004-04-23 | stack-sf |
| 1037891 | Add | Cmdline defaults in properties file | 2004-09-30 | stack-sf |
| 1037304 | Add | Upgrade httpclient to 3.0.x | 2004-09-29 | stack-sf |
| 994141 | Add | Update build to use maven 1.0 | 2004-07-19 | stack-sf |
| 1002336 | Add | Figure what profiler to use | 2004-08-02 | stack-sf |
| 1064887 | Fix | http and https prerequisites contention | 2004-11-11 | stack-sf |
| 1062604 | Fix | Seed to SURT coversion issues | 2004-11-08 | stack-sf |
| 1061795 | Fix | ServerCache HashMaps access thread-safety | 2004-11-06 | gojomo |
| 1060589 | Fix | Can't open logs of old jobs post-restart in UI | 2004-11-04 | stack-sf |
| 1058565 | Fix | Non-default 'logs' location doesn't show in web UI | 2004-11-01 | stack-sf |
| 1058568 | Fix | IMG 'lowsrc' may not be extracted | 2004-11-01 | stack-sf |
| 1055854 | Fix | completed crawls show as 'aborted by user' | 2004-10-27 | gojomo |
| 1059237 | Fix | MultiThreadedHttpConnectionManager https already connected | 2004-11-02 | stack-sf |
| 1052578 | Fix | recovery log of recovered crawl insufficient to recover | 2004-10-22 | stack-sf |
| 908690 | Fix | Some dates are GMT, others are not * | 2004-03-02 | gojomo |
| 958096 | Fix | Flushing CrawlServers problematic * | 2004-05-21 | gojomo |
| 1052570 | Fix | Threads contend for scratch files (after kill/readFully/Gap) | 2004-10-22 | gojomo |
| 1033701 | Fix | incorrect number of total active threads * | 2004-09-23 | gojomo |
| 1000840 | Fix | diskincludedfrontier performance is awful | 2004-07-30 | gojomo |
| 1043251 | Fix | better/longer dns retries on lookup failure | 2004-10-08 | gojomo |
| 1051072 | Fix | ExtractorHTML takes forever on worst-case HTML | 2004-10-20 | gojomo |
| 1051916 | Fix | ExtractorJS takes forever on worst-case JS | 2004-10-21 | gojomo |
| 1050238 | Fix | jdk required (doc implies jre) | 2004-10-19 | stack-sf |
| 1038135 | Fix | prerequisite hysteresis/robots ahead of dns | 2004-09-30 | gojomo |
| 1015728 | Fix | Crawl upper time/size bounds ignored | 2004-08-24 | gojomo |
| 1002356 | Fix | timing issue on crawl-start & run-time stat | 2004-08-02 | gojomo |
| 1002332 | Fix | inactiveQueuesMemoryLoadTarget mechanism behaves poorly | 2004-08-02 | gojomo |
| 1045016 | Fix | DNS URIs don't get override settings | 2004-10-11 | gojomo |
| 998184 | Fix | Gzipped recover log corrupt at end; last < 32K unrecoverable | 2004-07-26 | gojomo |
| 998272 | Fix | No crawl if host-queues-memory-capacity = 0 | 2004-07-26 | stack-sf |
| 1002335 | Fix | frontier report unusable in big crawls; frontier info needed | 2004-08-02 | gojomo |
| 984390 | Fix | Build fails: "rws" mode and Mac OS X interact badly | 2004-07-02 | stack-sf |
| 1000929 | Fix | fatal runtimeexceptions in frontier give no info in web UI | 2004-07-30 | gojomo |
| 964625 | Fix | seed parser *too* lenient | 2004-06-01 | johnerik |
| 980051 | Fix | Auth unsupported logged to console | 2004-06-25 | stack-sf |
| 1002146 | Fix | bad queue keys: shouldn't be URIs; should be handled better | 2004-08-02 | stack-sf |
| 1046696 | Fix | UURIFactory.validateEscaping() -> IllegalArgumentException | 2004-10-13 | stack-sf |
| 1045736 | Fix | ARCReader crashes if zero-length gzip record | 2004-10-12 | stack-sf |
| 1002144 | Fix | [UURI] Catch bad-encoding earlier | 2004-08-02 | stack-sf |
| 1036680 | Fix | PathDepthFilter innerAccepts SEVERE log: "Failed getPath..." | 2004-09-28 | stack-sf |
| 1045847 | Fix | Unnecessary toString() in ExtractorHTML.processScriptCode() | 2004-10-12 | gojomo |
| 1044527 | Fix | Domain names in 'overrides' are not in alphabetical order | 2004-10-11 | kristinn_sig |
| 1012639 | Fix | If CC timesout selftest, no build failed message | 2004-08-19 | stack-sf |
| 1012642 | Fix | selftest hanging because no crawl stop event | 2004-08-19 | stack-sf |
| 931565 | Fix | CrawlStateUpdater - NullPointerException | 2004-04-08 | stack-sf |
| 973294 | Fix | NoSuchElementException in URI queues halts crawling | 2004-06-15 | gojomo |
| 1033657 | Fix | [UURI] >2047 AFTER escaping (Stops crawl) | 2004-09-23 | stack-sf |
| 1010966 | Fix | crawl.log has URIs with spaces in them | 2004-08-17 | stack-sf |
| 963970 | Fix | unfetchable URI schemes should never be queued | 2004-05-31 | gojomo |
| 1031607 | Fix | KeyedQueue server<->key mismatch noted: pfbuser<->mprsrv.agr | 2004-09-20 | stack-sf |
| 1031525 | Fix | NPE reading override | 2004-09-20 | stack-sf |
| 1031168 | Fix | Wrong handling of date in ARCRecordMetaData | 2004-09-20 | johnerik |