Abstract
Added IP-based politeness, configurable URI canonicalization, and mid-fetch abort. Many bug fixes.
The IBM JVM generally performs better than Sun JVMs and emits more detailed heap dumps. That said, new Heritrix 1.2.0 features may not work on the IBM JVM.
Heritrix 1.2.0 uses the new HttpClient 3.0.x library, which allows socket read timeouts to be set. Connections to https sites fail when using the IBM JVM.
IBM JVM 1.4.1 (cxia321411-20030930) throws a NullPointerException when TCP no-delay is set:
java.lang.NullPointerException at com.ibm.jsse.bf.setTcpNoDelay(Unknown Source) at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:683) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1328)
With IBM JVM 1.4.2, the SSL connection is reported as not connected when we go to use its input streams:
java.net.SocketException: Socket is not connected at java.net.Socket.getInputStream(Socket.java:726) at com.ibm.jsse.bs.getInputStream(Unknown Source) at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:715) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1328)
Newer versions of the HttpClient library may address this (the current version is alpha2).
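The socket read timeout mentioned above can be illustrated at the socket level. The following is a minimal sketch using only plain `java.net` classes, not the HttpClient API and not Heritrix code; the class name `ReadTimeoutSketch` and the method `readTimesOut` are hypothetical names chosen for illustration. It assumes a local server that accepts a connection but never sends data, so the client read can only end via the timeout.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical illustration class; not part of Heritrix or HttpClient.
class ReadTimeoutSketch {

    /**
     * Connects to a local server that never sends data and reports whether
     * the blocked read was cut short by the socket read timeout.
     */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        // Bind to an ephemeral loopback port; the server side stays silent.
        try (ServerSocket server =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client =
                 new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            // SO_TIMEOUT: the longest a single read may block, in milliseconds.
            client.setSoTimeout(timeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read(); // blocks, because the peer never writes
                return false; // data or end-of-stream arrived before the timeout
            } catch (SocketTimeoutException e) {
                return true; // the read was abandoned after timeoutMillis
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

Without a read timeout, a fetch from a server that stops responding mid-transfer could block its thread indefinitely, which is presumably why the ability to set one matters for a crawler.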
If more than one job is waiting in the queue of pending jobs, the second job often does not show in the UI: the UI says it is running, but no status bar is visible for the running job. See [ 1024120 ] Lost crawl job after terminate running job with jobs pending. For now, the workaround is to monitor the running job by viewing its crawl job logs on disk. (Oddly, the third queued job shows in the UI again.)
OutOfMemoryErrors are frequent when jobs are run in series. See [ 1055592 ] terminated crawl still hogging memory, causing OOM. For now, restart Heritrix between jobs.
Table 7. Changes
ID | Type | Summary | Open Date | By |
---|---|---|---|---|
1067095 | Add | Hang in http fetcher when mid-fetch aborts | 2004-11-15 | stack-sf |
1066804 | Add | Allow specification of heritrix_out.log filename | 2004-11-15 | stack-sf |
903845 | Add | IP-based politeness | 2004-10-28 | gojomo |
1054849 | Add | Recover from crawl initialized with a recovery log | 2004-10-26 | stack-sf |
1054851 | Add | Import gzipped or non-gzipped recovery log | 2004-10-26 | stack-sf |
1050378 | Add | Add bdb alreadyseen option to hostsqueuesfrontier | 2004-10-19 | stack-sf |
973881 | Add | Force generation of report files | 2004-06-16 | stack-sf |
1010883 | Add | Scripts to generate end-of-job reports | 2004-08-17 | danavery |
988277 | Add | [Need feedback] "Done with ARC file" event | 2004-07-09 | stack-sf |
1044977 | Add | Logging of scope-rejected URIs | 2004-10-11 | stack-sf |
902970 | Add | HTTPClient should use supplied IP / avoid DNS lookup | 2004-02-23 | stack-sf |
903093 | Add | Setting of Integer.MAX_VALUE is ugly | 2004-02-23 | stack-sf |
900004 | Add | canonicalization of URIs for alreadyIncluded testing | 2004-02-18 | stack-sf |
941072 | Add | Allow operator-configured mid-HTTP-fetch filters | 2004-04-23 | stack-sf |
1037891 | Add | Cmdline defaults in properties file | 2004-09-30 | stack-sf |
1037304 | Add | Upgrade httpclient to 3.0.x | 2004-09-29 | stack-sf |
994141 | Add | Update build to use maven 1.0 | 2004-07-19 | stack-sf |
1002336 | Add | Figure what profiler to use | 2004-08-02 | stack-sf |
1064887 | Fix | http and https prerequisites contention | 2004-11-11 | stack-sf |
1062604 | Fix | Seed to SURT conversion issues | 2004-11-08 | stack-sf |
1061795 | Fix | ServerCache HashMaps access thread-safety | 2004-11-06 | gojomo |
1060589 | Fix | Can't open logs of old jobs post-restart in UI | 2004-11-04 | stack-sf |
1058565 | Fix | Non-default 'logs' location doesn't show in web UI | 2004-11-01 | stack-sf |
1058568 | Fix | IMG 'lowsrc' may not be extracted | 2004-11-01 | stack-sf |
1055854 | Fix | completed crawls show as 'aborted by user' | 2004-10-27 | gojomo |
1059237 | Fix | MultiThreadedHttpConnectionManager https already connected | 2004-11-02 | stack-sf |
1052578 | Fix | recovery log of recovered crawl insufficient to recover | 2004-10-22 | stack-sf |
908690 | Fix | Some dates are GMT, others are not * | 2004-03-02 | gojomo |
958096 | Fix | Flushing CrawlServers problematic * | 2004-05-21 | gojomo |
1052570 | Fix | Threads contend for scratch files (after kill/readFully/Gap) | 2004-10-22 | gojomo |
1033701 | Fix | incorrect number of total active threads * | 2004-09-23 | gojomo |
1000840 | Fix | diskincludedfrontier performance is awful | 2004-07-30 | gojomo |
1043251 | Fix | better/longer dns retries on lookup failure | 2004-10-08 | gojomo |
1051072 | Fix | ExtractorHTML takes forever on worst-case HTML | 2004-10-20 | gojomo |
1051916 | Fix | ExtractorJS takes forever on worst-case JS | 2004-10-21 | gojomo |
1050238 | Fix | jdk required (doc implies jre) | 2004-10-19 | stack-sf |
1038135 | Fix | prerequisite hysteresis/robots ahead of dns | 2004-09-30 | gojomo |
1015728 | Fix | Crawl upper time/size bounds ignored | 2004-08-24 | gojomo |
1002356 | Fix | timing issue on crawl-start & run-time stat | 2004-08-02 | gojomo |
1002332 | Fix | inactiveQueuesMemoryLoadTarget mechanism behaves poorly | 2004-08-02 | gojomo |
1045016 | Fix | DNS URIs don't get override settings | 2004-10-11 | gojomo |
998184 | Fix | Gzipped recover log corrupt at end; last < 32K unrecoverable | 2004-07-26 | gojomo |
998272 | Fix | No crawl if host-queues-memory-capacity = 0 | 2004-07-26 | stack-sf |
1002335 | Fix | frontier report unusable in big crawls; frontier info needed | 2004-08-02 | gojomo |
984390 | Fix | Build fails: "rws" mode and Mac OS X interact badly | 2004-07-02 | stack-sf |
1000929 | Fix | fatal runtimeexceptions in frontier give no info in web UI | 2004-07-30 | gojomo |
964625 | Fix | seed parser *too* lenient | 2004-06-01 | johnerik |
980051 | Fix | Auth unsupported logged to console | 2004-06-25 | stack-sf |
1002146 | Fix | bad queue keys: shouldn't be URIs; should be handled better | 2004-08-02 | stack-sf |
1046696 | Fix | UURIFactory.validateEscaping() -> IllegalArgumentException | 2004-10-13 | stack-sf |
1045736 | Fix | ARCReader crashes if zero-length gzip record | 2004-10-12 | stack-sf |
1002144 | Fix | [UURI] Catch bad-encoding earlier | 2004-08-02 | stack-sf |
1036680 | Fix | PathDepthFilter innerAccepts SEVERE log: "Failed getPath..." | 2004-09-28 | stack-sf |
1045847 | Fix | Unnecessary toString() in ExtractorHTML.processScriptCode() | 2004-10-12 | gojomo |
1044527 | Fix | Domain names in 'overrides' are not in alphabetical order | 2004-10-11 | kristinn_sig |
1012639 | Fix | If CC timesout selftest, no build failed message | 2004-08-19 | stack-sf |
1012642 | Fix | selftest hanging because no crawl stop event | 2004-08-19 | stack-sf |
931565 | Fix | CrawlStateUpdater - NullPointerException | 2004-04-08 | stack-sf |
973294 | Fix | NoSuchElementException in URI queues halts crawling | 2004-06-15 | gojomo |
1033657 | Fix | [UURI] >2047 AFTER escaping (Stops crawl) | 2004-09-23 | stack-sf |
1010966 | Fix | crawl.log has URIs with spaces in them | 2004-08-17 | stack-sf |
963970 | Fix | unfetchable URI schemes should never be queued | 2004-05-31 | gojomo |
1031607 | Fix | KeyedQueue server<->key mismatch noted: pfbuser<->mprsrv.agr | 2004-09-20 | stack-sf |
1031525 | Fix | NPE reading override | 2004-09-20 | stack-sf |
1031168 | Fix | Wrong handling of date in ARCRecordMetaData | 2004-09-20 | johnerik |