13. Release 1.4.0 - 2005-04-28

Abstract

Much improved memory usage, new scoping/filter model, and a new revisiting frontier. Over 90 bugs fixed.

13.1. Known Limitations/Issues

13.1.1. Glibc 2.3.2 and NPTL

NPTL is the 'new' Linux threading model. It replaces linuxthreads, the 'old' model. You can tell you're running NPTL if your java process shows as a single process in the process listing. With linuxthreads, all java threads show as distinct Linux processes. Linux threading is integral to glibc.

On rare occasions we've seen the crawler hang without obvious explanation when running with NPTL threading on Linux. In a thread dump of one such hung crawler, threads are waiting to obtain a lock that no one apparently holds. Our reading is that these rare, crawl-killing hangs are a problem in glibc2.3.2 when running with NPTL (NPTL 0.60). We used to hang frequently, but workarounds seem to have made lockups extremely rare. An upgrade to glibc2.3.3+ seems to do away with these hangs. Glibc2.3.3 has NPTL 0.61. Fedora3 has glibc2.3.4. If an upgrade is not possible -- for example, the new glibc is not currently available for debian -- you can disable NPTL and run with old threads by setting the environment variable LD_ASSUME_KERNEL=2.4.1. You can set this environment variable on a per-process basis.
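As a minimal sketch of the per-process workaround described above, the variable can be set for a single command only, leaving the rest of the shell session untouched. The function name run_with_linuxthreads and the launcher path in the comment are illustrative assumptions, not part of Heritrix.

```shell
# Run one command with NPTL disabled by placing LD_ASSUME_KERNEL in that
# command's environment only (hypothetical helper name).
run_with_linuxthreads() {
    LD_ASSUME_KERNEL=2.4.1 "$@"
}

# Example (launcher path is an assumption; adjust to your install):
#   run_with_linuxthreads "$HERITRIX_HOME/bin/heritrix"
```

Only the launched process and its children see the variable, so other programs started from the same shell keep the default threading model.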

NPTL is usually the default threading model on Linux and is usually what you want -- threads are more lightweight and java throughput seems to be slightly higher with NPTL enabled. There are several ways to see which threading model you are using. Do an ldd on the java executable to see what shared libraries it's using. Note the location of the glibc shared library. Executing PATH_TO_GLIBC/libc.so.6, usually /lib/libc.so.6, will list details on glibc. Look in the listing for either 'nptl' or 'linuxthreads'. On debian systems, libc.so.6 is not executable but you can make it so. You can also do the following to determine library versions and which threading you are using: % getconf GNU_LIBC_VERSION and % getconf GNU_LIBPTHREAD_VERSION.
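The getconf check above can be wrapped in a small helper. This is a sketch only: the function name threading_model is made up for illustration, and it assumes getconf GNU_LIBPTHREAD_VERSION reports a string beginning with either 'NPTL' or 'linuxthreads', as glibc does.

```shell
# Classify a GNU_LIBPTHREAD_VERSION string (hypothetical helper name).
# Anything that matches neither prefix is reported as "unknown".
threading_model() {
    case "$1" in
        NPTL*|nptl*)   echo "nptl" ;;
        linuxthreads*) echo "linuxthreads" ;;
        *)             echo "unknown" ;;
    esac
}

# Example:
#   threading_model "$(getconf GNU_LIBPTHREAD_VERSION 2>/dev/null)"
```

On a system without glibc, getconf prints nothing for this key and the helper reports unknown.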

See [ 1086554 ] glibc 2.3.2 NPTL hang (Was bdbfrontier stall in...) for more on the issue.

When connecting to a secure server, if the server wants to switch from SSL V2 to SSL V3 when the client is using a SUN JVM, the connection fails. See issue 1093962 for more.

13.1.3. Using old jobs or profiles with 1.4

You'll need to make one change to get your old order.xml files and profiles to run with Heritrix 1.4.x. Below is a diff that shows the change that needs to be made (the type of the path changed from string to stringList):

+++ order.xml   2005-02-01 13:12:34.000000000 -0800
@@ -162,7 +162,9 @@
         <string name="prefix">BT</string>
         <string name="suffix"></string>
         <integer name="max-size-bytes">100000000</integer>
-        <string name="path">arcs</string>
+        <stringList name="path">
+          <string>arcs</string>
+        </stringList>
         <integer name="pool-max-active">5</integer>
         <integer name="pool-max-wait">300000</integer>
       </newObject>

Sometimes you'll get a ConcurrentModificationException when you go to view or refresh the Frontier's report page. The workaround is to retry. The page should eventually come up.

13.1.5. New ARC file suffix

Pre-release 1.2.0, ARC files being written to by the crawler were differentiated by an '.open' suffix. When the crawler finished writing, the suffix was removed. A new suffix has been introduced -- '.invalid' -- which the crawler uses to mark ARC files it considers suspect, usually because an IOException was thrown while writing an ARC Record. Such ARCs need to be checked for validity. Run % gzip -t and % ARCReader --strict against all files with an '.invalid' suffix -- and any unclosed '.open' files present after a crawl has ended -- to check for corruption.
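The gzip half of that check can be scripted along the following lines. This is a sketch under assumptions: the function name check_arcs is invented for illustration, the ARCs are assumed to be gzip-compressed, and the ARCReader --strict pass is left as a manual follow-up because its invocation depends on your install.

```shell
# Test every '.invalid' and '.open' file in a directory with gzip -t
# (hypothetical helper name). Prints one line per file and returns
# non-zero if any file fails the gzip integrity test.
check_arcs() {
    dir=$1
    status=0
    for f in "$dir"/*.invalid "$dir"/*.open; do
        [ -e "$f" ] || continue          # skip patterns that matched nothing
        if gzip -t < "$f" 2>/dev/null; then
            echo "OK      $f"
        else
            echo "CORRUPT $f"
            status=1
        fi
    done
    return $status
}

# Example (directory name is an assumption):
#   check_arcs arcs
```

Files that pass here should still be run through ARCReader --strict, since gzip -t only verifies the compression layer and not the ARC Record structure.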

13.1.6. DNS lookups fail (-6 in crawl.log)

Issue [1149470] all DNS attempts fail -6 discusses badly-formatted DNS records, returned on the Windows platform, that Heritrix fails to parse. It includes a pointer to a mailing list discussion of failed lookups on non-English Windows and a description of a workaround.

13.1.7. FatalConfigurationException creating new job based on old

Older SUN JVMs -- pre-beta3 versions of the SUN JVM 1.5.0 for instance -- had an issue copying files using nio. Try upgrading your JVM. See [1178102] FCE on creation of new job based on job w/ overrides for more on this.

13.1.8. OutOfMemoryErrors (OOMEs)

Unusual pages -- pages of unorthodox structure, pages that contain thousands upon thousands of links -- will on occasion produce OOMEs.

Memory usage when running multiple jobs in series has improved (see Section 14.1.3, “Running more than one job in series throws OOME”), but starting up a new job after a long-running job can prompt OOMEs. The workaround for now is to restart Heritrix between big jobs.

13.2. Changes

13.2.1. Berkeley DB Based Frontier

The BdbFrontier -- a frontier that keeps its queues of URIs in Berkeley DB Java Edition databases -- has been made the default Frontier. Other core datastructures such as the queue of 'alreadyseen' URIs have also been moved into bdbje databases.

13.2.2. The IP in dns ARC Records

DNS entries in ARCs look like this:

dns:www.archive.org 207.241.238.254 20050310233154 text/dns 58
20050310233154
www.archive.org.        1600    IN      A       207.241.224.241

The above record is for the lookup of www.archive.org.

Previous to 1.4.0, the IP used on the ARC Record metaline -- the first line of an ARC Record entry (207.241.238.254 in the above example) -- was the IP of the host looked up. As of 1.4.0, we write the IP of the DNS server that returned the address looked up. Previous to this, the DNS server's IP was not recorded.
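To make the two IPs concrete, here is a small sketch that pulls both out of a single dns record read on standard input. The function name dns_record_ips and its output labels are illustrative assumptions. It takes the second field of the first line as the metaline IP and the last field of the 'IN A' line as the resolved address.

```shell
# Print the DNS server IP (from the ARC metaline) and the resolved host
# address (from the A record line) of one dns ARC record on stdin.
# Hypothetical helper name and output labels.
dns_record_ips() {
    awk '
        NR == 1                 { print "dns-server " $2 }  # metaline, 2nd field is the IP
        $3 == "IN" && $4 == "A" { print "resolved " $5 }    # A record, 5th field is the address
    '
}
```

Fed the example record above, this reports 207.241.238.254 as the DNS server and 207.241.224.241 as the address of www.archive.org. For ARCs written before 1.4.0, the first value would instead be the IP of the host looked up.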

13.2.3. AdaptiveRevisitFrontier

A new, experimental Frontier with configurable revisiting policy and tools for noticing page change, etc.

13.2.4. DecidingScope and DecidingFilter

A.K.A New Scoping Model

A new, experimental scope and filter that allow the user to pick and choose from an assortment of ready-made decision rules and have each rule applied in an orderable sequence. The last non-PASS decision stands as the aggregate decision for the decide rule sequence.
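The aggregation rule can be illustrated with a toy sketch. It is not Heritrix code: the function name aggregate_decision is invented, the decisions are passed in as plain words (ACCEPT, REJECT, PASS) in rule order, and the result for a sequence in which every rule passes is assumed to be PASS.

```shell
# Return the last non-PASS decision from an ordered list of rule decisions
# (hypothetical helper; all-PASS input yields PASS by assumption).
aggregate_decision() {
    result=PASS
    for decision in "$@"; do
        [ "$decision" = PASS ] || result=$decision
    done
    echo "$result"
}

# Example: a later ACCEPT overrides an earlier REJECT.
#   aggregate_decision REJECT PASS ACCEPT PASS
```

Because later rules override earlier ones, the order of the rules in the sequence matters as much as which rules are chosen.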

13.2.5. Crawl Size Upper Bounds Update

Memory usage has been improved in this release. RAM-based datastructures that previously grew without bound are now disk-backed, kept in Berkeley DB databases. Previously (see Section 17.1.1, “Crawl Size Upper Bounds”), Heritrix was unsuited to broad crawling. Broad crawling, while still experimental, can now run for 5 to 6 days using default memory settings -- a heap of 256m -- before encountering OutOfMemoryErrors (OOMEs), and longer if more heap is assigned. Where 10k hosts was an upper bound on narrow domain- or host-scoped crawls, it should now be possible to do 500k+ hosts using the default heap size.

Long-running crawls that encounter hundreds of thousands of hosts over the life of a crawl, or crawls started with hundreds of thousands of seeds, continue to throw OutOfMemoryErrors because a few RAM-based datastructures that grow without bound are left in Heritrix: the lists of queue names and internal structures inside 3rd party libraries used by Heritrix. We intend to address these last few items in a later release.

13.2.6. IBM JVM Redux

Testing with IBM JVM 1.4.2 (Classic VM (build 1.4.2, J2RE 1.4.2 IBM build cxia32142sr1a-20050209 (JIT enabled: jitc))) using Heritrix 1.4.0, the SSL problem described in Section 14.1.1, “IBM JVM” is no longer present (All of our crawling of the last couple of months has been done on the latest SUN 1.5.0 JVMs).

Table 6. Changes

ID | Type | Summary | Open Date | By | Filer
958061 | Add | [Post 1.0] New scoping model | 2004-05-21 | gojomo | gojomo
1165205 | Add | Add links to issue tracking/RFE to Heritrix' webapp | 2005-03-17 | nobody | ck-heritrix
1119580 | Add | Integrate revisiting frontier | 2005-02-09 | kristinn_sig | stack-sf
1093609 | Add | One-click recover | 2004-12-30 | gojomo | gojomo
1078008 | Add | Enable crawl-end at target compressed-ARC-data size | 2004-12-02 | stack-sf | gojomo
934577 | Add | Need 'delete profile' option (like delete job) | 2004-04-13 | kristinn_sig | gojomo
1058302 | Add | A 'dat' maker; A script to dump links | 2004-11-01 | stack-sf | stack-sf
1114133 | Add | Add referer header | 2005-02-01 | stack-sf | stack-sf
1143892 | Add | [contribution] SingleConnectionManager, range and close hdrs | 2005-02-18 | stack-sf | stack-sf
1055766 | Add | Dates in logs are unreadable. | 2004-10-27 | gojomo | stack-sf
1111656 | Add | Extractors should not extract if links already extracted | 2005-01-28 | stack-sf | stack-sf
1047437 | Add | Pause and alert on low-disk conditions | 2004-10-14 | gojomo | gojomo
1104916 | Add | Add info to candidateURI before scheduling | 2005-01-18 | stack-sf | stack-sf
953994 | Add | Change arc download dir mid-crawl | 2004-05-14 | stack-sf | stack-sf
894467 | Add | Stopping, pausing, checkpointing from command line/scripts | 2004-02-10 | stack-sf | stack-sf
1096737 | Add | [jmx] client pword and always start jmx server | 2005-01-05 | nobody | stack-sf
1090663 | Add | Move BDB to core of Heritrix | 2004-12-23 | stack-sf | stack-sf
1092769 | Add | [ARCReader] If garbage on end of record, report and skip it | 2004-12-29 | stack-sf | stack-sf
1078016 | Add | 'Economic' frontier which defers low-value URIs | 2004-12-02 | gojomo | gojomo
1002704 | Add | Evaluate Berkeley DB Frontier | 2004-08-03 | gojomo | stack-sf
1083315 | Add | Update commons-pool, commons-collections, itext jars | 2004-12-10 | nobody | stack-sf
988276 | Add | ARC writer pool config. to write multiple disks | 2004-07-09 | stack-sf | stack-sf
1078714 | Add | Command-line insertion of URLs | 2004-12-03 | stack-sf | stack-sf
1069105 | Add | Make auto seed add on redirect optional (if happens at all) | 2004-11-18 | gojomo | gojomo
1002707 | Add | Fix heritrix shutdown (From Luca) | 2004-08-03 | nobody | stack-sf
1065736 | Add | Recovery should optionally retain failures ('Ff') | 2004-11-13 | gojomo | gojomo
1057064 | Add | HTTPRecorder's default buffer sizes should be configurable | 2004-10-29 | gojomo | gojomo
1045817 | Add | Untangle heritrix from jetty | 2004-10-12 | stack-sf | stack-sf
1036720 | Fix | NPE in ArcWriterProcessor.writeDns() | 2004-09-28 | stack-sf | gojomo
1178927 | Fix | 'submodules' map-edits not working for overrides/refinements | 2005-04-07 | gojomo | gojomo
1179530 | Fix | NPE in FastBufferedOutputStream.close | 2005-04-08 | nobody | stack-sf
1184102 | Fix | Frontier queues total still goes minus | 2005-04-15 | nobody | stack-sf
1179527 | Fix | ARCWriter AsynchronousCloseException | 2005-04-08 | nobody | stack-sf
1096855 | Fix | CME adding filters while crawling | 2005-01-05 | nobody | stack-sf
1080378 | Fix | job config: settings 'remove'-component-then-submit lost job | 2004-12-06 | nobody | gojomo
1176788 | Fix | hosts-report.txt is empty | 2005-04-04 | stack-sf | danavery
1172183 | Fix | Delete URIs from frontier broken (CachedBdbBigMap.values()?) | 2005-03-28 | gojomo | gojomo
1178102 | Fix | FCE on creation of new job based on job w/ overrides | 2005-04-06 | nobody | stack-sf
1178103 | Fix | hung bdb (12115 redux) | 2005-04-06 | nobody | stack-sf
1169459 | Fix | CachedBdbBigMap double-close in finialize() | 2005-03-23 | gojomo | gojomo
1177462 | Fix | RIS#readFullyOrUntil IOE/timeout | 2005-04-05 | nobody | stack-sf
1149470 | Fix | all DNS attempts fail -6 | 2005-02-22 | nobody | jsleeman
1156363 | Fix | Flash SWF Extractor Unexpected end of input | 2005-03-03 | nobody | stack-sf
1170562 | Fix | npe in extractorjs doing broad crawl w/ HEAD | 2005-03-25 | nobody | stack-sf
1121567 | Fix | Heritrix 1.3.0 crashes hard (JVM SIGSEV) | 2005-02-12 | nobody | stack-sf
1103015 | Fix | If filter in main scope disabled heritrix aborts imme | 2005-01-15 | nobody | frodobay
1054219 | Fix | Links not extracted from mislabelled (text/plain) MIME type | 2004-10-25 | gojomo | gojomo
1024120 | Fix | Lost crawl job after terminate running job with jobs pending | 2004-09-07 | stack-sf | stack-sf
1078094 | Fix | www-strip canonicalization unintended exclusion of redirect | 2004-12-02 | stack-sf | gojomo
1157085 | Fix | DNS records in ARCs should use DNS server IP | 2005-03-04 | stack-sf | gojomo
1157385 | Fix | Crawler not making progress -- thread deadlock | 2005-03-05 | stack-sf | ia_igor
1158270 | Fix | isMultibyteEncoding: Uncaught UnsupportedOperationException | 2005-03-07 | stack-sf | ck-heritrix
1080925 | Fix | MultiThreadedConnectionManager bottleneck | 2004-12-07 | stack-sf | gojomo
1157372 | Fix | missing space in progress-statistics.log | 2005-03-05 | stack-sf | ia_igor
1153927 | Fix | npe in ExtractorHTML#innerProcess | 2005-02-28 | stack-sf | stack-sf
1155641 | Fix | "Illegal response body offset" in ReplayCharSequenceFactory | 2005-03-02 | stack-sf | gojomo
1154673 | Fix | ensure IPs match from DNS, used in HTTP, logged in ARC | 2005-03-01 | stack-sf | gojomo
1002138 | Fix | swf extractor flash lib prints glyphcount on stdout | 2004-08-02 | nobody | stack-sf
1077924 | Fix | crawl.log timestamps out-of-order | 2004-12-02 | gojomo | gojomo
1066573 | Fix | sometimes job based-on other job uses older job name | 2004-11-15 | gojomo | gojomo
1102755 | Fix | seeds text area truncates seeds; big seed lists break config | 2005-01-14 | gojomo | gojomo
1002164 | Fix | OOM hit very early broad-crawling | 2004-08-02 | stack-sf | gojomo
1006970 | Fix | UI list-ordering inconsistent | 2004-08-10 | gojomo | gojomo
1092937 | Fix | UI/Settings - Expert Toggle loses user data | 2004-12-29 | nobody | nobody
1152358 | Fix | OOM in postselector | 2005-02-26 | nobody | orion2598
1068403 | Fix | ARCWriter gzip deflate hang | 2004-11-17 | nobody | stack-sf
1123906 | Fix | ARCWriter alerts if Content-Type is null | 2005-02-16 | nobody | ck-heritrix
1124029 | Fix | Bad synchronization causes NPE in StatisticsTracker | 2005-02-16 | nobody | ck-heritrix
1055789 | Fix | ARCWriter 'Gap' errors should be more prominent | 2004-10-27 | stack-sf | gojomo
1123859 | Fix | Change in ExtractorHTML triggers NullPointerExceptions | 2005-02-16 | nobody | ck-heritrix
1093073 | Fix | StackOverflowError shouldn't kill crawl | 2004-12-29 | gojomo | gojomo
1068370 | Fix | [Flash] OOMEs on a particular URL | 2004-11-17 | nobody | stack-sf
1108153 | Fix | unwritable ARCs directory barely noticeable | 2005-01-23 | nobody | gojomo
1023929 | Fix | "&amp" converted to "&" in preselector override regex | 2004-09-07 | gojomo | danavery
1083428 | Fix | remove profile function in WUI? | 2004-12-11 | nobody | zhousp
1068384 | Fix | deleting all(?) from queue corrupts frontier, kills crawl | 2004-11-17 | gojomo | gojomo
1106469 | Fix | ExtractorCSS regexp taking 'forever' on small document | 2005-01-20 | gojomo | gojomo
1116204 | Fix | FetchDNS doesn't work (bug in dnsjava) | 2005-02-04 | nobody | nobody
1103838 | Fix | Redirect problem (Stops crawling after 3) | 2005-01-17 | nobody | nobody
1119686 | Fix | oversight in CrawlURI; missing check for null | 2005-02-09 | nobody | frodobay
1060508 | Fix | [uuri] port StringIndexOutOfBoundsExceptionn | 2004-11-04 | stack-sf | stack-sf
1101831 | Fix | NPE in ROS#record | 2005-01-13 | nobody | stack-sf
1114285 | Fix | Old profile/jobs won't work with HEAD (1.4) | 2005-02-01 | nobody | stack-sf
1062621 | Fix | First arc record length is off by one | 2004-11-08 | stack-sf | stack-sf
1117916 | Fix | PDFParser URL extraction bug | 2005-02-07 | nobody | benlitchfield
1113977 | Fix | User Agent is tolowercased | 2005-02-01 | nobody | nobody
1113470 | Fix | Exception in Modules Tab | 2005-01-31 | nobody | nobody
1109521 | Fix | Hung Thread in StatisticsTracker | 2005-01-25 | stack-sf | ia_igor
1107304 | Fix | Failed create new job based on job with absolute settings | 2005-01-22 | nobody | frodobay
1000865 | Fix | Long random pauses where no progress is made | 2004-07-30 | nobody | gojomo
1095952 | Fix | InvalidJobFileException: Status .. 'RUNNING' | 2005-01-04 | nobody | stack-sf
1095453 | Fix | heritrix wont start with fedora core 3 | 2005-01-03 | nobody | nobody
1092135 | Fix | crawl.log hashes wrong for captures > 64K | 2004-12-28 | gojomo | gojomo
1103133 | Fix | deadlock in ip-politeness requeueing | 2005-01-15 | gojomo | gojomo
1102771 | Fix | SURTs-from-seeds may lack trailing comma | 2005-01-14 | gojomo | gojomo
1101396 | Fix | JS extr. does not parse spec. links starting w/ ./ or ../ | 2005-01-12 | nobody | ia_igor
1100658 | Fix | update to [ 1100467 ] maven 1.0.2 build problem | 2005-01-11 | stack-sf | nobody
1101138 | Fix | Update ant and httpclient jars | 2005-01-12 | stack-sf | stack-sf
1098217 | Fix | ReplayCharSequence.toString() is broken | 2005-01-07 | nobody | stack-sf
1093627 | Fix | [robots] robots.txt midfetch aborted gives open access | 2004-12-30 | nobody | stack-sf
1093614 | Fix | midfetch abort doesn't | 2004-12-30 | nobody | stack-sf
1082358 | Fix | [uuri] String index out of range: 0 | 2004-12-09 | stack-sf | stack-sf
1086554 | Fix | glibc 2.3.2 NPTL hang (Was bdbfrontier stall in...) | 2004-12-16 | stack-sf | stack-sf
1072035 | Fix | [uuri] Underscore in host messes up port parsing | 2004-11-23 | stack-sf | stack-sf
1043251 | Fix | better/longer dns retries on lookup failure | 2004-10-08 | gojomo | stack-sf
1090911 | Fix | NPE in ServerCache | 2004-12-24 | stack-sf | stack-sf
1080926 | Fix | reducing max-toe-threads has no effect | 2004-12-07 | gojomo | gojomo
1088788 | Fix | NPE in TextUtils.freeMatcher() | 2004-12-20 | stack-sf | gojomo
1082570 | Fix | heritrix.log ignored | 2004-12-09 | nobody | stack-sf
1078503 | Fix | Edit configuration in UI gives NPE | 2004-12-03 | nobody | stack-sf
1055592 | Fix | terminated crawl still hogging memory, causing OOM | 2004-10-27 | nobody | gojomo
1081770 | Fix | quick-override accepts domain w/spaces, lost checkboxes | 2004-12-08 | gojomo | gojomo
1080827 | Fix | Browser hangs when hundreds of seeds | 2004-12-07 | nobody | stack-sf
1047396 | Fix | OOM in BdbFrontier/nio.Bits -- with plenty of heap left | 2004-10-14 | nobody | gojomo
1078581 | Fix | DomainSensitiveFrontier never finishes | 2004-12-03 | nobody | stack-sf
1076251 | Fix | Upgrade bdbje 1.7.0 (WAS: Checkpointer thread ...) | 2004-11-30 | nobody | stack-sf
1072192 | Fix | bdbfrontier No locks available | 2004-11-23 | nobody | stack-sf
1031499 | Fix | Deleted pending jobs show as pending in restart. | 2004-09-20 | stack-sf | stack-sf