13. Release 1.4.0 - 2005-04-28


Much improved memory usage, new scoping/filter model, and a new revisiting frontier. Over 90 bugs fixed.

13.1. Known Limitations/Issues

13.1.1. Glibc 2.3.2 and NPTL

NPTL is the 'new' Linux threading model. It replaces linuxthreads, the 'old' model. You can tell you're running NPTL if your java process shows as one process only in the process listing. With linuxthreads, all Java threads show as distinct Linux processes. Linux threading is integral to glibc.

On rare occasions we've seen the crawler hang without obvious explanation when running with NPTL threading on Linux. In a thread dump of one such hung crawler, threads are waiting to obtain a lock that no one apparently holds. Our reading is that these rare, crawl-killing hangs are a problem in glibc 2.3.2 when running with NPTL (NPTL 0.60). (We used to hang frequently, but workarounds seem to have made lockups extremely rare.) An upgrade to glibc 2.3.3+ seems to do away with these hangs. Glibc 2.3.3 has NPTL 0.61. Fedora 3 has glibc 2.3.4. If an upgrade is not possible -- for example, the new glibc is not currently available for Debian -- you can disable NPTL and run with the old threads by setting the environment variable LD_ASSUME_KERNEL=2.4.1 (you can set this environment variable on a per-process basis).

NPTL is usually the default threading model on Linux and is usually what you want -- threads are more lightweight and Java throughput seems to be slightly higher with NPTL enabled. There are several ways to see which threading model you are using. Do an ldd on the java executable to see what shared libraries it's using. Note the location of the glibc shared library. Executing PATH_TO_GLIBC/libc.so.6, usually /lib/libc.so.6, will list details on glibc. Look in the listing for either 'nptl' or 'linuxthreads'. On Debian systems, libc.so.6 is not executable but you can make it so. You can also do the following to determine library versions and which threading you are using: % getconf GNU_LIBC_VERSION and % getconf GNU_LIBPTHREAD_VERSION.
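
The getconf check can be wrapped in a small script. This is a minimal sketch, assuming a POSIX shell; threading_model is a hypothetical helper name and not part of Heritrix. It classifies the string that getconf reports and prints 'unknown' when nothing recognizable comes back.

```shell
#!/bin/sh
# Classify a GNU_LIBPTHREAD_VERSION string as nptl, linuxthreads, or unknown.
# threading_model is a hypothetical helper name, not a Heritrix tool.
threading_model() {
    case "$1" in
        NPTL*|nptl*)   echo nptl ;;
        linuxthreads*) echo linuxthreads ;;
        *)             echo unknown ;;
    esac
}

# Ask getconf for the C library and threading library versions.
# On systems where getconf does not know these variables, the
# values stay empty and the model is reported as unknown.
libc=$(getconf GNU_LIBC_VERSION 2>/dev/null || true)
pthread=$(getconf GNU_LIBPTHREAD_VERSION 2>/dev/null || true)

echo "libc:    ${libc:-not reported}"
echo "threads: ${pthread:-not reported} ($(threading_model "$pthread"))"
```

If the last line reports nptl together with glibc 2.3.2, the machine matches the configuration described above.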

See [ 1086554 ] glibc 2.3.2 NPTL hang (Was bdbfrontier stall in...) for more on the issue.

When connecting to a secure server, if the server wants to switch from SSL V2 to SSL V3 when the client is using a SUN JVM, the connection fails. See issue 1093962 for more.

13.1.3. Using old jobs or profiles with 1.4

You'll need to make one change to get your old order.xml files and profiles to run with Heritrix 1.4.x. Below is a diff that shows the change that needs to be made (the type of the path changed from string to stringList):

+++ order.xml   2005-02-01 13:12:34.000000000 -0800
@@ -162,7 +162,9 @@
         <string name="prefix">BT</string>
         <string name="suffix"></string>
         <integer name="max-size-bytes">100000000</integer>
-        <string name="path">arcs</string>
+        <stringList name="path">
+          <string>arcs</string>
+        </stringList>
         <integer name="pool-max-active">5</integer>
         <integer name="pool-max-wait">300000</integer>
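
To locate jobs and profiles that still use the old form, something like the following may help. It is a rough sketch, assuming a POSIX shell with find and grep and assuming the files are named order.xml; find_old_orders is a hypothetical helper name. The pattern only looks for a string-typed setting named path, so it may also flag unrelated settings of the same name.

```shell
#!/bin/sh
# Print every order.xml below the given directory that still contains
# a string-typed "path" setting (the pre-1.4 form shown in the diff).
# find_old_orders is a hypothetical helper name, not a Heritrix tool.
find_old_orders() {
    find "$1" -type f -name 'order.xml' \
        -exec grep -l '<string name="path">' {} + | sort
}

# Usage (directory is an example): find_old_orders /path/to/jobs
```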

Sometimes you'll get a ConcurrentModificationException when you go to view or refresh the Frontier's report page. The workaround is to retry. The page should eventually come up.

13.1.5. New ARC file suffix

Pre-release 1.2.0, ARC files that the crawler was still writing to were differentiated by an '.open' suffix. When the crawler finished writing, the suffix was removed. A new suffix has been introduced -- '.invalid' -- which the crawler uses to mark ARC files it thinks suspect, usually because an IOException was thrown while writing an ARC Record. Such ARCs need to be checked for validity. Run % gzip -t and % ARCReader --strict against all files with an '.invalid' suffix -- and any unclosed '.open' files present after a crawl has ended -- to check for corruption.
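
The gzip half of that check can be scripted. The sketch below assumes a POSIX shell with find and gzip; suspect_arcs and check_arcs are hypothetical helper names. It covers only the gzip -t step, since how ARCReader is launched depends on your installation, and it reads each file on standard input so that the unusual suffix does not matter to gzip.

```shell
#!/bin/sh
# List ARC files that need checking: '.invalid' files and any
# leftover '.open' files below the given directory.
# suspect_arcs and check_arcs are hypothetical helper names.
suspect_arcs() {
    find "$1" -type f \( -name '*.invalid' -o -name '*.open' \) | sort
}

# Run 'gzip -t' over each suspect file and report one line per file.
check_arcs() {
    suspect_arcs "$1" | while IFS= read -r f; do
        if gzip -t < "$f" 2>/dev/null; then
            echo "gzip ok:  $f"
        else
            echo "gzip bad: $f"
        fi
    done
}

# Usage (directory is an example): check_arcs /path/to/arcs
```

A file that passes gzip -t can still hold a malformed ARC Record, so files reported as ok should still go through ARCReader --strict.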

13.1.6. DNS lookups fail (-6 in crawl.log)

[1149470] all DNS attempts fail -6 discusses badly formatted DNS records returned on the Windows platform that Heritrix fails to parse. It includes a pointer to a mailing list discussion of failed lookups on non-English Windows and a description of a workaround.

13.1.7. FatalConfigurationException creating new job based on old

Older SUN JVMs -- pre-beta3 versions of the SUN JVM 1.5.0 for instance -- had an issue copying files using nio. Try upgrading your JVM. See [1178102] FCE on creation of new job based on job w/ overrides for more on this.

13.1.8. OutOfMemoryErrors (OOMEs)

Unusual pages -- pages of unorthodox structure, pages that contain thousands upon thousands of links -- will on occasion produce OOMEs.

Memory usage when running multiple jobs in series has improved (see Section 14.1.3, “Running more than one job in series throws OOME”), but starting up a new job after a long-running job can prompt OOMEs. The workaround for now is to restart Heritrix between big jobs.

13.2. Changes

13.2.1. Berkeley DB Based Frontier

The BdbFrontier -- a frontier that keeps its queues of URIs in Berkeley DB Java Edition databases -- has been made the default Frontier. Other core data structures such as the queue of 'alreadyseen' URIs have also been moved into bdbje databases.

13.2.2. The IP in dns ARC Records

DNS entries in ARCs look like this:

dns:www.archive.org 20050310233154 text/dns 58 20050310233154
www.archive.org.        1600    IN      A
The above record is for the lookup of www.archive.org.

Before 1.4.0, the IP used on the ARC Record metaline (the first line of an ARC Record entry) was the IP of the host looked up. As of 1.4.0, we write the IP of the DNS server that returned the address looked up. Before this change, the DNS server IP was not recorded.

13.2.3. AdaptiveRevisitFrontier

A new, experimental Frontier with configurable revisiting policy and tools for noticing page change, etc.

13.2.4. DecidingScope and DecidingFilter

A.K.A New Scoping Model

A new, experimental scope and filter that allow the user to pick and choose from an assortment of ready-made decision rules and have each rule applied in an orderable sequence. The last non-PASS decision stands as the aggregate decision for the decide rule sequence.
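
The aggregation rule can be sketched in a few lines. This is an illustration only, not Heritrix's implementation: aggregate_decision is a hypothetical helper, and the decision names ACCEPT and REJECT are assumed here alongside the PASS named above.

```shell
#!/bin/sh
# Fold a sequence of rule decisions into one aggregate decision.
# PASS leaves the running decision unchanged, so the last
# non-PASS decision in the sequence is the one that stands.
# aggregate_decision is a hypothetical helper name.
aggregate_decision() {
    result=PASS
    for decision in "$@"; do
        case "$decision" in
            ACCEPT|REJECT) result=$decision ;;
            PASS) ;;
            *) echo "unknown decision: $decision" >&2; return 1 ;;
        esac
    done
    echo "$result"
}

# Usage: one argument per rule, in the order the rules are applied.
aggregate_decision REJECT PASS ACCEPT PASS   # prints ACCEPT
```

Because a later rule can override an earlier one, the order of the rules matters: the same decisions in a different order can give a different aggregate.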

13.2.5. Crawl Size Upper Bounds Update

Memory usage has been improved in this release. RAM-based data structures that previously grew without bound are now disk-backed, kept in Berkeley DB databases. Previously (see Section 17.1.1, “Crawl Size Upper Bounds”), Heritrix was unsuited for broad crawling. Broad crawling is still experimental, but using default memory settings -- a heap of 256m -- broad crawls of 5 to 6 days before encountering OutOfMemoryErrors (OOMEs) are now possible; longer if more heap is assigned. Where 10k hosts was an upper bound on narrow domain- or host-scoped crawls, it should now be possible to do 500k+ hosts using the default heap size.

Long-running crawls that encounter hundreds of thousands of hosts over the life of a crawl, or crawls started with hundreds of thousands of seeds, continue to throw OutOfMemoryErrors because a few RAM-based data structures that grow without bound are still left in Heritrix: the lists of queue names and internal structures inside 3rd party libraries used by Heritrix. We intend to address these last few items in a later release.

13.2.6. IBM JVM Redux

In testing Heritrix 1.4.0 with IBM JVM 1.4.2 (Classic VM (build 1.4.2, J2RE 1.4.2 IBM build cxia32142sr1a-20050209 (JIT enabled: jitc))), the SSL problem described in Section 14.1.1, “IBM JVM” is no longer present. (All of our crawling of the last couple of months has been done on the latest SUN 1.5.0 JVMs.)

