Abstract
Much improved memory usage, new scoping/filter model, and a new revisiting frontier. Over 90 bugs fixed.
NPTL is the 'new' Linux threading model. It replaces linuxthreads, the 'old' model. You can tell you're running NPTL if your Java process shows as a single process in the process listing. With linuxthreads, all Java threads show as distinct Linux processes. Linux threading is integral to glibc.
On rare occasions we've seen the crawler hang without obvious explanation when running with NPTL threading on Linux. A thread dump of one such hung crawler shows threads waiting to obtain a lock that no one apparently holds. Our reading is that these rare, crawl-killing hangs are a problem in glibc 2.3.2 when running with NPTL (NPTL 0.60). (We used to hang frequently, but workarounds seem to have made lockups extremely rare.) An upgrade to glibc 2.3.3+ seems to do away with these hangs. glibc 2.3.3 has NPTL 0.61. Fedora 3 has glibc 2.3.4. If an upgrade is not possible -- for example, the new glibc is not currently available for Debian -- you can disable NPTL and run with old threads by setting the environment variable LD_ASSUME_KERNEL=2.4.1 (you can set this environment variable on a per-process basis).
NPTL is usually the default threading model on Linux and is usually what you want -- threads are more lightweight and Java throughput seems to be slightly higher with NPTL enabled. There are various ways to see which threading model you are using. Run ldd on the java executable to see which shared libraries it is using. Note the location of the glibc shared library. Executing PATH_TO_GLIBC/libc.so.6, usually /lib/libc.so.6, will list details on glibc. Look in the listing for either 'nptl' or 'linuxthreads'. On Debian systems, libc.so.6 is not executable but you can make it so. You can also run the following to determine library versions and which threading you are using: % getconf GNU_LIBC_VERSION and % getconf GNU_LIBPTHREAD_VERSION.
See [ 1086554 ] glibc 2.3.2 NPTL hang (Was bdbfrontier stall in...) for more on the issue.
When connecting to a secure server, the connection fails if the server wants to switch from SSL V2 to SSL V3 and the client is using a SUN JVM. See issue 1093962 for more.
You'll need to make one change for your old order.xml files and profiles to run with Heritrix 1.4.x. Below is a diff that shows the change that needs to be made (the type of path changed from string to stringList):
    +++ order.xml 2005-02-01 13:12:34.000000000 -0800
    @@ -162,7 +162,9 @@
         <string name="prefix">BT</string>
         <string name="suffix"></string>
         <integer name="max-size-bytes">100000000</integer>
    -    <string name="path">arcs</string>
    +    <stringList name="path">
    +      <string>arcs</string>
    +    </stringList>
         <integer name="pool-max-active">5</integer>
         <integer name="pool-max-wait">300000</integer>
       </newObject>
Sometimes you'll get a ConcurrentModificationException when you go to view or refresh the Frontier's report page. The workaround is to retry; the page should eventually come up.
Pre-release 1.2.0, ARC files that are currently open and being written to by the crawler were differentiated by an '.open' suffix. When the crawler finished writing, the suffix was removed. A new suffix has been introduced -- '.invalid' -- which the crawler uses to mark ARC files it thinks suspect, usually because an IOException was thrown during the writing of an ARC Record. Such ARCs need to be checked for validity. Run % gzip -t and % ARCReader --strict against all files with an '.invalid' suffix -- and any unclosed '.open' files present after a crawl has ended -- to check for corruption.
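As a rough illustration of the first of those checks, the sketch below picks out file names carrying the '.invalid' or '.open' suffix and reads a gzip stream to its end, treating any IOException as a sign of corruption. It is only an approximation of what gzip -t does and is no substitute for running gzip -t and ARCReader --strict. The class and method names (SuspectArcSketch, needsCheck, gzipReadsCleanly, gzip) are hypothetical and not part of Heritrix.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Heritrix.
public class SuspectArcSketch {

    /** True if the file name carries one of the suffixes that call for a check. */
    static boolean needsCheck(String fileName) {
        return fileName.endsWith(".invalid") || fileName.endsWith(".open");
    }

    /** Reads the gzip stream to its end; any IOException is reported as failure. */
    static boolean gzipReadsCleanly(InputStream raw) {
        byte[] buffer = new byte[8192];
        try (GZIPInputStream in = new GZIPInputStream(raw)) {
            while (in.read(buffer) != -1) {
                // Discard the data; we only care whether decompression succeeds.
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Compresses data in memory so the demo needs no files on disk. */
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(data);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] good = gzip("an example payload standing in for ARC content"
                .getBytes(StandardCharsets.US_ASCII));
        // Simulate a write cut short by an IOException.
        byte[] truncated = Arrays.copyOf(good, good.length / 2);

        System.out.println("complete stream ok:  "
                + gzipReadsCleanly(new ByteArrayInputStream(good)));
        System.out.println("truncated stream ok: "
                + gzipReadsCleanly(new ByteArrayInputStream(truncated)));
    }
}
```

In practice the stream would come from a FileInputStream over each file for which needsCheck returns true.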
[1149470] all DNS attempts fail -6 discusses badly formatted DNS records returned on the Windows platform that Heritrix fails to parse, and includes a pointer to a mailing list discussion of failed lookups on non-English Windows. The issue also describes a workaround.
Older SUN JVMs -- pre-beta3 versions of the SUN JVM 1.5.0, for instance -- had an issue copying files using nio. Try upgrading your JVM. See [1178102] FCE on creation of new job based on job w/ overrides for more on this.
Unusual pages -- pages of unorthodox structure, pages that contain thousands upon thousands of links -- will on occasion produce OOMEs.
Memory usage when running multiple jobs in series has improved (see Section 14.1.3, “Running more than one job in series throws OOME”), but starting up a new job after a long-running job can still prompt OOMEs. The workaround for now is to restart Heritrix between big jobs.
The BdbFrontier -- a frontier that keeps its queues of URIs in Berkeley DB Java Edition databases -- has been made the default Frontier. Other core datastructures such as the queue of 'alreadyseen' URIs have also been moved into bdbje databases.
DNS entries in ARCs look like this:
    dns:www.archive.org 207.241.238.254 20050310233154 text/dns 58
    20050310233154
    www.archive.org. 1600 IN A 207.241.224.241
The above record is for the lookup of www.archive.org.
Prior to 1.4.0, the IP on the ARC Record metaline -- the first line of an ARC Record entry (207.241.238.254 in the above example) -- was the IP of the host looked up. As of 1.4.0, we write the IP of the DNS server that returned the address looked up. Previously the DNS server IP was not recorded.
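To make clear which field is affected, here is a minimal sketch that splits the example metaline into its five space-separated fields (URL, IP, timestamp, content type, length) and reports the IP. The class and field names are hypothetical and not part of Heritrix, and the parser assumes the five-field layout seen in the example above.

```java
// Hypothetical holder for the fields of an ARC Record metaline; not a Heritrix class.
public class ArcMetalineSketch {
    final String url;
    final String ip;
    final String timestamp;
    final String mimeType;
    final long length;

    ArcMetalineSketch(String url, String ip, String timestamp,
            String mimeType, long length) {
        this.url = url;
        this.ip = ip;
        this.timestamp = timestamp;
        this.mimeType = mimeType;
        this.length = length;
    }

    /** Splits a metaline on single spaces; assumes exactly five fields. */
    static ArcMetalineSketch parse(String metaline) {
        String[] fields = metaline.trim().split(" ");
        if (fields.length != 5) {
            throw new IllegalArgumentException(
                    "Expected 5 fields but found " + fields.length + ": " + metaline);
        }
        return new ArcMetalineSketch(fields[0], fields[1], fields[2],
                fields[3], Long.parseLong(fields[4]));
    }

    /** DNS records are recognizable by the dns: scheme of their URL field. */
    boolean isDnsRecord() {
        return url.startsWith("dns:");
    }

    public static void main(String[] args) {
        ArcMetalineSketch m = parse(
                "dns:www.archive.org 207.241.238.254 20050310233154 text/dns 58");
        if (m.isDnsRecord()) {
            // As of 1.4.0 this is the DNS server's IP, not the looked-up host's.
            System.out.println("IP on metaline: " + m.ip);
        }
    }
}
```

Code that reads DNS records from ARCs written by both older and newer crawlers would need to interpret this field according to the Heritrix version that wrote the file.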
A new, experimental Frontier with configurable revisiting policy and tools for noticing page change, etc.
A new, experimental scope and filter that allow the user to pick and choose from an assortment of ready-made decision rules and have each rule applied in an orderable sequence. The last non-PASS decision stands as the aggregate decision for the decide rule sequence.
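The "last non-PASS decision stands" behaviour can be sketched as follows. This is an illustration only, not the Heritrix implementation: the names (DecideSequenceSketch, Verdict, decide, exampleRules) are hypothetical, rules are modelled as plain functions from a URI string to a verdict, and the sketch assumes the aggregate is PASS when no rule decides.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of a decide rule sequence; not Heritrix code.
public class DecideSequenceSketch {

    enum Verdict { ACCEPT, REJECT, PASS }

    /** Applies each rule in order; the last non-PASS verdict is the result. */
    static Verdict decide(List<Function<String, Verdict>> rules, String uri) {
        Verdict aggregate = Verdict.PASS; // assumption: PASS if no rule decides
        for (Function<String, Verdict> rule : rules) {
            Verdict verdict = rule.apply(uri);
            if (verdict != Verdict.PASS) {
                aggregate = verdict;
            }
        }
        return aggregate;
    }

    /** An example sequence: reject all, accept one host, then reject .exe files. */
    static List<Function<String, Verdict>> exampleRules() {
        return Arrays.asList(
                uri -> Verdict.REJECT,
                uri -> uri.contains("archive.org") ? Verdict.ACCEPT : Verdict.PASS,
                uri -> uri.endsWith(".exe") ? Verdict.REJECT : Verdict.PASS);
    }

    public static void main(String[] args) {
        List<Function<String, Verdict>> rules = exampleRules();
        System.out.println(decide(rules, "http://www.archive.org/index.html"));
        System.out.println(decide(rules, "http://www.archive.org/setup.exe"));
        System.out.println(decide(rules, "http://example.com/"));
    }
}
```

Because later rules override earlier ones, the order of the sequence matters: moving the reject-all rule to the end of this example would reject every URI.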
Memory usage has been improved in this release. RAM-based datastructures that previously grew without bound are now disk-backed, kept in Berkeley DB databases. Previously Heritrix was unsuited to broad crawling (see Section 17.1.1, “Crawl Size Upper Bounds”). While still experimental, broad crawls of 5 to 6 days before encountering OutOfMemoryErrors (OOMEs) are now possible using default memory settings -- a heap of 256m -- and longer if more heap is assigned. Where 10k hosts was an upper bound on narrow domain- or host-scoped crawls, it should now be possible to do 500k+ hosts using the default heap size.
Long-running crawls that encounter hundreds of thousands of hosts over the life of a crawl, or crawls started with hundreds of thousands of seeds, continue to throw OutOfMemoryErrors because a few RAM-based datastructures that grow without bound are still left in Heritrix: the lists of queue names and internal structures inside 3rd party libraries used by Heritrix. We intend to address these last few items in a later release.
Testing with IBM JVM 1.4.2 (Classic VM (build 1.4.2, J2RE 1.4.2 IBM build cxia32142sr1a-20050209 (JIT enabled: jitc))) using Heritrix 1.4.0, the SSL problem described in Section 14.1.1, “IBM JVM” is no longer present. (All of our crawling of the last couple of months has been done on the latest SUN 1.5.0 JVMs.)
Table 6. Changes
ID | Type | Summary | Open Date | By | Filer
---|---|---|---|---|---
958061 | Add | [Post 1.0] New scoping model | 2004-05-21 | gojomo | gojomo |
1165205 | Add | Add links to issue tracking/RFE to Heritrix' webapp | 2005-03-17 | nobody | ck-heritrix |
1119580 | Add | Integrate revisiting frontier | 2005-02-09 | kristinn_sig | stack-sf |
1093609 | Add | One-click recover | 2004-12-30 | gojomo | gojomo |
1078008 | Add | Enable crawl-end at target compressed-ARC-data size | 2004-12-02 | stack-sf | gojomo |
934577 | Add | Need 'delete profile' option (like delete job) | 2004-04-13 | kristinn_sig | gojomo |
1058302 | Add | A 'dat' maker; A script to dump links | 2004-11-01 | stack-sf | stack-sf |
1114133 | Add | Add referer header | 2005-02-01 | stack-sf | stack-sf |
1143892 | Add | [contribution] SingleConnectionManager, range and close hdrs | 2005-02-18 | stack-sf | stack-sf |
1055766 | Add | Dates in logs are unreadable. | 2004-10-27 | gojomo | stack-sf |
1111656 | Add | Extractors should not extract if links already extracted | 2005-01-28 | stack-sf | stack-sf |
1047437 | Add | Pause and alert on low-disk conditions | 2004-10-14 | gojomo | gojomo |
1104916 | Add | Add info to candidateURI before scheduling | 2005-01-18 | stack-sf | stack-sf |
953994 | Add | Change arc download dir mid-crawl | 2004-05-14 | stack-sf | stack-sf |
894467 | Add | Stopping, pausing, checkpointing from command line/scripts | 2004-02-10 | stack-sf | stack-sf |
1096737 | Add | [jmx] client pword and always start jmx server | 2005-01-05 | nobody | stack-sf |
1090663 | Add | Move BDB to core of Heritrix | 2004-12-23 | stack-sf | stack-sf |
1092769 | Add | [ARCReader] If garbage on end of record, report and skip it | 2004-12-29 | stack-sf | stack-sf |
1078016 | Add | 'Economic' frontier which defers low-value URIs | 2004-12-02 | gojomo | gojomo |
1002704 | Add | Evaluate Berkeley DB Frontier | 2004-08-03 | gojomo | stack-sf |
1083315 | Add | Update commons-pool, commons-collections, itext jars | 2004-12-10 | nobody | stack-sf |
988276 | Add | ARC writer pool config. to write multiple disks | 2004-07-09 | stack-sf | stack-sf |
1078714 | Add | Command-line insertion of URLs | 2004-12-03 | stack-sf | stack-sf |
1069105 | Add | Make auto seed add on redirect optional (if happens at all) | 2004-11-18 | gojomo | gojomo |
1002707 | Add | Fix heritrix shutdown (From Luca) | 2004-08-03 | nobody | stack-sf |
1065736 | Add | Recovery should optionally retain failures ('Ff') | 2004-11-13 | gojomo | gojomo |
1057064 | Add | HTTPRecorder's default buffer sizes should be configurable | 2004-10-29 | gojomo | gojomo |
1045817 | Add | Untangle heritrix from jetty | 2004-10-12 | stack-sf | stack-sf |
1036720 | Fix | NPE in ArcWriterProcessor.writeDns() | 2004-09-28 | stack-sf | gojomo |
1178927 | Fix | 'submodules' map-edits not working for overrides/refinements | 2005-04-07 | gojomo | gojomo |
1179530 | Fix | NPE in FastBufferedOutputStream.close | 2005-04-08 | nobody | stack-sf |
1184102 | Fix | Frontier queues total still goes minus | 2005-04-15 | nobody | stack-sf |
1179527 | Fix | ARCWriter AsynchronousCloseException | 2005-04-08 | nobody | stack-sf |
1096855 | Fix | CME adding filters while crawling | 2005-01-05 | nobody | stack-sf |
1080378 | Fix | job config: settings 'remove'-component-then-submit lost job | 2004-12-06 | nobody | gojomo |
1176788 | Fix | hosts-report.txt is empty | 2005-04-04 | stack-sf | danavery |
1172183 | Fix | Delete URIs from frontier broken (CachedBdbBigMap.values()?) | 2005-03-28 | gojomo | gojomo |
1178102 | Fix | FCE on creation of new job based on job w/ overrides | 2005-04-06 | nobody | stack-sf |
1178103 | Fix | hung bdb (12115 redux) | 2005-04-06 | nobody | stack-sf |
1169459 | Fix | CachedBdbBigMap double-close in finialize() | 2005-03-23 | gojomo | gojomo |
1177462 | Fix | RIS#readFullyOrUntil IOE/timeout | 2005-04-05 | nobody | stack-sf |
1149470 | Fix | all DNS attempts fail -6 | 2005-02-22 | nobody | jsleeman |
1156363 | Fix | Flash SWF Extractor Unexpected end of input | 2005-03-03 | nobody | stack-sf |
1170562 | Fix | npe in extractorjs doing broad crawl w/ HEAD | 2005-03-25 | nobody | stack-sf |
1121567 | Fix | Heritrix 1.3.0 crashes hard (JVM SIGSEV) | 2005-02-12 | nobody | stack-sf |
1103015 | Fix | If filter in main scope disabled heritrix aborts imme | 2005-01-15 | nobody | frodobay |
1054219 | Fix | Links not extracted from mislabelled (text/plain) MIME type | 2004-10-25 | gojomo | gojomo |
1024120 | Fix | Lost crawl job after terminate running job with jobs pending | 2004-09-07 | stack-sf | stack-sf |
1078094 | Fix | www-strip canonicalization unintended exclusion of redirect | 2004-12-02 | stack-sf | gojomo |
1157085 | Fix | DNS records in ARCs should use DNS server IP | 2005-03-04 | stack-sf | gojomo |
1157385 | Fix | Crawler not making progress -- thread deadlock | 2005-03-05 | stack-sf | ia_igor |
1158270 | Fix | isMultibyteEncoding: Uncaught UnsupportedOperationException | 2005-03-07 | stack-sf | ck-heritrix |
1080925 | Fix | MultiThreadedConnectionManager bottleneck | 2004-12-07 | stack-sf | gojomo |
1157372 | Fix | missing space in progress-statistics.log | 2005-03-05 | stack-sf | ia_igor |
1153927 | Fix | npe in ExtractorHTML#innerProcess | 2005-02-28 | stack-sf | stack-sf |
1155641 | Fix | "Illegal response body offset" in ReplayCharSequenceFactory | 2005-03-02 | stack-sf | gojomo |
1154673 | Fix | ensure IPs match from DNS, used in HTTP, logged in ARC | 2005-03-01 | stack-sf | gojomo |
1002138 | Fix | swf extractor flash lib prints glyphcount on stdout | 2004-08-02 | nobody | stack-sf |
1077924 | Fix | crawl.log timestamps out-of-order | 2004-12-02 | gojomo | gojomo |
1066573 | Fix | sometimes job based-on other job uses older job name | 2004-11-15 | gojomo | gojomo |
1102755 | Fix | seeds text area truncates seeds; big seed lists break config | 2005-01-14 | gojomo | gojomo |
1002164 | Fix | OOM hit very early broad-crawling | 2004-08-02 | stack-sf | gojomo |
1006970 | Fix | UI list-ordering inconsistent | 2004-08-10 | gojomo | gojomo |
1092937 | Fix | UI/Settings - Expert Toggle loses user data | 2004-12-29 | nobody | nobody |
1152358 | Fix | OOM in postselector | 2005-02-26 | nobody | orion2598 |
1068403 | Fix | ARCWriter gzip deflate hang | 2004-11-17 | nobody | stack-sf |
1123906 | Fix | ARCWriter alerts if Content-Type is null | 2005-02-16 | nobody | ck-heritrix |
1124029 | Fix | Bad synchronization causes NPE in StatisticsTracker | 2005-02-16 | nobody | ck-heritrix |
1055789 | Fix | ARCWriter 'Gap' errors should be more prominent | 2004-10-27 | stack-sf | gojomo |
1123859 | Fix | Change in ExtractorHTML triggers NullPointerExceptions | 2005-02-16 | nobody | ck-heritrix |
1093073 | Fix | StackOverflowError shouldn't kill crawl | 2004-12-29 | gojomo | gojomo |
1068370 | Fix | [Flash] OOMEs on a particular URL | 2004-11-17 | nobody | stack-sf |
1108153 | Fix | unwritable ARCs directory barely noticeable | 2005-01-23 | nobody | gojomo |
1023929 | Fix | "&" converted to "&" in preselector override regex | 2004-09-07 | gojomo | danavery |
1083428 | Fix | remove profile function in WUI? | 2004-12-11 | nobody | zhousp |
1068384 | Fix | deleting all(?) from queue corrupts frontier, kills crawl | 2004-11-17 | gojomo | gojomo |
1106469 | Fix | ExtractorCSS regexp taking 'forever' on small document | 2005-01-20 | gojomo | gojomo |
1116204 | Fix | FetchDNS doesn't work (bug in dnsjava) | 2005-02-04 | nobody | nobody |
1103838 | Fix | Redirect problem (Stops crawling after 3) | 2005-01-17 | nobody | nobody |
1119686 | Fix | oversight in CrawlURI; missing check for null | 2005-02-09 | nobody | frodobay |
1060508 | Fix | [uuri] port StringIndexOutOfBoundsExceptionn | 2004-11-04 | stack-sf | stack-sf |
1101831 | Fix | NPE in ROS#record | 2005-01-13 | nobody | stack-sf |
1114285 | Fix | Old profile/jobs won't work with HEAD (1.4) | 2005-02-01 | nobody | stack-sf |
1062621 | Fix | First arc record length is off by one | 2004-11-08 | stack-sf | stack-sf |
1117916 | Fix | PDFParser URL extraction bug | 2005-02-07 | nobody | benlitchfield |
1113977 | Fix | User Agent is tolowercased | 2005-02-01 | nobody | nobody |
1113470 | Fix | Exception in Modules Tab | 2005-01-31 | nobody | nobody |
1109521 | Fix | Hung Thread in StatisticsTracker | 2005-01-25 | stack-sf | ia_igor |
1107304 | Fix | Failed create new job based on job with absolute settings | 2005-01-22 | nobody | frodobay |
1000865 | Fix | Long random pauses where no progress is made | 2004-07-30 | nobody | gojomo |
1095952 | Fix | InvalidJobFileException: Status .. 'RUNNING' | 2005-01-04 | nobody | stack-sf |
1095453 | Fix | heritrix wont start with fedora core 3 | 2005-01-03 | nobody | nobody |
1092135 | Fix | crawl.log hashes wrong for captures > 64K | 2004-12-28 | gojomo | gojomo |
1103133 | Fix | deadlock in ip-politeness requeueing | 2005-01-15 | gojomo | gojomo |
1102771 | Fix | SURTs-from-seeds may lack trailing comma | 2005-01-14 | gojomo | gojomo |
1101396 | Fix | JS extr. does not parse spec. links starting w/ ./ or ../ | 2005-01-12 | nobody | ia_igor |
1100658 | Fix | update to [ 1100467 ] maven 1.0.2 build problem | 2005-01-11 | stack-sf | nobody |
1101138 | Fix | Update ant and httpclient jars | 2005-01-12 | stack-sf | stack-sf |
1098217 | Fix | ReplayCharSequence.toString() is broken | 2005-01-07 | nobody | stack-sf |
1093627 | Fix | [robots] robots.txt midfetch aborted gives open access | 2004-12-30 | nobody | stack-sf |
1093614 | Fix | midfetch abort doesn't | 2004-12-30 | nobody | stack-sf |
1082358 | Fix | [uuri] String index out of range: 0 | 2004-12-09 | stack-sf | stack-sf |
1086554 | Fix | glibc 2.3.2 NPTL hang (Was bdbfrontier stall in...) | 2004-12-16 | stack-sf | stack-sf |
1072035 | Fix | [uuri] Underscore in host messes up port parsing | 2004-11-23 | stack-sf | stack-sf |
1043251 | Fix | better/longer dns retries on lookup failure | 2004-10-08 | gojomo | stack-sf |
1090911 | Fix | NPE in ServerCache | 2004-12-24 | stack-sf | stack-sf |
1080926 | Fix | reducing max-toe-threads has no effect | 2004-12-07 | gojomo | gojomo |
1088788 | Fix | NPE in TextUtils.freeMatcher() | 2004-12-20 | stack-sf | gojomo |
1082570 | Fix | heritrix.log ignored | 2004-12-09 | nobody | stack-sf |
1078503 | Fix | Edit configuration in UI gives NPE | 2004-12-03 | nobody | stack-sf |
1055592 | Fix | terminated crawl still hogging memory, causing OOM | 2004-10-27 | nobody | gojomo |
1081770 | Fix | quick-override accepts domain w/spaces, lost checkboxes | 2004-12-08 | gojomo | gojomo |
1080827 | Fix | Browser hangs when hundreds of seeds | 2004-12-07 | nobody | stack-sf |
1047396 | Fix | OOM in BdbFrontier/nio.Bits -- with plenty of heap left | 2004-10-14 | nobody | gojomo |
1078581 | Fix | DomainSensitiveFrontier never finishes | 2004-12-03 | nobody | stack-sf |
1076251 | Fix | Upgrade bdbje 1.7.0 (WAS: Checkpointer thread ...) | 2004-11-30 | nobody | stack-sf |
1072192 | Fix | bdbfrontier No locks available | 2004-11-23 | nobody | stack-sf |
1031499 | Fix | Deleted pending jobs show as pending in restart. | 2004-09-20 | stack-sf | stack-sf |