General
Common Problems
References
Heritrix (sometimes spelled heretrix) is an archaic word for inheritress. Since our crawler seeks to collect the digital artifacts of our culture and preserve them for the benefit of future researchers and generations, this name seemed apt.
Please see the documentation on the Heritrix Wiki. The document An Introduction to Heritrix describes the Heritrix architecture.
Yes. Start by checking out the Heritrix User Manual.
Yes -- especially if you have experience with Java, open-source projects, and web protocols! Drop us a message or join our mailing lists for more details. See also the Heritrix Developer Manual.
The GNU LESSER GENERAL PUBLIC LICENSE. For discussion of 3rd party applications making use of LGPL code, see David Turner on The LGPL and Java.
Heritrix does not depend on a specific Linux distribution to function and should work on any distro as long as a suitable Java Virtual Machine can be installed on it. We know that Heritrix has been successfully deployed on Red Hat 7.2, recent Fedora Core versions (2 and 4), as well as on SuSE 9.3. Heritrix is known to work well with kernel versions 2.4.x. With kernel versions 2.6.x there are issues when using JVMs other than the release version of the Sun 1.5 JDK. See Why do unit tests fail when I build? below. There are also issues when using the Linux NPTL threading model, particularly with older glibcs (i.e. Debian). See Glibc 2.3.2 and NPTL in the release notes.
A heritrix.cmd script has been added to $HERITRIX_HOME/bin lately that does most of what the $HERITRIX_HOME/bin/heritrix script does -- loading jars in the right order, etc. (originally contributed by Eric Jensen and finished by Max Schöfmann). Max has also written up this page: Web crawling: Using Heritrix on Windows. Also see Crawler Stalling on Windows, 2085, and the items below that pertain to Windows: dns and mkdir.
I want to only download text/html and nothing else. Can I do it?

Add the ContentTypeRegExpFilter filter as a midfetch filter to the http fetcher. This filter will be checked after the response headers have been downloaded but before the response content is fetched. Configure it to only allow through documents of the desired Content-Type. Apply the same filter at the writer stage of processing to eliminate recording of response headers in ARCs. See the User Manual. (Prerequisite URLs bypass the midfetch filters, so it is not possible to filter out robots.txt using this mechanism.)
On linux, a usual upper bound is 1024 file descriptors per process. To change this upper bound, there's a couple of things you can do.
If running the crawler as non-root (recommended),
you can configure limits in /etc/security/limits.conf
. For
example you can setup open files limit for all users in webcrawler group
as:
# Each line describes a limit for a user in the form:
#
# domain    type    item    value
#
@webcrawler    hard    nofile    32768
Otherwise, running as root (you need to be root to raise ulimits), you can do the following to raise the ulimit for the Heritrix process only:

# (ulimit -n 4096; JAVA_OPTS=-Xmx320m bin/heritrix -p 9876)
Below is a rough accounting of FDs used in Heritrix 1.0.x.

In Heritrix, the number of concurrent threads is configurable. The default frontier implementation allocates a thread per server, and per server the frontier keeps a disk-backed queue. Each disk-backed queue maintains three backing files with '.qin', '.qout', and '.top' suffixes (one to read from, one to write to, and a queue-head file). So each thread occupies at least three file descriptors whenever its queue needs to persist to disk.

Apart from the above per-thread FD cost, there is a fixed FD cost for instantiating the crawler itself.
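If you want to see how many descriptors the crawler JVM actually has open, recent Sun JDKs expose this through the platform MBeans on Unix-like systems. A small optional-diagnostics sketch (the com.sun.management API is Sun-JDK-specific and reports on whatever JVM the code runs in, so it would need to run inside, or be attached to, the crawler's JVM):

import java.lang.management.ManagementFactory;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdCheck {
    public static void main(String[] args) {
        Object os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            // Compare what the JVM sees against the per-thread estimate above.
            System.out.println("open FDs: " + unix.getOpenFileDescriptorCount());
            System.out.println("max FDs:  " + unix.getMaxFileDescriptorCount());
        }
    }
}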
If using 64-bit JVM, see Gordon's note to the list on 12/19/2005, Re: Large crawl experience (like, 500M links).
See the note in [ 896772 ] "Site-first"/'frontline' prioritization and this Release Note, 5.1.1 Crawl Size Upper Bounds. See this note by Kris from the list, 1027, for how to mitigate memory use when using HostQueuesFrontier. The advice is less applicable if you are using a post-1.2.0 BdbFrontier Heritrix. See the sections 'Crawl Size Upper Bounds Update' in the Release Notes.
Yes. See RE: [archive-crawler] Inserting information to MYSQL during crawl for pointers on how, but also see the rest of that thread for why you might prefer to do database insertion after the crawl rather than during it.
See MirrorWriterProcessor. It writes a file per URL to the filesystem using a name that is a derivative of the requested URL.
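The actual naming rules live in MirrorWriterProcessor itself; the following is only a simplified, hypothetical illustration of the general idea of deriving a filesystem path from a requested URL (the mirrorPath helper is made up for this example, not the processor's real logic):

import java.net.URI;

public class MirrorPathSketch {
    // Derive a filesystem-style path from a URL: host plus path, with a
    // default index name substituted for directory-style URLs.
    static String mirrorPath(String url) {
        URI u = URI.create(url);
        String path = u.getPath();
        if (path == null || path.length() == 0 || path.endsWith("/")) {
            path = (path == null ? "/" : path) + "index.html";
        }
        return u.getHost() + path;
    }

    public static void main(String[] args) {
        System.out.println(mirrorPath("http://www.example.com/docs/a.html")); // www.example.com/docs/a.html
        System.out.println(mirrorPath("http://www.example.com/docs/"));       // www.example.com/docs/index.html
    }
}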
You'll need to configure Eclipse for Java 5.0 compliance to get rid of the assert errors ('assert' only became a keyword in Java 1.4, and Eclipse currently defaults to 1.3 compliance). This can be done by going into "Window>Preferences>Java/Compiler>Compliance and Classfiles" and setting "Compiler compliance level" to 5.0. Make sure 'Use default compliance level' is UNCHECKED and that 'Generated .class files compatibility' and 'Source compatibility' are also set to 5.0.
The crawl can get hung up on sites that are actually down or are non-responsive. Manual intervention is necessary in such cases. Study the frontier to get a picture of what is left to be crawled. Looking at the local errors log will let you see the problems with currently crawled URIs. Along with robots.txt retries, you will probably also see httpclient timeouts. In general you want to look for repetition of problems with particular hosts/URIs.
Grepping the local errors log is a bit tricky because of the shape of its content. It's recommended that you first "flatten" the local errors file. Here's an example:
% cat local-errors.log | tr -d \\\n | perl -pe 's/([0-9]{17} )/\n$1/g'
This will remove all newlines and then add a newline in front of each 17-digit date (hopefully only 17-digit tokens followed by a space are dates). The result is one line per entry with a 17-digit date prefix, which makes it easier to parse.
To eliminate URIs for unresponsive hosts from the frontier queue, pause the crawl and block the fetch from that host by creating a new per-host setting -- an override -- in the preselector processor.
Also, check for any hung threads (this shouldn't happen anymore as of 0.8.0+). Check the threads report for threads that have been active for a long time but should not be, i.e. where the documents being downloaded are small in size. Once you've identified hung threads, kill and replace them.
Traps are infinite page sources put up to occupy ('trap') a crawler. Traps may be as innocent as a calendar that returns pages years into the future, or not so innocent, like http://spiders.must.die.net/. Traps are created by CGIs/server-side code that dynamically conjures 'nonsense' pages or else exploits combinations of soft and relative links to generate URI paths of infinite variety and depth. Once identified, use filters to guard against falling in.
Another trap that works by feeding documents of infinite sizes to the crawler is http://yahoo.domain.com.au/tools/spiderbait.aspx* as in http://yahoo.domain.com.au/tools/spiderbait.aspx?state=vic or http://yahoo.domain.com.au/tools/spiderbait.aspx?state=nsw. To filter out infinite document size traps, add a maximum doc. size filter to your crawl order.
In the past crawls were stopped when we ran into "junk." An example of what we mean by "junk" is the crawler getting stuck in a web calendar, crawling the year 2020. Nowadays, if "junk" is detected, we'll pause the crawl, set filters to eliminate the "junk", and then resume (eliminated URIs will show in the logs, which helps when doing post-crawl analysis).
To help guard against the crawling of "junk", set up the pathological and path-depth filters. This will also help the crawler avoid traps. The recommended value for the pathological filter is 3 repetitions of the same pattern -- e.g. /images/images/images/... -- and for the path-depth filter, a value of 20.
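The pathological filter itself is configured through the crawl order; purely as an illustration of the kind of pattern such a filter applies, here is a stand-alone java.util.regex check that flags a URI whose path repeats the same segment three or more times in a row (the pattern is an assumption for demonstration, not Heritrix's exact default):

import java.util.regex.Pattern;

public class PathologicalPathDemo {
    public static void main(String[] args) {
        // Flag URIs whose path repeats the same segment three or more times
        // in a row, e.g. /images/images/images/...
        Pattern trap = Pattern.compile(".*?/(.+?/)\\1{2,}.*");
        System.out.println(trap.matcher(
            "http://example.com/images/images/images/pic.gif").matches()); // true
        System.out.println(trap.matcher(
            "http://example.com/images/2004/images/pic.gif").matches());   // false
    }
}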
Try out Heritrix bundled as a WAR file. Use the maven 'war' target to produce a heritrix.war or pull the war from the build downloads page (Click on the 'Build Artifacts' link). Heritrix as a WAR is available in HEAD only (post-1.2.0) and currently has 'experimental' status (i.e. It needs exercising).
Sure. Make sure everything in the Heritrix lib directory is on your CLASSPATH (ensuring heritrix.jar is found first). Thereafter, using HEAD (post-1.2.0), doing the following should get you a long way:
Heritrix h = new Heritrix(); h.launch();
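Fleshed out slightly, a minimal embedding sketch might look like the following, assuming the 1.x main class org.archive.crawler.Heritrix and the classpath set up as above; the constructor and launch() call are the ones shown in the snippet, everything else is boilerplate:

import org.archive.crawler.Heritrix;

public class EmbeddedCrawl {
    public static void main(String[] args) throws Exception {
        Heritrix h = new Heritrix(); // construct the crawler
        h.launch();                  // start it, as described above
    }
}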
A JMX interface has been added to the crawler. The intent is that all features of the UI are exposed in JMX so Heritrix can be remotely controlled.
A command-line control utility that makes use of the JMX API has been added. The script can be found in the scripts directory. It's packaged as a jar file named cmdline-jmxclient.X.X.X.jar. It has no dependencies on other jars being found in its classpath, so it can be safely moved from this location. Its only dependency is JDK 1.5.0. To learn more, obtain client usage by typing the following:

${PATH_TO_JDK1.5.0}/bin/java -jar cmdline-jmxclient.X.X.X.jar

See also cmdline-jmxclient to learn more.
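If you would rather talk to the crawler's JMX agent from your own code instead of the bundled client, the standard JDK 1.5 javax.management.remote API is enough. A minimal sketch; the service URL, port, and credentials below are placeholders for whatever your Heritrix instance was actually started with:

import java.util.Collections;
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ListCrawlerBeans {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port/credentials -- substitute your own.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:8849/jmxrmi");
        Map<String, Object> env = Collections.<String, Object>singletonMap(
            JMXConnector.CREDENTIALS, new String[] {"controlRole", "password"});
        JMXConnector connector = JMXConnectorFactory.connect(url, env);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Print every registered MBean; the crawler's beans show up here.
            for (ObjectName name : mbsc.queryNames(null, null)) {
                System.out.println(name);
            }
        } finally {
            connector.close();
        }
    }
}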
See this note by Tom Emerson, 1182, for a suggestion. It's also possible post-1.4.0 to run multiple Heritrix instances in a single JVM. Browse to /local-instances.jsp.
While the mascots of web crawlers have usually been spider-related, I'd rather think of Heritrix as a centipede or millipede: fast and many-segmented.
Anything that "crawls" over many things at once would presumably have a lot of feet and toes. Heritrix will often use many hundreds of worker threads to "crawl", but 'WorkerThread' or 'CrawlThread' seem mundane.
So instead, we have 'ToeThreads'. :)
Below is a listing of users of Heritrix (to qualify for inclusion in the list below, send a description of a couple of lines to the mailing list).
See the ARC section in the Developer Manual.
The following are all worth at least a quick skim: