General
Common Problems
References
Heritrix (sometimes spelled heretrix) is an archaic word for inheritress. Since our crawler seeks to collect the digital artifacts of our culture and preserve them for the benefit of future researchers and generations, this name seemed apt.
Please see the documentation on the Heritrix Wiki. The document An Introduction to Heritrix describes the Heritrix architecture.
Yes. Start by checking out the Heritrix User Manual.
Yes -- especially if you have experience with Java, open-source projects, and web protocols! Drop us a message or join our mailing lists for more details. See also the Heritrix Developer Manual.
The GNU LESSER GENERAL PUBLIC LICENSE. For discussion of 3rd party applications making use of LGPL code, see David Turner on The LGPL and Java.
Heritrix does not depend on a specific Linux distribution to function and should work on any distro as long as a suitable Java Virtual Machine can be installed on it. We know that Heritrix has been successfully deployed on Red Hat 7.2, recent Fedora Core versions (2 and 4), as well as on SuSE 9.3. Heritrix is known to work well with kernel versions 2.4.x. With kernel versions 2.6.x there are issues when using JVMs other than the release version of the Sun 1.5 JDK. See Why do unit tests fail when I build? below. There are also issues when using the Linux NPTL threading model, particularly with older glibcs (i.e. Debian). See Glibc 2.3.2 and NPTL in the release notes.
A heritrix.cmd script has been added to $HERITRIX_HOME/bin lately that does most of what the $HERITRIX_HOME/bin/heritrix script does -- loading jars in the right order, etc. (originally contributed by Eric Jensen and finished by Max Schöfmann). Max has also written up this page: Web crawling: Using Heritrix on Windows. Also see Crawler Stalling on Windows, 2085, and the items below that pertain to Windows: dns and mkdir.
I want to only download text/html and nothing else. Can I do it?

Add the ContentTypeRegExpFilter filter as a midfetch filter to the http fetcher. This filter will be checked after the response headers have been downloaded but before the response content is fetched. Configure it to only allow through documents of the desired Content-Type. Apply the same filter at the writer stage of processing to eliminate recording of response headers in ARCs. See the User Manual. (Prerequisite URLs bypass the midfetch filters, so it is not possible to filter out robots.txt using this mechanism.)
On linux, a usual upper bound is 1024 file descriptors per process. To change this upper bound, there's a couple of things you can do.
If running the crawler as non-root (recommended),
you can configure limits in /etc/security/limits.conf
. For
example you can setup open files limit for all users in webcrawler group
as:
# Each line describes a limit for a user in the form:
#
# domain    type    item    value
#
@webcrawler    hard    nofile    32768
Otherwise, running as root (you need to be root to raise ulimits), you can do the following to raise the ulimit for the Heritrix process only:

# (ulimit -n 4096; JAVA_OPTS=-Xmx320m bin/heritrix -p 9876)
Below is a rough accounting of FDs used in Heritrix 1.0.x.

In Heritrix, the number of concurrent threads is configurable. The default frontier implementation allocates a thread per server, and per server the frontier keeps a disk-backed queue. Each disk-backed queue maintains three backing files with '.qin', '.qout', and '.top' suffixes (one to read from, one to write to, and a queue-head file). So each thread occupies at least three file descriptors whenever its queue needs to persist to disk.

Apart from the above per-thread FD cost, there is a fixed FD cost for instantiating the crawler itself.
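If you want to see how many descriptors the crawler JVM actually has open, recent Sun JDKs expose this through the platform MBeans on Unix-like systems. A small optional-diagnostics sketch (the com.sun.management API is Sun-JDK-specific and reports on whatever JVM the code runs in, so it would need to run inside, or be attached to, the crawler's JVM):

import java.lang.management.ManagementFactory;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdCheck {
    public static void main(String[] args) {
        Object os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            // Compare what the JVM sees against the per-thread estimate above.
            System.out.println("open FDs: " + unix.getOpenFileDescriptorCount());
            System.out.println("max FDs:  " + unix.getMaxFileDescriptorCount());
        }
    }
}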
If using 64-bit JVM, see Gordon's note to the list on 12/19/2005, Re: Large crawl experience (like, 500M links).
See the note in [ 896772 ] "Site-first"/'frontline' prioritization and this Release Note, 5.1.1 Crawl Size Upper Bounds. See this note by Kris from the list, 1027, for how to mitigate memory use when using HostQueuesFrontier. The advice is less applicable if you are using a post-1.2.0 BdbFrontier Heritrix. See the sections 'Crawl Size Upper Bounds Update' in the Release Notes.
Yes. See RE: [archive-crawler] Inserting information to MYSQL during crawl for pointers on how, but also see the rest of that thread for why you might prefer to do database insertion after the crawl rather than during it.
See MirrorWriterProcessor. It writes a file per URL to the filesystem using a name that is a derivative of the requested URL.
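The actual naming rules live in MirrorWriterProcessor itself; the following is only a simplified, hypothetical illustration of the general idea of deriving a filesystem path from a requested URL (the mirrorPath helper is made up for this example, not the processor's real logic):

import java.net.URI;

public class MirrorPathSketch {
    // Derive a filesystem-style path from a URL: host plus path, with a
    // default index name substituted for directory-style URLs.
    static String mirrorPath(String url) {
        URI u = URI.create(url);
        String path = u.getPath();
        if (path == null || path.length() == 0 || path.endsWith("/")) {
            path = (path == null ? "/" : path) + "index.html";
        }
        return u.getHost() + path;
    }

    public static void main(String[] args) {
        System.out.println(mirrorPath("http://www.example.com/docs/a.html")); // www.example.com/docs/a.html
        System.out.println(mirrorPath("http://www.example.com/docs/"));       // www.example.com/docs/index.html
    }
}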
You'll need to configure Eclipse for Java 5.0 compliance to get rid of the assert errors ('assert' only became a keyword in Java 1.4, and Eclipse currently defaults to 1.3 compliance). This can be done by going into "Window>Preferences>Java/Compiler>Compliance and Classfiles" and setting "Compiler compliance level" to 5.0. Make sure 'Use default compliance level' is UNCHECKED and that 'Generated .class files compatibility' and 'Source compatibility' are also set to 5.0.
The crawl can get hung up on sites that are actually down or are non-responsive. Manual intervention is necessary in such cases. Study the frontier to get a picture of what is left to be crawled. Looking at the local errors log will let you see the problems with currently crawled URIs. Along with robots.txt retries, you will probably also see httpclient timeouts. In general you want to look for repetition of problems with particular hosts/URIs.
Grepping the local errors log is a bit tricky because of the shape of its content. It's recommended that you first "flatten" the local errors file. Here's an example:
% cat local-errors.log | tr -d \\\n | perl -pe 's/([0-9]{17} )/\n$1/g'
This will remove all newlines and then add a newline in front of each 17-digit date (hopefully only 17-digit tokens followed by a space are dates). The result is one line per entry with a 17-digit date prefix, which makes it easier to parse.
To eliminate URIs for unresponsive hosts from the frontier queue, pause the crawl and block the fetch from that host by creating a new per-host setting -- an override -- in the preselector processor.
Also, check for any hung threads (this shouldn't happen anymore as of 0.8.0+). Check the threads report for threads that have been active for a long time but should not be, i.e. where the documents being downloaded are small in size. Once you've identified hung threads, kill and replace them.
Traps are infinite page sources put up to occupy ('trap') a crawler. Traps may be as innocent as a calendar that returns pages years into the future, or not so innocent, like http://spiders.must.die.net/. Traps are created by CGIs/server-side code that dynamically conjures 'nonsense' pages or else exploits combinations of soft and relative links to generate URI paths of infinite variety and depth. Once identified, use filters to guard against falling in.
Another trap that works by feeding documents of infinite sizes to the crawler is http://yahoo.domain.com.au/tools/spiderbait.aspx* as in http://yahoo.domain.com.au/tools/spiderbait.aspx?state=vic or http://yahoo.domain.com.au/tools/spiderbait.aspx?state=nsw. To filter out infinite document size traps, add a maximum doc. size filter to your crawl order.
In the past crawls were stopped when we ran into "junk." An example of what we mean by "junk" is the crawler getting stuck in a web calendar, crawling the year 2020. Nowadays, if "junk" is detected, we'll pause the crawl, set filters to eliminate the "junk", and then resume (eliminated URIs will show in the logs, which helps when doing post-crawl analysis).
To help guard against the crawling of "junk", set up the pathological and path-depth filters. This will also help the crawler avoid traps. The recommended value for the pathological filter is 3 repetitions of the same pattern -- e.g. /images/images/images/... -- and for the path-depth filter, a value of 20.
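The pathological filter itself is configured through the crawl order; purely as an illustration of the kind of pattern such a filter applies, here is a stand-alone java.util.regex check that flags a URI whose path repeats the same segment three or more times in a row (the pattern is an assumption for demonstration, not Heritrix's exact default):

import java.util.regex.Pattern;

public class PathologicalPathDemo {
    public static void main(String[] args) {
        // Flag URIs whose path repeats the same segment three or more times
        // in a row, e.g. /images/images/images/...
        Pattern trap = Pattern.compile(".*?/(.+?/)\\1{2,}.*");
        System.out.println(trap.matcher(
            "http://example.com/images/images/images/pic.gif").matches()); // true
        System.out.println(trap.matcher(
            "http://example.com/images/2004/images/pic.gif").matches());   // false
    }
}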
Try out Heritrix bundled as a WAR file. Use the maven 'war' target to produce a heritrix.war or pull the war from the build downloads page (Click on the 'Build Artifacts' link). Heritrix as a WAR is available in HEAD only (post-1.2.0) and currently has 'experimental' status (i.e. It needs exercising).
Sure. Make sure everything in the Heritrix lib directory is on your CLASSPATH (ensuring heritrix.jar is found first). Thereafter, using HEAD (post-1.2.0), doing the following should get you a long way:
Heritrix h = new Heritrix(); h.launch();
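Fleshed out slightly, a minimal embedding sketch might look like the following, assuming the 1.x main class org.archive.crawler.Heritrix and the classpath set up as above; the constructor and launch() call are the ones shown in the snippet, everything else is boilerplate:

import org.archive.crawler.Heritrix;

public class EmbeddedCrawl {
    public static void main(String[] args) throws Exception {
        Heritrix h = new Heritrix(); // construct the crawler
        h.launch();                  // start it, as described above
    }
}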
A JMX interface has been added to the crawler. The intent is that all features of the UI are exposed in JMX so Heritrix can be remotely controlled.
A command-line control utility that makes use of the JMX API has been added. The script can be found in the scripts directory. It's packaged as a jar file named cmdline-jmxclient.X.X.X.jar. It has no dependencies on other jars being found in its classpath, so it can be safely moved from this location. Its only dependency is JDK 1.5.0. To learn more, obtain client usage by typing the following:

${PATH_TO_JDK1.5.0}/bin/java -jar cmdline-jmxclient.X.X.X.jar

See also cmdline-jmxclient to learn more.
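If you would rather talk to the crawler's JMX agent from your own code instead of the bundled client, the standard JDK 1.5 javax.management.remote API is enough. A minimal sketch; the service URL, port, and credentials below are placeholders for whatever your Heritrix instance was actually started with:

import java.util.Collections;
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ListCrawlerBeans {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port/credentials -- substitute your own.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:8849/jmxrmi");
        Map<String, Object> env = Collections.<String, Object>singletonMap(
            JMXConnector.CREDENTIALS, new String[] {"controlRole", "password"});
        JMXConnector connector = JMXConnectorFactory.connect(url, env);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Print every registered MBean; the crawler's beans show up here.
            for (ObjectName name : mbsc.queryNames(null, null)) {
                System.out.println(name);
            }
        } finally {
            connector.close();
        }
    }
}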
See this note by Tom Emerson, 1182, for a suggestion. It's also possible post-1.4.0 to run multiple Heritrix instances in a single JVM. Browse to /local-instances.jsp.
While the mascots of web crawlers have usually been spider-related, I'd rather think of Heritrix as a centipede or millipede: fast and many-segmented.
Anything that "crawls" over many things at once would presumably have a lot of feet and toes. Heritrix will often use many hundreds of worker threads to "crawl", but 'WorkerThread' or 'CrawlThread' seem mundane.
So instead, we have 'ToeThreads'. :)
Below is a listing of users of Heritrix (to qualify for inclusion in the list below, send a description of a couple of lines to the mailing list).
See the ARC section in the Developer Manual.
The following are all worth at least a quick skim: