13. Internet Archive ARC files

By default, Heritrix writes everything it crawls to disk using ARCWriterProcessor. This processor writes the crawled content as Internet Archive ARC files. The ARC file format is described here: Arc File Format. Heritrix writes version 1 ARC files.

By default, Heritrix writes compressed version 1 ARC files. The compression is done with gzip, but rather than compressing the ARC as a whole, each ARC record is gzipped in turn. All gzipped records are concatenated together to make up a file of multiple gzipped members. This concatenation, it turns out, is a legal gzip file; you can give it to gzip and it will decompress each record in turn. It is a convenient compression technique because it allows seeking directly to a single record and decompressing that record only. Otherwise, the whole ARC would first have to be uncompressed to get at any one record.
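The per-record compression scheme can be illustrated with a minimal Python sketch. It does not use Heritrix or real ARC records: the three byte strings are stand-ins, and the helper name read_record and the offsets list are hypothetical, introduced only for this example. It shows that concatenated gzip members decompress as one stream, and that a single member can be decompressed on its own when its starting offset is known.

```python
import gzip
import zlib

# Stand-in "records"; real ARC records have a header line followed by a body.
records = [b"record one\n", b"record two\n", b"record three\n"]

# Compress each record as its own gzip member and note where each one starts.
offsets = []
archive = b""
for record in records:
    offsets.append(len(archive))
    archive += gzip.compress(record)

# The concatenation is itself a valid gzip stream: decompressing it whole
# yields every record in order.
assert gzip.decompress(archive) == b"".join(records)


def read_record(data: bytes, offset: int) -> bytes:
    """Decompress only the gzip member that starts at `offset`."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    member = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Decompression stops at the end of this member; the bytes belonging to
    # later members are left untouched in member.unused_data.
    return member.decompress(data[offset:])


second = read_record(archive, offsets[1])
```

Here the offsets are kept in memory. In practice the offset of a record would have to come from somewhere else, such as an external index of the ARC file.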

Before the release of Heritrix 1.0, an amendment was made to the ARC file version 1 format to allow extra metadata to be written into the first record of an ARC file. This extra metadata is written as XML. The XML Schema used by metadata instance documents can be found at http://archive.org/arc/1.0/xsd. The schema is documented here.

If the extra XML metadata is present, the '<reserved>' field, which is the second field on the second line of a version 1 ARC file, is changed from '0' to '1'; that is, ARCs with XML metadata are version '1.1'.
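As a rough sketch of how a reader might test for this flag, the Python function below inspects the second line of an uncompressed first record. It assumes that line has the space-separated layout '<version-number> <reserved> <origin-code>'; the function name has_xml_metadata and the sample origin code are hypothetical and used only for illustration.

```python
def has_xml_metadata(second_line: str) -> bool:
    """Report whether a version 1 ARC flags XML metadata (version '1.1')."""
    # Assumed layout: <version-number> <reserved> <origin-code>
    fields = second_line.strip().split(" ")
    if len(fields) < 3:
        raise ValueError(f"unexpected version line: {second_line!r}")
    version, reserved = fields[0], fields[1]
    # A reserved field of '1' on a version 1 file marks the XML amendment.
    return version == "1" and reserved == "1"


# Illustrative lines only; the origin code value is made up for the example.
plain = has_xml_metadata("1 0 InternetArchive")
amended = has_xml_metadata("1 1 InternetArchive")
```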

If present, the ARC file metadata record body will contain at least the following fields (later revisions to the ARC format may add other fields):

  1. Software and software version used to create the ARC file. Example: 'heritrix 0.7.1 http://crawler.archive.org'.

  2. The IP of the host that created the ARC file. Example: '103.1.0.3'.

  3. The hostname of the host that created the ARC file. Example: 'debord.archive.org'.

  4. Contact name of the crawl operator. Default value is 'admin'.

  5. The http-header 'user-agent' field from the crawl-job order file. This field is recorded here in the metadata only until the day ARCs record the HTTP request made. Example: 'os-heritrix/0.7.0 (+http://localhost.localdomain)'.

  6. The http-header 'from' from the crawl-job order file. This field is recorded here in the metadata only until the day ARCs record the HTTP request made. Example: 'webmaster@localhost.localdomain'.

  7. The 'description' from the crawl-job order file. Example: 'Heritrix integrated selftest'.

  8. The Robots honoring policy. Example: 'classic'.

  9. Organization on whose behalf the operator is running the crawl. Example: 'Internet Archive'.

  10. The recipient of the crawl ARC resource if known. Example: 'Library of Congress'.

13.1. ARC File Naming

When Heritrix creates ARC files, it names them using the following template:

        <OPERATOR SPECIFIED> '-' <12 DIGIT TIMESTAMP> '-' <SERIAL NO.> '-' <FQDN HOSTNAME> '.arc' | '.gz'
        
... where <OPERATOR SPECIFIED> is any operator-specified text, <SERIAL NO.> is a zero-padded 5-digit number and <FQDN HOSTNAME> is the fully-qualified domain name of the host on which the crawl was run.
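The template can be sketched in Python as a pair of helpers that build and parse such names. This is an illustration under stated assumptions, not Heritrix code: the names make_arc_name, parse_arc_name and ARC_NAME are hypothetical, the prefix and timestamp values are made up, and the sketch assumes that compressed files end in '.arc.gz' while uncompressed files end in '.arc'.

```python
import re

# <OPERATOR SPECIFIED>-<12 DIGIT TIMESTAMP>-<SERIAL NO.>-<FQDN HOSTNAME>.arc[.gz]
# Assumption: compressed files carry '.gz' after '.arc'.
ARC_NAME = re.compile(
    r"^(?P<prefix>.+)-(?P<timestamp>\d{12})-(?P<serial>\d{5})"
    r"-(?P<host>.+?)\.arc(?P<gz>\.gz)?$"
)


def make_arc_name(prefix, timestamp, serial, host, compressed=True):
    """Build a file name following the template above."""
    # The serial number is zero-padded to 5 digits.
    name = f"{prefix}-{timestamp}-{serial:05d}-{host}.arc"
    return name + ".gz" if compressed else name


def parse_arc_name(name):
    """Split a file name back into its template fields."""
    match = ARC_NAME.match(name)
    if match is None:
        raise ValueError(f"not an ARC file name: {name!r}")
    return {
        "prefix": match.group("prefix"),
        "timestamp": match.group("timestamp"),
        "serial": int(match.group("serial")),
        "host": match.group("host"),
        "compressed": match.group("gz") is not None,
    }


# Hypothetical prefix and timestamp; the hostname is the one used earlier.
name = make_arc_name("my-crawl", "200401151230", 7, "debord.archive.org")
fields = parse_arc_name(name)
```

The prefix group is greedy so that operator text containing hyphens still parses; the timestamp and serial number are located by their fixed digit counts.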

13.2. Reading ARC files

ARCReader can be used to read ARC files. It has a command-line interface that can print meta info in a pseudo-CDX format and retrieve individual ARC records by random access (the command-line interface is described in the javadoc comments of the main method).

Netarchive.dk have also developed ARC reading and writing tools.

Tom Emerson of Basis Technology has put up a project on SourceForge to host a BSD-licensed C++ ARC reader called libarc (it has since been moved to archive-access).

The French National Library (BnF) has also released a GPL Perl/C ARC reader. See BAT for documentation and where to download it.

See Hedaern for Python readers/writers and for a skeleton webapp that allows querying by timestamp+date as well as full-text search of ARC content.

13.3. Writing ARC files

Here is an example ARC writer application: Nedlib To ARC conversion. It rewrites Nedlib files as ARCs.

13.4. Searching ARCs

Check out the NutchWAX+WERA bundle.