org.archive.io.arc
Class ARCReader

java.lang.Object
  extended by org.archive.io.ArchiveReader
      extended by org.archive.io.arc.ARCReader
All Implemented Interfaces:
ARCConstants, ArchiveFileConstants
Direct Known Subclasses:
ARCReaderFactory.CompressedARCReader, ARCReaderFactory.UncompressedARCReader

public abstract class ARCReader
extends ArchiveReader
implements ARCConstants

Get an iterator on an ARC file or get a record by absolute position. ARC files are described here: Arc File Format.

This class knows how to parse an ARC file. Pass it a file path or a URL to an ARC. It can parse ARC versions 1 and 2.

The iterator returns ARCRecord instances, but Iterator.next() is declared to return java.lang.Object, so cast the returned value.
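
The cast can be illustrated without the archive classes. This is only a sketch: StubRecord is a hypothetical stand-in for ARCRecord, and the raw Iterator mimics an iterator whose next() is declared to return java.lang.Object.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class IteratorCastSketch {
    /** Hypothetical stand-in for ARCRecord; not part of org.archive.io. */
    static final class StubRecord {
        final String url;

        StubRecord(String url) {
            this.url = url;
        }
    }

    /** Collects the URLs from an untyped iterator, casting each element. */
    @SuppressWarnings("rawtypes")
    static List<String> urls(Iterator it) {
        List<String> out = new ArrayList<String>();
        while (it.hasNext()) {
            // next() is declared to return Object, so cast to the record type.
            StubRecord record = (StubRecord) it.next();
            out.add(record.url);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> records = new ArrayList<Object>();
        records.add(new StubRecord("http://example.com/"));
        System.out.println(urls(records.iterator()));
    }
}
```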

Profiling java.io against a memory-mapped ByteBufferInputStream shows the latter to be slightly slower, but not by much. TODO: Test more. To switch implementations, just change ArchiveReader.getInputStream(File, long).
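
For readers who want to repeat the comparison, the two access styles can be sketched with the standard library alone. This is not the ArchiveReader implementation; the class and method names below are hypothetical, and the sketch only shows that a seek-and-read and a read-only memory mapping return the same bytes for a given offset.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

final class OffsetReadSketch {
    /** Reads len bytes at offset with plain java.io (seek, then readFully). */
    static byte[] readWithJavaIo(File f, long offset, int len) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek(offset);
            byte[] buf = new byte[len];
            raf.readFully(buf);
            return buf;
        }
    }

    /** Reads the same range through a read-only memory mapping. */
    static byte[] readWithMapping(File f, long offset, int len) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            MappedByteBuffer map =
                raf.getChannel().map(FileChannel.MapMode.READ_ONLY, offset, len);
            byte[] buf = new byte[len];
            map.get(buf);
            return buf;
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("offset-read-sketch", ".bin");
        f.deleteOnExit();
        Files.write(f.toPath(), "0123456789".getBytes(StandardCharsets.US_ASCII));
        System.out.println(new String(readWithJavaIo(f, 3, 4), StandardCharsets.US_ASCII));
        System.out.println(new String(readWithMapping(f, 3, 4), StandardCharsets.US_ASCII));
    }
}
```

Timing each method over a large file in a loop would give a rough version of the profile mentioned above; results will vary with the JVM and operating system.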

Version:
$Date: 2010-03-10 00:42:08 +0000 (Wed, 10 Mar 2010) $ $Revision: 6786 $
Author:
stack

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.io.ArchiveReader
ArchiveReader.ArchiveRecordIterator, ArchiveReader.RandomAccessBufferedInputStream
 
Field Summary
static java.lang.String[] HEADER_FIELD_NAME_KEYS
          An array of the header field names found in the ARC file header on the 3rd line.
(package private)  java.util.logging.Logger logger
           
 
Fields inherited from class org.archive.io.ArchiveReader
MAX_ALLOWED_RECOVERABLES
 
Fields inherited from interface org.archive.io.arc.ARCConstants
ARC_FILE_EXTENSION, ARC_GZIP_EXTRA_FIELD, ARC_MAGIC_NUMBER, CHECKSUM_FIELD_KEY, CHECKSUM_HEADER_FIELD_KEY, CODE_HEADER_FIELD_KEY, COMPRESSED_ARC_FILE_EXTENSION, DEFAULT_ENCODING, DEFAULT_GZIP_HEADER_LENGTH, DEFAULT_MAX_ARC_FILE_SIZE, DOT_ARC_FILE_EXTENSION, DOT_COMPRESSED_ARC_FILE_EXTENSION, DOT_COMPRESSED_FILE_EXTENSION, FILENAME_FIELD_KEY, FILENAME_HEADER_FIELD_KEY, GZIP_HEADER_BEGIN, HEADER_FIELD_SEPARATOR, IP_HEADER_FIELD_KEY, LINE_SEPARATOR, LOCATION_HEADER_FIELD_KEY, MAX_METADATA_LINE_LENGTH, MINIMUM_RECORD_LENGTH, OFFSET_FIELD_KEY, OFFSET_HEADER_FIELD_KEY, REQUIRED_VERSION_1_HEADER_FIELDS, STATUSCODE_FIELD_KEY, TOKENIZED_PREFIX
 
Fields inherited from interface org.archive.io.ArchiveFileConstants
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, COMPRESSED_FILE_EXTENSION, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY
 
Constructor Summary
ARCReader()
           
 
Method Summary
protected  ARCRecord createArchiveRecord(java.io.InputStream is, long offset)
          Create new arc record.
static void createCDXIndexFile(java.lang.String urlOrPath)
          Generate a CDX index file for an ARC file.
 void dump(boolean compress)
          Dump this file to STDOUT.
protected  java.util.List<java.lang.String> fixSpaceInURL(java.util.List<java.lang.String> values, int requiredSize)
          Fix space in URLs.
 ARCReader getDeleteFileOnCloseReader(java.io.File f)
           
 java.lang.String getDotFileExtension()
           
 java.lang.String getFileExtension()
           
 java.lang.String getVersion()
          Returns version of this ARC file.
protected  void gotoEOR(ArchiveRecord record)
          Skip over any trailing new lines at end of the record so we're lined up ready to read the next.
protected  boolean isAlignedOnFirstRecord()
           
protected  boolean isDate(java.lang.String date)
           
protected  boolean isLegitimateIPValue(java.lang.String ip)
           
protected  boolean isNumber(java.lang.String n)
           
 boolean isParseHttpHeaders()
           
static void main(java.lang.String[] args)
          Command-line interface to ARCReader.
protected static void output(ARCReader reader, java.lang.String format)
          Write out the arcfile.
protected  boolean output(java.lang.String format)
           
 boolean outputRecord(java.lang.String format)
          Output passed record using passed format specifier.
protected  void setAlignedOnFirstRecord(boolean alignedOnFirstRecord)
           
 void setParseHttpHeaders(boolean parse)
           
 
Methods inherited from class org.archive.io.ArchiveReader
cdxOutput, cleanupCurrentRecord, close, currentRecord, get, get, getCurrentRecord, getFileName, getIn, getInputStream, getInputStream, getLogger, getOptions, getReaderIdentifier, getStrippedFileName, getStrippedFileName, getTrueOrFalse, initialize, isCompressed, isDigest, isStrict, isValid, iterator, logStdErr, outputRecord, rewind, setCompressed, setDigest, setIn, setReaderIdentifier, setStrict, setVersion, stripExtension, validate, validate
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

logger

java.util.logging.Logger logger

HEADER_FIELD_NAME_KEYS

public static final java.lang.String[] HEADER_FIELD_NAME_KEYS
An array of the header field names found on the 3rd line of the ARC file header. These used to be read from the 3rd line of the first record in the arc file, but they are now hardcoded for the sake of performance.
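
The idea of a hardcoded key array can be sketched as follows. The key strings here are hypothetical placeholders, not the actual contents of HEADER_FIELD_NAME_KEYS, and the pairing logic is an assumption about how such an array might be used: each tokenized value of a record's metadata line is matched positionally with a key, so the names never need to be parsed from the file.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HeaderKeysSketch {
    // Hypothetical key names; the real ones are defined by ARCReader.
    static final String[] KEYS = {"url", "ip-address", "date", "content-type", "length"};

    /** Pairs each tokenized header value with the key at the same position. */
    static Map<String, String> toFields(List<String> values) {
        if (values.size() != KEYS.length) {
            throw new IllegalArgumentException(
                "Expected " + KEYS.length + " values but got " + values.size());
        }
        Map<String, String> fields = new LinkedHashMap<String, String>();
        for (int i = 0; i < KEYS.length; i++) {
            fields.put(KEYS[i], values.get(i));
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "http://example.com/ 192.0.2.1 20100310004208 text/html 1234";
        System.out.println(toFields(Arrays.asList(line.split(" "))));
    }
}
```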

Constructor Detail

ARCReader

ARCReader()
Method Detail

gotoEOR

protected void gotoEOR(ArchiveRecord record)
                throws java.io.IOException
Skip over any trailing new lines at end of the record so we're lined up ready to read the next.

Specified by:
gotoEOR in class ArchiveReader
Parameters:
record -
Throws:
java.io.IOException
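
The skipping step can be sketched with a PushbackInputStream. This is an illustration under stated assumptions rather than the ARCReader code: it treats only '\n' as a trailing new line, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

final class SkipNewlinesSketch {
    /**
     * Consumes consecutive '\n' bytes and returns how many were skipped.
     * The first byte that is not a newline is pushed back so the next
     * read starts at the following record.
     */
    static int skipTrailingNewlines(PushbackInputStream in) throws IOException {
        int skipped = 0;
        int c;
        while ((c = in.read()) == '\n') {
            skipped++;
        }
        if (c != -1) {
            in.unread(c);
        }
        return skipped;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "\n\nnext record".getBytes(StandardCharsets.US_ASCII);
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(data));
        System.out.println("skipped " + skipTrailingNewlines(in));
        System.out.println("next byte: " + (char) in.read());
    }
}
```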

createArchiveRecord

protected ARCRecord createArchiveRecord(java.io.InputStream is,
                                        long offset)
                                 throws java.io.IOException
Create new arc record. Encapsulates the housekeeping involved in creating a new record.

Call this method at the end of the constructor to read in the arcfile header. Otherwise there will be problems reading subsequent arc records, since the arcfile header holds the list of metadata fields for all records that follow.

When parsing through ARCs and writing out CDX info, we spend about 38% of CPU time in here, about 30% of which is in getTokenizedHeaderLine, of which 16% is reading.

Specified by:
createArchiveRecord in class ArchiveReader
Parameters:
is - InputStream to use.
offset - Absolute offset into arc file.
Returns:
An arc record.
Throws:
java.io.IOException

getVersion

public java.lang.String getVersion()
Returns version of this ARC file. The version is usually read from the first record of the ARC. If we're reading without having first read the first record (e.g. random access into the middle of an ARC), the version will not have been set. For now, we return a default, version 1.1. Later, if there is more than one version of ARC, we could look at something such as the meta line to determine which version this is.

Overrides:
getVersion in class ArchiveReader
Returns:
Version of this ARC file.
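
The fallback behaviour can be sketched as below. Only the default of 1.1 comes from the description above; the class, field, and setter are hypothetical and stand in for whatever state the reader keeps.

```java
final class VersionFallbackSketch {
    // Default taken from the description above.
    static final String DEFAULT_VERSION = "1.1";

    // Stays null until the first record of the file has been parsed.
    private String version;

    void setVersion(String version) {
        this.version = version;
    }

    /** Returns the parsed version, or the default if none was read. */
    String getVersion() {
        return version != null ? version : DEFAULT_VERSION;
    }

    public static void main(String[] args) {
        VersionFallbackSketch reader = new VersionFallbackSketch();
        // Random access into the middle of a file: no version was read.
        System.out.println(reader.getVersion());
        // Illustrative value, as if read from the first record.
        reader.setVersion("1.0");
        System.out.println(reader.getVersion());
    }
}
```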

isDate

protected boolean isDate(java.lang.String date)

isNumber

protected boolean isNumber(java.lang.String n)

isLegitimateIPValue

protected boolean isLegitimateIPValue(java.lang.String ip)

fixSpaceInURL

protected java.util.List<java.lang.String> fixSpaceInURL(java.util.List<java.lang.String> values,
                                                         int requiredSize)
Fix space in URLs. The ARCWriter used to write URLs with spaces in them into the ARC. See [ 1010966 ] crawl.log has URIs with spaces in them. This method fixes up such headers, converting all spaces found to '%20'.

Parameters:
values - List of metadata values.
requiredSize - Expected size of resultant values list.
Returns:
New list if we successfully fixed up values or original if fixup failed.
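
A minimal sketch of this kind of fixup follows. It rests on assumptions that are not confirmed by this page: that the metadata line was split on spaces, that the URL is the first field and is followed by an IP address and a 14-digit date, and that the simplified regular expressions below are adequate sanity checks. The names are hypothetical and the real method may validate differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class FixSpaceInUrlSketch {
    // Simplified checks, assumed for illustration only.
    private static final Pattern DATE = Pattern.compile("\\d{14}");
    private static final Pattern IPV4 = Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3}");

    /**
     * Rejoins URL fragments that were split on spaces. Returns a new list
     * of requiredSize values on success, or the original list unchanged.
     */
    static List<String> fixSpaceInUrl(List<String> values, int requiredSize) {
        int extra = values.size() - requiredSize;
        if (extra <= 0 || requiredSize < 3) {
            return values; // Nothing to fix, or too few fields to validate.
        }
        // The fields right after the URL fragments should be IP and date.
        String ip = values.get(extra + 1);
        String date = values.get(extra + 2);
        if (!IPV4.matcher(ip).matches() || !DATE.matcher(date).matches()) {
            return values; // Layout is not what we expect; give up.
        }
        List<String> fixed = new ArrayList<String>(requiredSize);
        fixed.add(String.join("%20", values.subList(0, extra + 1)));
        fixed.addAll(values.subList(extra + 1, values.size()));
        return fixed;
    }

    public static void main(String[] args) {
        List<String> broken = Arrays.asList(
            "http://example.com/a", "b.html", "192.0.2.1",
            "20100310004208", "text/html", "1234");
        System.out.println(fixSpaceInUrl(broken, 5));
    }
}
```

Returning the original list on any failed check mirrors the documented contract: callers can compare sizes afterward to tell whether the fixup succeeded.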

isAlignedOnFirstRecord

protected boolean isAlignedOnFirstRecord()

setAlignedOnFirstRecord

protected void setAlignedOnFirstRecord(boolean alignedOnFirstRecord)

isParseHttpHeaders

public boolean isParseHttpHeaders()
Returns:
Returns the parseHttpHeaders.

setParseHttpHeaders

public void setParseHttpHeaders(boolean parse)
Parameters:
parse - The parseHttpHeaders to set.

getFileExtension

public java.lang.String getFileExtension()
Specified by:
getFileExtension in class ArchiveReader

getDotFileExtension

public java.lang.String getDotFileExtension()
Specified by:
getDotFileExtension in class ArchiveReader

output

protected boolean output(java.lang.String format)
                  throws java.io.IOException,
                         java.text.ParseException
Overrides:
output in class ArchiveReader
Parameters:
format - Format to use outputting.
Returns:
True if handled.
Throws:
java.io.IOException
java.text.ParseException

outputRecord

public boolean outputRecord(java.lang.String format)
                     throws java.io.IOException
Description copied from class: ArchiveReader
Output passed record using passed format specifier.

Overrides:
outputRecord in class ArchiveReader
Parameters:
format - What format to use outputting.
Returns:
True if handled.
Throws:
java.io.IOException

dump

public void dump(boolean compress)
          throws java.io.IOException,
                 java.text.ParseException
Description copied from class: ArchiveReader
Dump this file to STDOUT.

Specified by:
dump in class ArchiveReader
Throws:
java.io.IOException
java.text.ParseException

getDeleteFileOnCloseReader

public ARCReader getDeleteFileOnCloseReader(java.io.File f)
Specified by:
getDeleteFileOnCloseReader in class ArchiveReader
Returns:
An ArchiveReader that will delete a local file on close. Used when we bring Archive files local and need to clean up afterward.

output

protected static void output(ARCReader reader,
                             java.lang.String format)
                      throws java.io.IOException,
                             java.text.ParseException
Write out the arcfile.

Parameters:
reader -
format - Format to use outputting.
Throws:
java.io.IOException
java.text.ParseException

createCDXIndexFile

public static void createCDXIndexFile(java.lang.String urlOrPath)
                               throws java.io.IOException,
                                      java.text.ParseException
Generate a CDX index file for an ARC file.

Parameters:
urlOrPath - The ARC file to generate a CDX index for
Throws:
java.io.IOException
java.text.ParseException

main

public static void main(java.lang.String[] args)
                 throws org.apache.commons.cli.ParseException,
                        java.io.IOException,
                        java.text.ParseException
Command-line interface to ARCReader. Its usage message is:
 usage: java org.archive.io.arc.ARCReader [--offset=#] ARCFILE
  -h,--help      Prints this message and exits.
  -o,--offset    Outputs record at this offset into arc file.

See $HERITRIX_HOME/bin/arcreader for a script that takes care of classpaths and of calling ARCReader.

Outputs using a pseudo-CDX format as described here: CDX Legend and here: Example. The legend used below is: 'CDX b e a m s c V (or v if uncompressed) n g'. The hash is hard-coded as a straight SHA-1 hash of the content.

Parameters:
args - Command-line arguments.
Throws:
org.apache.commons.cli.ParseException - Failed parse of the command line.
java.io.IOException
java.text.ParseException
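
A straight SHA-1 hash of content can be computed with java.security.MessageDigest, as in the sketch below. The hexadecimal rendering is an assumption made for readability; this page does not say how ARCReader encodes the digest in its output, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Sha1Sketch {
    /** Returns the SHA-1 digest of content as lowercase hexadecimal. */
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes format as two hex digits.
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-1.
            throw new IllegalStateException("SHA-1 unavailable", e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sha1Hex(content));
    }
}
```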


Copyright © 2003-2011 Internet Archive. All Rights Reserved.