org.archive.io.arc
Class ARCRecord

java.lang.Object
  extended by java.io.InputStream
      extended by org.archive.io.ArchiveRecord
          extended by org.archive.io.arc.ARCRecord
All Implemented Interfaces:
java.io.Closeable, ARCConstants, ArchiveFileConstants

public class ARCRecord
extends ArchiveRecord
implements ARCConstants

An ARC file record. Does not encompass the ARCRecord metadata line, just the record content.

Author:
stack

Field Summary
 
Fields inherited from class org.archive.io.ArchiveRecord
digest
 
Fields inherited from interface org.archive.io.arc.ARCConstants
ARC_FILE_EXTENSION, ARC_GZIP_EXTRA_FIELD, ARC_MAGIC_NUMBER, CHECKSUM_FIELD_KEY, CHECKSUM_HEADER_FIELD_KEY, CODE_HEADER_FIELD_KEY, COMPRESSED_ARC_FILE_EXTENSION, DEFAULT_ENCODING, DEFAULT_GZIP_HEADER_LENGTH, DEFAULT_MAX_ARC_FILE_SIZE, DOT_ARC_FILE_EXTENSION, DOT_COMPRESSED_ARC_FILE_EXTENSION, DOT_COMPRESSED_FILE_EXTENSION, FILENAME_FIELD_KEY, FILENAME_HEADER_FIELD_KEY, GZIP_HEADER_BEGIN, HEADER_FIELD_SEPARATOR, IP_HEADER_FIELD_KEY, LINE_SEPARATOR, LOCATION_HEADER_FIELD_KEY, MAX_METADATA_LINE_LENGTH, MINIMUM_RECORD_LENGTH, OFFSET_FIELD_KEY, OFFSET_HEADER_FIELD_KEY, REQUIRED_VERSION_1_HEADER_FIELDS, STATUSCODE_FIELD_KEY, TOKENIZED_PREFIX
 
Fields inherited from interface org.archive.io.ArchiveFileConstants
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, COMPRESSED_FILE_EXTENSION, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY
 
Constructor Summary
ARCRecord(java.io.InputStream in, ArchiveRecordHeader metaData)
          Constructor.
ARCRecord(java.io.InputStream in, ArchiveRecordHeader metaData, int bodyOffset, boolean digest, boolean strict, boolean parseHttpHeaders)
          Constructor.
 
Method Summary
 void dumpHttpHeader()
           
 int getBodyOffset()
           
protected  java.lang.String getDigest4Cdx(ArchiveRecordHeader h)
           
 java.lang.String getHeaderString()
           
 org.apache.commons.httpclient.Header[] getHttpHeaders()
           
protected  java.lang.String getIp4Cdx(ArchiveRecordHeader h)
           
 ARCRecordMetaData getMetaData()
           
 int getStatusCode()
          Return status code for this record.
protected  java.lang.String getStatusCode4Cdx(ArchiveRecordHeader h)
           
 int read()
           
 int read(byte[] b, int offset, int length)
           
 void skipHttpHeader()
          Skip over the HTTP header if one is present.
 
Methods inherited from class org.archive.io.ArchiveRecord
available, close, dump, dump, getDigestStr, getHeader, getIn, getMimetype4Cdx, getPosition, incrementPosition, incrementPosition, isEor, isStrict, markSupported, outputCdx, setEor, setHeader, setStrict, skip
 
Methods inherited from class java.io.InputStream
mark, read, reset
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ARCRecord

public ARCRecord(java.io.InputStream in,
                 ArchiveRecordHeader metaData)
          throws java.io.IOException
Constructor.

Parameters:
in - Stream cued up to the start of the record this instance is to represent.
metaData - Meta data.
Throws:
java.io.IOException

ARCRecord

public ARCRecord(java.io.InputStream in,
                 ArchiveRecordHeader metaData,
                 int bodyOffset,
                 boolean digest,
                 boolean strict,
                 boolean parseHttpHeaders)
          throws java.io.IOException
Constructor.

Parameters:
in - Stream cued up to the start of the record this instance is to represent.
metaData - Meta data.
bodyOffset - Offset into the body. Usually 0.
digest - True if we are to calculate a digest for this record. Not digesting saves about 15% of CPU during an ARC parse.
strict - Be strict when parsing (parsing stops if the ARC is improperly formatted).
parseHttpHeaders - True if we are to parse HTTP headers. Costs about 20% of CPU during an ARC parse.
Throws:
java.io.IOException
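
The digest flag trades CPU for a content hash computed while the record is read. The following is a minimal, standalone sketch of that idea using only the JDK; it is not the ARCRecord implementation. The class name DigestWhileReadingSketch, the method readAndDigest, and the choice of SHA-1 are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in; not part of org.archive.io.
class DigestWhileReadingSketch {

    /**
     * Drains the stream. When digest is true, every byte read also updates
     * a SHA-1 digest (algorithm chosen here as an assumption) and the hex
     * form is returned. When digest is false, returns null and skips the
     * hashing work entirely.
     */
    public static String readAndDigest(InputStream in, boolean digest)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = digest ? MessageDigest.getInstance("SHA-1") : null;
        InputStream src = digest ? new DigestInputStream(in, md) : in;
        byte[] buf = new byte[4096];
        while (src.read(buf, 0, buf.length) != -1) {
            // Reading through DigestInputStream updates the digest as a side effect.
        }
        if (md == null) {
            return null;
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] content = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readAndDigest(new ByteArrayInputStream(content), true));
        System.out.println(readAndDigest(new ByteArrayInputStream(content), false));
    }
}
```

Passing false avoids creating the MessageDigest at all, which is the kind of saving the digest parameter description refers to.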
Method Detail

getHeaderString

public java.lang.String getHeaderString()

skipHttpHeader

public void skipHttpHeader()
                    throws java.io.IOException
Skip over the HTTP header if one is present. Subsequent reads will get the body.

Calling this method in the midst of reading the header will produce unpredictable results. Otherwise it is safe to call at any time, though it only makes sense to call it before reading any of the ARC record content.

After calling this method, you can call getHttpHeaders() to get the HTTP header that was read.

Throws:
java.io.IOException
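
To make the "subsequent reads will get the body" contract concrete, here is a minimal standalone sketch using plain java.io. It does not call ARCRecord and is not how the library is implemented. The names HeaderSkipSketch, skipHeader, and readRest are hypothetical, and the sketch assumes the header ends at the first blank line (CRLF CRLF).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in; not part of org.archive.io.
class HeaderSkipSketch {

    /**
     * Consumes bytes up to and including the first CRLF CRLF sequence.
     * Returns the number of bytes consumed (the offset at which the body
     * begins), or -1 if the stream ended without a blank line.
     */
    public static int skipHeader(InputStream in) throws IOException {
        final byte[] end = {'\r', '\n', '\r', '\n'};
        int consumed = 0;
        int matched = 0; // how many bytes of the terminator matched so far
        int c;
        while ((c = in.read()) != -1) {
            consumed++;
            if (c == end[matched]) {
                matched++;
                if (matched == end.length) {
                    return consumed;
                }
            } else {
                // A stray '\r' may start a new terminator; anything else resets.
                matched = (c == '\r') ? 1 : 0;
            }
        }
        return -1;
    }

    /** Reads whatever is left in the stream as an ISO-8859-1 string. */
    public static String readRest(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int c;
        while ((c = in.read()) != -1) {
            out.write(c);
        }
        return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        String record = "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello";
        InputStream in = new ByteArrayInputStream(
                record.getBytes(StandardCharsets.ISO_8859_1));
        int bodyOffset = skipHeader(in);
        System.out.println("body starts at " + bodyOffset);
        System.out.println("body: " + readRest(in));
    }
}
```

The returned count plays the same role as the value described under getBodyOffset(): the length of the HTTP header block, not counting any ARC metadata line.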

dumpHttpHeader

public void dumpHttpHeader()
                    throws java.io.IOException
Throws:
java.io.IOException

getStatusCode

public int getStatusCode()
Return status code for this record. This method will return -1 until the HTTP header has been read.

Returns:
Status code.
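
Because -1 is a sentinel meaning "header not read yet", callers should check for it before using the value. The sketch below illustrates that contract with a hypothetical stand-in class; StatusCodeSketch and readStatusLine are invented for illustration and do not reflect how ARCRecord parses headers.

```java
// Hypothetical stand-in; not part of org.archive.io.
class StatusCodeSketch {

    // -1 means the status line has not been read (or could not be parsed).
    private int statusCode = -1;

    /** Parses a status line such as "HTTP/1.1 200 OK" and stores the code. */
    public void readStatusLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length >= 2 && parts[0].startsWith("HTTP/")) {
            try {
                statusCode = Integer.parseInt(parts[1]);
            } catch (NumberFormatException e) {
                statusCode = -1; // keep the "unknown" sentinel
            }
        }
    }

    public int getStatusCode() {
        return statusCode;
    }

    public static void main(String[] args) {
        StatusCodeSketch record = new StatusCodeSketch();
        System.out.println(record.getStatusCode()); // sentinel before parsing
        record.readStatusLine("HTTP/1.1 404 Not Found");
        System.out.println(record.getStatusCode()); // parsed code
    }
}
```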

getMetaData

public ARCRecordMetaData getMetaData()
Returns:
Meta data for this record.

getHttpHeaders

public org.apache.commons.httpclient.Header[] getHttpHeaders()
Returns:
HTTP headers (only available after the header has been read).

read

public int read()
         throws java.io.IOException
Overrides:
read in class ArchiveRecord
Returns:
Next byte in this ARCRecord's content, or -1 if at the end of this record.
Throws:
java.io.IOException

read

public int read(byte[] b,
                int offset,
                int length)
         throws java.io.IOException
Overrides:
read in class ArchiveRecord
Throws:
java.io.IOException

getBodyOffset

public int getBodyOffset()
Returns:
Offset at which the body begins (only known after the header has been read), or -1 if there is none or if the headers have not been read yet. Usually the length of the HTTP headers (does not include the ARC metadata line length).

getIp4Cdx

protected java.lang.String getIp4Cdx(ArchiveRecordHeader h)
Overrides:
getIp4Cdx in class ArchiveRecord

getStatusCode4Cdx

protected java.lang.String getStatusCode4Cdx(ArchiveRecordHeader h)
Overrides:
getStatusCode4Cdx in class ArchiveRecord

getDigest4Cdx

protected java.lang.String getDigest4Cdx(ArchiveRecordHeader h)
Overrides:
getDigest4Cdx in class ArchiveRecord


Copyright © 2003-2011 Internet Archive. All Rights Reserved.