org.archive.io.warc
Class WARCRecord

java.lang.Object
  extended by java.io.InputStream
      extended by org.archive.io.ArchiveRecord
          extended by org.archive.io.warc.WARCRecord
All Implemented Interfaces:
java.io.Closeable, ArchiveFileConstants, WARCConstants

public class WARCRecord
extends ArchiveRecord
implements WARCConstants

A WARC file Record.

Author:
stack

Field Summary
 
Fields inherited from class org.archive.io.ArchiveRecord
digest
 
Fields inherited from interface org.archive.io.warc.WARCConstants
COLON_SPACE, COMPRESSED_WARC_FILE_EXTENSION, CONTENT_DESCRIPTION, CONTENT_LENGTH, CONTENT_TYPE, CONTINUATION, CONTINUATION_INDEX, CONVERSION, CONVERSION_INDEX, DEFAULT_ENCODING, DEFAULT_MAX_WARC_FILE_SIZE, DOT_COMPRESSED_FILE_EXTENSION, DOT_COMPRESSED_WARC_FILE_EXTENSION, DOT_WARC_FILE_EXTENSION, FTP_CONTROL_CONVERSATION_MIMETYPE, HEADER_FIELD_KEYS, HEADER_FIELD_SEPARATOR, HEADER_KEY_BLOCK_DIGEST, HEADER_KEY_CONCURRENT_TO, HEADER_KEY_DATE, HEADER_KEY_ETAG, HEADER_KEY_FILENAME, HEADER_KEY_ID, HEADER_KEY_IP, HEADER_KEY_LAST_MODIFIED, HEADER_KEY_PAYLOAD_DIGEST, HEADER_KEY_PROFILE, HEADER_KEY_TRUNCATED, HEADER_KEY_TYPE, HEADER_KEY_URI, HEADER_LINE_ENCODING, HTTP_REQUEST_MIMETYPE, HTTP_RESPONSE_MIMETYPE, MAX_LINE_LENGTH, MAX_WARC_HEADER_LINE_LENGTH, METADATA, METADATA_INDEX, NAMED_FIELD_CHECKSUM_LABEL, NAMED_FIELD_DESCRIPTION, NAMED_FIELD_FILEDESC, NAMED_FIELD_IP_LABEL, NAMED_FIELD_RELATED_LABEL, NAMED_FIELD_TRUNCATED, NAMED_FIELD_TRUNCATED_VALUE_HEAD, NAMED_FIELD_TRUNCATED_VALUE_LENGTH, NAMED_FIELD_TRUNCATED_VALUE_TIME, NAMED_FIELD_TRUNCATED_VALUE_UNSPECIFIED, NAMED_FIELD_WARCFILENAME, PLACEHOLDER_RECORD_LENGTH_STRING, PROFILE_REVISIT_IDENTICAL_DIGEST, PROFILE_REVISIT_NOT_MODIFIED, REQUEST, REQUEST_INDEX, RESOURCE, RESOURCE_INDEX, RESPONSE, RESPONSE_INDEX, REVISIT, REVISIT_INDEX, TRUNCATED_VALUE_UNSPECIFIED, TYPE, TYPES, TYPES_LIST, WARC_010_ID, WARC_010_MAGIC, WARC_FILE_EXTENSION, WARC_HEADER_ENCODING, WARC_ID, WARC_MAGIC, WARC_VERSION, WARCINFO, WARCINFO_INDEX, WSP
 
Fields inherited from interface org.archive.io.ArchiveFileConstants
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, COMPRESSED_FILE_EXTENSION, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY
 
Constructor Summary
WARCRecord(java.io.InputStream in, ArchiveRecordHeader headers)
          Creates a record from a stream positioned just past headers that have already been parsed.
WARCRecord(java.io.InputStream in, java.lang.String identifier, long offset)
          Creates a record from a stream positioned at the start of the record.
WARCRecord(java.io.InputStream in, java.lang.String identifier, long offset, boolean digest, boolean strict)
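          Creates a record from a stream positioned at the start of the record, with optional digesting and strict parsing.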
 
Method Summary
protected  java.lang.String getMimetype4Cdx(ArchiveRecordHeader h)
           
protected  ArchiveRecordHeader parseHeaders(java.io.InputStream in, java.lang.String identifier, long offset, boolean strict)
          Parse WARC Header Line and Named Fields.
 
Methods inherited from class org.archive.io.ArchiveRecord
available, close, dump, dump, getDigest4Cdx, getDigestStr, getHeader, getIn, getIp4Cdx, getPosition, getStatusCode4Cdx, incrementPosition, incrementPosition, isEor, isStrict, markSupported, outputCdx, read, read, setEor, setHeader, setStrict, skip
 
Methods inherited from class java.io.InputStream
mark, read, reset
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WARCRecord

public WARCRecord(java.io.InputStream in,
                  java.lang.String identifier,
                  long offset)
           throws java.io.IOException
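Creates a record from a stream positioned at the start of the record.

Parameters:
in - Stream cued up to the start of the record this instance is to represent.
identifier - Identifier for the hosting Reader.
offset - Current offset into in (used to keep position properly aligned). Usually 0.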
Throws:
java.io.IOException

WARCRecord

public WARCRecord(java.io.InputStream in,
                  ArchiveRecordHeader headers)
           throws java.io.IOException
Creates a record from a stream whose headers have already been parsed.

Parameters:
in - Stream cued up just past the Header Line and Named Fields.
headers - Header Line and ANVL Named Fields.
Throws:
java.io.IOException

WARCRecord

public WARCRecord(java.io.InputStream in,
                  java.lang.String identifier,
                  long offset,
                  boolean digest,
                  boolean strict)
           throws java.io.IOException
Creates a record from a stream positioned at the start of the record, with optional digesting and strict parsing.

Parameters:
in - Stream cued up to the start of the record this instance is to represent. (To supply a stream positioned just past the Header Line and Named Fields, use the constructor that takes an ArchiveRecordHeader.)
identifier - Identifier for the hosting Reader.
offset - Current offset into in (used to keep position properly aligned). Usually 0.
digest - True to calculate a digest for this record. Not digesting saves about 15% of CPU during parsing.
strict - True to parse strictly (parsing stops if the file is improperly formatted).
Throws:
java.io.IOException
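
The digest flag trades CPU for a checksum of the record body. The following is a minimal sketch of that trade-off using only the JDK, not the WARCRecord implementation: DigestSketch and drain are hypothetical names, and SHA-1 is an assumption (this page does not state which algorithm DEFAULT_DIGEST_METHOD names).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration; not part of org.archive.io.
public class DigestSketch {

    /** Drains the stream; returns a hex digest, or null when digesting is off. */
    static String drain(InputStream in, boolean digest)
            throws IOException, NoSuchAlgorithmException {
        // Assumption: SHA-1 stands in for whatever algorithm the library uses.
        MessageDigest md = digest ? MessageDigest.getInstance("SHA-1") : null;
        InputStream src = digest ? new DigestInputStream(in, md) : in;

        byte[] buf = new byte[4096];
        while (src.read(buf) != -1) {
            // Reading through DigestInputStream updates md as a side effect,
            // which is the extra per-byte work that digest=false avoids.
        }
        if (md == null) {
            return null;
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] body = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(drain(new ByteArrayInputStream(body), true));
        System.out.println(drain(new ByteArrayInputStream(body), false));
    }
}
```

With digesting off the loop only copies bytes, which is consistent with the CPU saving noted for the digest parameter, although the exact figure depends on the workload.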
Method Detail

parseHeaders

protected ArchiveRecordHeader parseHeaders(java.io.InputStream in,
                                           java.lang.String identifier,
                                           long offset,
                                           boolean strict)
                                    throws java.io.IOException
Parse WARC Header Line and Named Fields.

Parameters:
in - Stream to read.
identifier - Identifier for the hosting Reader.
offset - Absolute offset into Reader.
strict - True to parse strictly; false to parse leniently.
Returns:
An ArchiveRecordHeader.
Throws:
java.io.IOException
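
To make the header-parsing step concrete, here is a rough, self-contained sketch of reading a header line followed by "Name: value" fields up to a blank line. It is not the library's parseHeaders: WarcHeaderSketch, its methods, and the "version" map key are hypothetical, the header layout is simplified (it varies between WARC revisions), and the strict/lenient behaviour shown is an assumption based on the parameter description.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of org.archive.io.
public class WarcHeaderSketch {

    /** Reads one LF-terminated line (CR dropped); returns null at end of stream. */
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            sawAny = true;
            if (b == '\n') break;
            if (b != '\r') buf.write(b);
        }
        return sawAny ? new String(buf.toByteArray(), StandardCharsets.UTF_8) : null;
    }

    /** Parses the header block, leaving the stream just past the blank line. */
    static Map<String, String> parseHeaders(InputStream in, boolean strict)
            throws IOException {
        String first = readLine(in);
        if (first == null || !first.startsWith("WARC/")) {
            throw new IOException("Not a WARC header line: " + first);
        }
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("version", first.trim()); // hypothetical key for the header line

        String line;
        while ((line = readLine(in)) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                // Assumed semantics: strict stops on malformed input, lenient skips it.
                if (strict) throw new IOException("Malformed named field: " + line);
                continue;
            }
            fields.put(line.substring(0, colon).trim(),
                       line.substring(colon + 1).trim());
        }
        return fields;
    }

    public static void main(String[] args) throws IOException {
        String record = "WARC/1.0\r\nWARC-Type: response\r\nbad line\r\n"
                + "Content-Length: 5\r\n\r\nhello";
        InputStream in = new ByteArrayInputStream(
                record.getBytes(StandardCharsets.UTF_8));
        System.out.println(parseHeaders(in, false));
    }
}
```

After the call the stream sits at the first byte of the record body, which matches the "just past Header Line and Named Fields" position expected by the constructor that takes an ArchiveRecordHeader.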

getMimetype4Cdx

protected java.lang.String getMimetype4Cdx(ArchiveRecordHeader h)
Overrides:
getMimetype4Cdx in class ArchiveRecord


Copyright © 2003-2011 Internet Archive. All Rights Reserved.