org.archive.io.warc
Class WARCRecord
java.lang.Object
java.io.InputStream
org.archive.io.ArchiveRecord
org.archive.io.warc.WARCRecord
- All Implemented Interfaces:
- java.io.Closeable, ArchiveFileConstants, WARCConstants
public class WARCRecord
- extends ArchiveRecord
- implements WARCConstants
A WARC file Record.
- Author:
- stack
Fields inherited from interface org.archive.io.warc.WARCConstants |
COLON_SPACE, COMPRESSED_WARC_FILE_EXTENSION, CONTENT_DESCRIPTION, CONTENT_LENGTH, CONTENT_TYPE, CONTINUATION, CONTINUATION_INDEX, CONVERSION, CONVERSION_INDEX, DEFAULT_ENCODING, DEFAULT_MAX_WARC_FILE_SIZE, DOT_COMPRESSED_FILE_EXTENSION, DOT_COMPRESSED_WARC_FILE_EXTENSION, DOT_WARC_FILE_EXTENSION, FTP_CONTROL_CONVERSATION_MIMETYPE, HEADER_FIELD_KEYS, HEADER_FIELD_SEPARATOR, HEADER_KEY_BLOCK_DIGEST, HEADER_KEY_CONCURRENT_TO, HEADER_KEY_DATE, HEADER_KEY_ETAG, HEADER_KEY_FILENAME, HEADER_KEY_ID, HEADER_KEY_IP, HEADER_KEY_LAST_MODIFIED, HEADER_KEY_PAYLOAD_DIGEST, HEADER_KEY_PROFILE, HEADER_KEY_TRUNCATED, HEADER_KEY_TYPE, HEADER_KEY_URI, HEADER_LINE_ENCODING, HTTP_REQUEST_MIMETYPE, HTTP_RESPONSE_MIMETYPE, MAX_LINE_LENGTH, MAX_WARC_HEADER_LINE_LENGTH, METADATA, METADATA_INDEX, NAMED_FIELD_CHECKSUM_LABEL, NAMED_FIELD_DESCRIPTION, NAMED_FIELD_FILEDESC, NAMED_FIELD_IP_LABEL, NAMED_FIELD_RELATED_LABEL, NAMED_FIELD_TRUNCATED, NAMED_FIELD_TRUNCATED_VALUE_HEAD, NAMED_FIELD_TRUNCATED_VALUE_LENGTH, NAMED_FIELD_TRUNCATED_VALUE_TIME, NAMED_FIELD_TRUNCATED_VALUE_UNSPECIFIED, NAMED_FIELD_WARCFILENAME, PLACEHOLDER_RECORD_LENGTH_STRING, PROFILE_REVISIT_IDENTICAL_DIGEST, PROFILE_REVISIT_NOT_MODIFIED, REQUEST, REQUEST_INDEX, RESOURCE, RESOURCE_INDEX, RESPONSE, RESPONSE_INDEX, REVISIT, REVISIT_INDEX, TRUNCATED_VALUE_UNSPECIFIED, TYPE, TYPES, TYPES_LIST, WARC_010_ID, WARC_010_MAGIC, WARC_FILE_EXTENSION, WARC_HEADER_ENCODING, WARC_ID, WARC_MAGIC, WARC_VERSION, WARCINFO, WARCINFO_INDEX, WSP |
Fields inherited from interface org.archive.io.ArchiveFileConstants |
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, COMPRESSED_FILE_EXTENSION, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY |
Constructor Summary |
WARCRecord(java.io.InputStream in,
ArchiveRecordHeader headers)
Constructor. |
WARCRecord(java.io.InputStream in,
java.lang.String identifier,
long offset)
Constructor. |
WARCRecord(java.io.InputStream in,
java.lang.String identifier,
long offset,
boolean digest,
boolean strict)
Constructor. |
Methods inherited from class org.archive.io.ArchiveRecord |
available, close, dump, dump, getDigest4Cdx, getDigestStr, getHeader, getIn, getIp4Cdx, getPosition, getStatusCode4Cdx, incrementPosition, incrementPosition, isEor, isStrict, markSupported, outputCdx, read, read, setEor, setHeader, setStrict, skip |
Methods inherited from class java.io.InputStream |
mark, read, reset |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
WARCRecord
public WARCRecord(java.io.InputStream in,
java.lang.String identifier,
long offset)
throws java.io.IOException
- Constructor.
- Parameters:
in
- Stream cued up at the start of the record this instance
is to represent.
- Throws:
java.io.IOException
WARCRecord
public WARCRecord(java.io.InputStream in,
ArchiveRecordHeader headers)
throws java.io.IOException
- Constructor.
- Parameters:
in
- Stream cued up just past the Header Line and Named Fields.
headers
- Header Line and ANVL Named Fields.
- Throws:
java.io.IOException
WARCRecord
public WARCRecord(java.io.InputStream in,
java.lang.String identifier,
long offset,
boolean digest,
boolean strict)
throws java.io.IOException
- Constructor.
- Parameters:
in
- Stream cued up at the start of the record this instance
is to represent or, if headers is not null, just past the
Header Line and Named Fields.
identifier
- Identifier for the hosting Reader.
offset
- Current offset into in (used to keep position properly aligned).
Usually 0.
digest
- True if a digest is to be calculated for this record. Not
digesting saves roughly 15% of CPU during parsing.
strict
- Parse strictly (parsing stops if the file is improperly
formatted).
- Throws:
java.io.IOException
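
The digest flag described above trades CPU for a content checksum. The following is a minimal, JDK-only sketch of that trade-off and is not the WARCRecord implementation: the class DigestToggleSketch and its drain method are hypothetical names, and the choice of SHA-1 is an assumption made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of org.archive.io.
final class DigestToggleSketch {

    /**
     * Reads the stream to its end. When digest is true the bytes are also
     * fed through a SHA-1 MessageDigest (assumed algorithm) and the hex
     * digest is returned; when false the hashing work is skipped and null
     * is returned.
     */
    static String drain(InputStream in, boolean digest)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = digest ? MessageDigest.getInstance("SHA-1") : null;
        InputStream src = digest ? new DigestInputStream(in, md) : in;
        byte[] buf = new byte[4096];
        while (src.read(buf) != -1) {
            // Nothing to do: reading alone updates the digest, if any.
        }
        if (md == null) {
            return null;
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] body = "abc".getBytes(StandardCharsets.US_ASCII);
        // With digesting enabled the SHA-1 of the content is returned.
        System.out.println(drain(new ByteArrayInputStream(body), true));
        // With digesting disabled no digest is computed.
        System.out.println(drain(new ByteArrayInputStream(body), false));
    }
}
```

Wrapping the stream only when the flag is set keeps the non-digesting path free of any hashing cost, which is the kind of saving the parameter description refers to.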
parseHeaders
protected ArchiveRecordHeader parseHeaders(java.io.InputStream in,
java.lang.String identifier,
long offset,
boolean strict)
throws java.io.IOException
- Parse WARC Header Line and Named Fields.
- Parameters:
in
- Stream to read.
identifier
- Identifier for the hosting Reader.
offset
- Absolute offset into the Reader.
strict
- Whether to parse strictly (true) or loosely (false).
- Returns:
- An ArchiveRecordHeader.
- Throws:
java.io.IOException
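
To illustrate what parsing a header line followed by named fields can look like, here is a minimal, JDK-only sketch. It is not the library's parseHeaders implementation: it assumes a version line beginning with "WARC/", then "Name: value" lines ended by a blank line, and it returns a plain Map rather than an ArchiveRecordHeader. The class HeaderParseSketch, its methods, and the "version" map key are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of org.archive.io.
final class HeaderParseSketch {

    /** Reads one LF-terminated line, dropping a trailing CR; null at end of stream. */
    static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean sawAny = false;
        int c;
        while ((c = in.read()) != -1) {
            sawAny = true;
            if (c == '\n') {
                break;
            }
            sb.append((char) c); // header bytes are treated as single-byte characters
        }
        if (!sawAny) {
            return null;
        }
        int n = sb.length();
        if (n > 0 && sb.charAt(n - 1) == '\r') {
            sb.setLength(n - 1);
        }
        return sb.toString();
    }

    /**
     * Parses the header line and named fields, leaving the stream positioned
     * just past the blank line that ends the header block. In strict mode a
     * malformed field line stops parsing with an IOException; otherwise the
     * line is skipped.
     */
    static Map<String, String> parse(InputStream in, boolean strict) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        String first = readLine(in);
        if (first == null || !first.startsWith("WARC/")) {
            throw new IOException("Missing WARC header line");
        }
        fields.put("version", first); // hypothetical key for the header line
        String line;
        while ((line = readLine(in)) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                if (strict) {
                    throw new IOException("Malformed named field: " + line);
                }
                continue; // loose parsing: ignore the bad line
            }
            fields.put(line.substring(0, colon).trim(),
                       line.substring(colon + 1).trim());
        }
        return fields;
    }

    public static void main(String[] args) throws IOException {
        String record = "WARC/1.0\r\nWARC-Type: response\r\nContent-Length: 5\r\n\r\nhello";
        InputStream in = new ByteArrayInputStream(record.getBytes(StandardCharsets.US_ASCII));
        System.out.println(parse(in, true));
    }
}
```

After parse returns, the stream sits at the first byte of the record body, which matches the "cued up just past the Header Line and Named Fields" state that the two-argument constructor expects.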
getMimetype4Cdx
protected java.lang.String getMimetype4Cdx(ArchiveRecordHeader h)
- Overrides:
getMimetype4Cdx in class ArchiveRecord
Copyright © 2003-2011 Internet Archive. All Rights Reserved.