org.archive.io.warc
Class WARCReaderFactory.CompressedWARCReader

java.lang.Object
  extended by org.archive.io.ArchiveReader
      extended by org.archive.io.warc.WARCReader
          extended by org.archive.io.warc.WARCReaderFactory.CompressedWARCReader
All Implemented Interfaces:
ArchiveFileConstants, WARCConstants
Enclosing class:
WARCReaderFactory

public class WARCReaderFactory.CompressedWARCReader
extends WARCReader

Compressed WARC file reader.

Author:
stack

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.io.ArchiveReader
ArchiveReader.ArchiveRecordIterator, ArchiveReader.RandomAccessBufferedInputStream
 
Field Summary
 
Fields inherited from class org.archive.io.ArchiveReader
MAX_ALLOWED_RECOVERABLES
 
Fields inherited from interface org.archive.io.warc.WARCConstants
COLON_SPACE, COMPRESSED_WARC_FILE_EXTENSION, CONTENT_DESCRIPTION, CONTENT_LENGTH, CONTENT_TYPE, CONTINUATION, CONTINUATION_INDEX, CONVERSION, CONVERSION_INDEX, DEFAULT_ENCODING, DEFAULT_MAX_WARC_FILE_SIZE, DOT_COMPRESSED_FILE_EXTENSION, DOT_COMPRESSED_WARC_FILE_EXTENSION, DOT_WARC_FILE_EXTENSION, FTP_CONTROL_CONVERSATION_MIMETYPE, HEADER_FIELD_KEYS, HEADER_FIELD_SEPARATOR, HEADER_KEY_BLOCK_DIGEST, HEADER_KEY_CONCURRENT_TO, HEADER_KEY_DATE, HEADER_KEY_ETAG, HEADER_KEY_FILENAME, HEADER_KEY_ID, HEADER_KEY_IP, HEADER_KEY_LAST_MODIFIED, HEADER_KEY_PAYLOAD_DIGEST, HEADER_KEY_PROFILE, HEADER_KEY_TRUNCATED, HEADER_KEY_TYPE, HEADER_KEY_URI, HEADER_LINE_ENCODING, HTTP_REQUEST_MIMETYPE, HTTP_RESPONSE_MIMETYPE, MAX_LINE_LENGTH, MAX_WARC_HEADER_LINE_LENGTH, METADATA, METADATA_INDEX, NAMED_FIELD_CHECKSUM_LABEL, NAMED_FIELD_DESCRIPTION, NAMED_FIELD_FILEDESC, NAMED_FIELD_IP_LABEL, NAMED_FIELD_RELATED_LABEL, NAMED_FIELD_TRUNCATED, NAMED_FIELD_TRUNCATED_VALUE_HEAD, NAMED_FIELD_TRUNCATED_VALUE_LENGTH, NAMED_FIELD_TRUNCATED_VALUE_TIME, NAMED_FIELD_TRUNCATED_VALUE_UNSPECIFIED, NAMED_FIELD_WARCFILENAME, PLACEHOLDER_RECORD_LENGTH_STRING, PROFILE_REVISIT_IDENTICAL_DIGEST, PROFILE_REVISIT_NOT_MODIFIED, REQUEST, REQUEST_INDEX, RESOURCE, RESOURCE_INDEX, RESPONSE, RESPONSE_INDEX, REVISIT, REVISIT_INDEX, TRUNCATED_VALUE_UNSPECIFIED, TYPE, TYPES, TYPES_LIST, WARC_010_ID, WARC_010_MAGIC, WARC_FILE_EXTENSION, WARC_HEADER_ENCODING, WARC_ID, WARC_MAGIC, WARC_VERSION, WARCINFO, WARCINFO_INDEX, WSP
 
Fields inherited from interface org.archive.io.ArchiveFileConstants
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, COMPRESSED_FILE_EXTENSION, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY
 
Constructor Summary
WARCReaderFactory.CompressedWARCReader(java.io.File f)
          Creates a reader for the given compressed file.
WARCReaderFactory.CompressedWARCReader(java.io.File f, long offset)
          Creates a reader for the given compressed file, starting at the given offset.
WARCReaderFactory.CompressedWARCReader(java.lang.String f, java.io.InputStream is, boolean atFirstRecord)
          Creates a reader for the given compressed file, reading from the given InputStream.
 
Method Summary
 WARCRecord get(long offset)
          Gets the record at the given offset.
protected  void gotoEOR(ArchiveRecord rec)
          Skips any trailing newlines at the end of the record so that the reader is positioned to read the next record.
 java.util.Iterator<ArchiveRecord> iterator()
          Returns an ArchiveRecord iterator.
 
Methods inherited from class org.archive.io.warc.WARCReader
createArchiveRecord, createCDXIndexFile, dump, getDeleteFileOnCloseReader, getDotFileExtension, getFileExtension, initialize, main, output, readExpectedChar
 
Methods inherited from class org.archive.io.ArchiveReader
cdxOutput, cleanupCurrentRecord, close, currentRecord, get, getCurrentRecord, getFileName, getIn, getInputStream, getInputStream, getLogger, getOptions, getReaderIdentifier, getStrippedFileName, getStrippedFileName, getTrueOrFalse, getVersion, isCompressed, isDigest, isStrict, isValid, logStdErr, output, outputRecord, outputRecord, rewind, setCompressed, setDigest, setIn, setReaderIdentifier, setStrict, setVersion, stripExtension, validate, validate
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WARCReaderFactory.CompressedWARCReader

public WARCReaderFactory.CompressedWARCReader(java.io.File f)
                                       throws java.io.IOException
Creates a reader for the given compressed file.

Parameters:
f - Compressed file to read.
Throws:
java.io.IOException

WARCReaderFactory.CompressedWARCReader

public WARCReaderFactory.CompressedWARCReader(java.io.File f,
                                              long offset)
                                       throws java.io.IOException
Creates a reader for the given compressed file, starting at the given offset.

Parameters:
f - Compressed WARC file to read.
offset - Position at which to start reading the file.
Throws:
java.io.IOException

WARCReaderFactory.CompressedWARCReader

public WARCReaderFactory.CompressedWARCReader(java.lang.String f,
                                              java.io.InputStream is,
                                              boolean atFirstRecord)
                                       throws java.io.IOException
Creates a reader for the given compressed file, reading from the given InputStream.

Parameters:
f - Compressed WARC file.
is - InputStream to use.
atFirstRecord - True if the stream is positioned at the first record.
Throws:
java.io.IOException
Method Detail

get

public WARCRecord get(long offset)
               throws java.io.IOException
Gets the record at the given offset.

Overrides:
get in class ArchiveReader
Parameters:
offset - Byte index into file at which a record starts.
Returns:
A WARCRecord reference.
Throws:
java.io.IOException
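
Example (illustrative sketch, not part of this API): offset-based access to a compressed file is practical when each record is written as its own gzip member, so that a record's start offset is also the start of a gzip stream. The sketch below uses only java.util.zip and assumes that layout. GzipOffsetSketch, appendMember and readFrom are hypothetical names, and the code does not reproduce how CompressedWARCReader itself is implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; it only illustrates offset-based reads.
class GzipOffsetSketch {

    /** Appends text as one gzip member and returns the offset at which the member starts. */
    static long appendMember(ByteArrayOutputStream file, String text) throws IOException {
        long offset = file.size();
        GZIPOutputStream gz = new GZIPOutputStream(file);
        gz.write(text.getBytes(StandardCharsets.UTF_8));
        gz.finish(); // write the gzip trailer without closing the underlying stream
        return offset;
    }

    /**
     * Decompresses starting at the given byte offset. GZIPInputStream keeps
     * reading into any following members, so a real reader would have to
     * stop at the member boundary itself.
     */
    static String readFrom(byte[] file, long offset) throws IOException {
        int start = (int) offset;
        ByteArrayInputStream in = new ByteArrayInputStream(file, start, file.length - start);
        try (GZIPInputStream gz = new GZIPInputStream(in)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        appendMember(file, "first record\n");
        long second = appendMember(file, "second record\n");
        // Start decompressing at the second member's offset.
        System.out.print(readFrom(file.toByteArray(), second));
    }
}
```

An offset that does not fall on a member boundary would make the GZIPInputStream constructor fail, which is why the offset parameter above must point at the start of a record.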

iterator

public java.util.Iterator<ArchiveRecord> iterator()
Description copied from class: ArchiveReader
Returns an ArchiveRecord iterator. Note that on an IOException, especially a ZipException thrown while reading compressed ARCs, the iterator tries to move to the next record rather than fail the iteration. If ArchiveReader.strict is not set, this will usually succeed.

Overrides:
iterator in class ArchiveReader
Returns:
An iterator over archive records.
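
Example (illustrative sketch, not part of this API): the skip-on-error behavior described above can be pictured as a loop that catches the IOException for one record and carries on with the next, unless a strict flag is set. The sketch uses only java.util.zip. LenientIterationSketch, readAll and the strict parameter are hypothetical stand-ins, and modeling each record as a separate byte array is a simplifying assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; it only illustrates lenient versus strict iteration.
class LenientIterationSketch {

    /** Compresses text into a standalone gzip member. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    /** Decompresses one gzip member; throws ZipException if the bytes are not gzip data. */
    static String gunzip(byte[] member) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(member))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    /** Reads every member; unreadable members are skipped unless strict is true. */
    static List<String> readAll(List<byte[]> members, boolean strict) throws IOException {
        List<String> records = new ArrayList<>();
        for (byte[] member : members) {
            try {
                records.add(gunzip(member));
            } catch (IOException e) { // ZipException is a subclass of IOException
                if (strict) {
                    throw e; // strict mode: fail the whole iteration
                }
                // lenient mode: drop this record and move on to the next one
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        byte[] damaged = "not gzip data".getBytes(StandardCharsets.UTF_8);
        List<byte[]> members = Arrays.asList(gzip("a"), damaged, gzip("b"));
        System.out.println(readAll(members, false));
    }
}
```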

gotoEOR

protected void gotoEOR(ArchiveRecord rec)
                throws java.io.IOException
Description copied from class: WARCReader
Skips any trailing newlines at the end of the record so that the reader is positioned to read the next record.

Overrides:
gotoEOR in class WARCReader
Throws:
java.io.IOException
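
Example (illustrative sketch, not part of this API): skipping trailing newlines can be done by consuming CR and LF bytes until some other byte appears, then pushing that byte back so the next read starts at the following record. The sketch follows the description inherited from WARCReader on an uncompressed stream; it does not reproduce the compressed override. SkipTrailingNewlinesSketch and skipNewlines are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; it only illustrates newline skipping between records.
class SkipTrailingNewlinesSketch {

    /** Consumes CR and LF bytes and returns how many were skipped. */
    static int skipNewlines(PushbackInputStream in) throws IOException {
        int skipped = 0;
        int c;
        while ((c = in.read()) != -1) {
            if (c != '\r' && c != '\n') {
                in.unread(c); // first byte of the next record: put it back
                break;
            }
            skipped++;
        }
        return skipped;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "\r\n\r\nWARC/1.0".getBytes(StandardCharsets.US_ASCII);
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(data));
        System.out.println("skipped " + skipNewlines(in) + " newline bytes");
        System.out.println("next byte: " + (char) in.read());
    }
}
```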


Copyright © 2003-2011 Internet Archive. All Rights Reserved.