org.archive.io.warc
Class WARCReader

java.lang.Object
  extended by org.archive.io.ArchiveReader
      extended by org.archive.io.warc.WARCReader
All Implemented Interfaces:
ArchiveFileConstants, WARCConstants
Direct Known Subclasses:
WARCReaderFactory.CompressedWARCReader, WARCReaderFactory.UncompressedWARCReader

public class WARCReader
extends ArchiveReader
implements WARCConstants

WARCReader. Obtain an instance via WARCReaderFactory.

Version:
$Date: 2006-11-27 18:03:03 -0800 (Mon, 27 Nov 2006) $ $Version$
Author:
stack

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.io.ArchiveReader
ArchiveReader.ArchiveRecordIterator, ArchiveReader.RandomAccessBufferedInputStream
 
Field Summary
 
Fields inherited from class org.archive.io.ArchiveReader
MAX_ALLOWED_RECOVERABLES
 
Fields inherited from interface org.archive.io.warc.WARCConstants
COLON_SPACE, COMPRESSED_WARC_FILE_EXTENSION, CONTENT_DESCRIPTION, CONTENT_LENGTH, CONTENT_TYPE, CONTINUATION, CONTINUATION_INDEX, CONVERSION, CONVERSION_INDEX, DEFAULT_ENCODING, DEFAULT_MAX_WARC_FILE_SIZE, DOT_COMPRESSED_FILE_EXTENSION, DOT_COMPRESSED_WARC_FILE_EXTENSION, DOT_WARC_FILE_EXTENSION, FTP_CONTROL_CONVERSATION_MIMETYPE, HEADER_FIELD_KEYS, HEADER_FIELD_SEPARATOR, HEADER_KEY_BLOCK_DIGEST, HEADER_KEY_CONCURRENT_TO, HEADER_KEY_DATE, HEADER_KEY_ETAG, HEADER_KEY_FILENAME, HEADER_KEY_ID, HEADER_KEY_IP, HEADER_KEY_LAST_MODIFIED, HEADER_KEY_PAYLOAD_DIGEST, HEADER_KEY_PROFILE, HEADER_KEY_TRUNCATED, HEADER_KEY_TYPE, HEADER_KEY_URI, HEADER_LINE_ENCODING, HTTP_REQUEST_MIMETYPE, HTTP_RESPONSE_MIMETYPE, MAX_LINE_LENGTH, MAX_WARC_HEADER_LINE_LENGTH, METADATA, METADATA_INDEX, NAMED_FIELD_CHECKSUM_LABEL, NAMED_FIELD_DESCRIPTION, NAMED_FIELD_FILEDESC, NAMED_FIELD_IP_LABEL, NAMED_FIELD_RELATED_LABEL, NAMED_FIELD_TRUNCATED, NAMED_FIELD_TRUNCATED_VALUE_HEAD, NAMED_FIELD_TRUNCATED_VALUE_LENGTH, NAMED_FIELD_TRUNCATED_VALUE_TIME, NAMED_FIELD_TRUNCATED_VALUE_UNSPECIFIED, NAMED_FIELD_WARCFILENAME, PLACEHOLDER_RECORD_LENGTH_STRING, PROFILE_REVISIT_IDENTICAL_DIGEST, PROFILE_REVISIT_NOT_MODIFIED, REQUEST, REQUEST_INDEX, RESOURCE, RESOURCE_INDEX, RESPONSE, RESPONSE_INDEX, REVISIT, REVISIT_INDEX, TRUNCATED_VALUE_UNSPECIFIED, TYPE, TYPES, TYPES_LIST, WARC_010_ID, WARC_010_MAGIC, WARC_FILE_EXTENSION, WARC_HEADER_ENCODING, WARC_ID, WARC_MAGIC, WARC_VERSION, WARCINFO, WARCINFO_INDEX, WSP
 
Fields inherited from interface org.archive.io.ArchiveFileConstants
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, COMPRESSED_FILE_EXTENSION, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY
 
Constructor Summary
WARCReader()
           
 
Method Summary
protected  WARCRecord createArchiveRecord(java.io.InputStream is, long offset)
          Create new WARC record.
static void createCDXIndexFile(java.lang.String urlOrPath)
          Generate a CDX index file for a WARC file.
 void dump(boolean compress)
          Dump this file to STDOUT.
 ArchiveReader getDeleteFileOnCloseReader(java.io.File f)
           
 java.lang.String getDotFileExtension()
           
 java.lang.String getFileExtension()
           
protected  void gotoEOR(ArchiveRecord record)
          Skip over any trailing newlines at the end of the record so the reader is positioned to read the next one.
protected  void initialize(java.lang.String i)
          Convenience method used by subclass constructors.
static void main(java.lang.String[] args)
          Command-line interface to WARCReader.
protected static void output(WARCReader reader, java.lang.String format)
          Write out the WARC file.
protected  void readExpectedChar(java.io.InputStream is, int expected)
           
 
Methods inherited from class org.archive.io.ArchiveReader
cdxOutput, cleanupCurrentRecord, close, currentRecord, get, get, getCurrentRecord, getFileName, getIn, getInputStream, getInputStream, getLogger, getOptions, getReaderIdentifier, getStrippedFileName, getStrippedFileName, getTrueOrFalse, getVersion, isCompressed, isDigest, isStrict, isValid, iterator, logStdErr, output, outputRecord, outputRecord, rewind, setCompressed, setDigest, setIn, setReaderIdentifier, setStrict, setVersion, stripExtension, validate, validate
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WARCReader

WARCReader()
Method Detail

initialize

protected void initialize(java.lang.String i)
Description copied from class: ArchiveReader
Convenience method used by subclass constructors.

Overrides:
initialize in class ArchiveReader
Parameters:
i - Identifier of the archive file this reader reads from.

gotoEOR

protected void gotoEOR(ArchiveRecord record)
                throws java.io.IOException
Skip over any trailing newlines at the end of the record so the reader is positioned to read the next one.

Specified by:
gotoEOR in class ArchiveReader
Parameters:
record -
Throws:
java.io.IOException
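
The skipping behavior described above can be illustrated with a minimal, self-contained sketch. It is not the WARCReader implementation: the class name TrailingNewlineSkipper, the helper skipTrailingNewlines, and the use of a PushbackInputStream are all assumptions made for illustration. The sketch consumes CR and LF bytes and pushes back the first byte that is neither, so the stream is left at the start of whatever follows.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of org.archive.io.warc.
final class TrailingNewlineSkipper {

    /**
     * Consumes CR and LF bytes until another byte or end of stream is seen.
     * The first non-newline byte is pushed back so the caller reads it next.
     *
     * @return the number of newline bytes skipped
     */
    static int skipTrailingNewlines(PushbackInputStream in) throws IOException {
        int skipped = 0;
        int c;
        while ((c = in.read()) != -1) {
            if (c != '\r' && c != '\n') {
                in.unread(c); // leave the stream at the start of the next record
                break;
            }
            skipped++;
        }
        return skipped;
    }

    public static void main(String[] args) throws IOException {
        // Two CRLF pairs followed by the start of some further data.
        byte[] data = "\r\n\r\nWARC".getBytes(StandardCharsets.US_ASCII);
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(data));
        int skipped = skipTrailingNewlines(in);
        System.out.println(skipped + " newline bytes skipped, next byte: " + (char) in.read());
    }
}
```

A pushback stream is used here only because the sketch has to look one byte ahead; how the real reader detects the end of the trailing newlines is not stated in this documentation.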

readExpectedChar

protected void readExpectedChar(java.io.InputStream is,
                                int expected)
                         throws java.io.IOException
Throws:
java.io.IOException
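
This method carries no description, so the following is only a guess at its contract based on the name and signature: read one byte and fail with an IOException if it is not the expected one. The class name ExpectedCharReader and the exception messages are hypothetical; the sketch is self-contained and does not use any org.archive classes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in illustrating one plausible contract for readExpectedChar.
final class ExpectedCharReader {

    /** Reads a single byte and throws if it differs from the expected value. */
    static void readExpectedChar(InputStream is, int expected) throws IOException {
        int c = is.read();
        if (c == -1) {
            throw new IOException("Unexpected end of stream, expected 0x"
                    + Integer.toHexString(expected));
        }
        if (c != expected) {
            throw new IOException("Unexpected character 0x" + Integer.toHexString(c)
                    + ", expected 0x" + Integer.toHexString(expected));
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream is = new ByteArrayInputStream(new byte[] {'\r', '\n'});
        readExpectedChar(is, '\r'); // matches, returns normally
        readExpectedChar(is, '\n'); // matches, returns normally
        System.out.println("CRLF read as expected");
    }
}
```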

createArchiveRecord

protected WARCRecord createArchiveRecord(java.io.InputStream is,
                                         long offset)
                                  throws java.io.IOException
Create a new WARC record. Encapsulates the housekeeping involved in creating a new record.

Specified by:
createArchiveRecord in class ArchiveReader
Parameters:
is - InputStream to use.
offset - Absolute offset into WARC file.
Returns:
A WARCRecord.
Throws:
java.io.IOException

dump

public void dump(boolean compress)
          throws java.io.IOException,
                 java.text.ParseException
Description copied from class: ArchiveReader
Dump this file to STDOUT.

Specified by:
dump in class ArchiveReader
Throws:
java.io.IOException
java.text.ParseException

getDeleteFileOnCloseReader

public ArchiveReader getDeleteFileOnCloseReader(java.io.File f)
Specified by:
getDeleteFileOnCloseReader in class ArchiveReader
Returns:
An ArchiveReader that deletes the local file on close. Used when archive files are copied locally and need to be cleaned up afterward.

getDotFileExtension

public java.lang.String getDotFileExtension()
Specified by:
getDotFileExtension in class ArchiveReader

getFileExtension

public java.lang.String getFileExtension()
Specified by:
getFileExtension in class ArchiveReader

output

protected static void output(WARCReader reader,
                             java.lang.String format)
                      throws java.io.IOException,
                             java.text.ParseException
Write out the WARC file.

Parameters:
reader -
format - Format to use outputting.
Throws:
java.io.IOException
java.text.ParseException

createCDXIndexFile

public static void createCDXIndexFile(java.lang.String urlOrPath)
                               throws java.io.IOException,
                                      java.text.ParseException
Generate a CDX index file for a WARC file.

Parameters:
urlOrPath - The WARC file to generate a CDX index for
Throws:
java.io.IOException
java.text.ParseException

main

public static void main(java.lang.String[] args)
                 throws org.apache.commons.cli.ParseException,
                        java.io.IOException,
                        java.text.ParseException
Command-line interface to WARCReader:
 usage: java org.archive.io.warc.WARCReader [--offset=#] WARCFILE
  -h,--help      Prints this message and exits.
  -o,--offset    Outputs record at this offset into WARC file.

Outputs using a pseudo-CDX format as described in the CDX Legend and its Example. The legend used below is: 'CDX b e a m s c V (or v if uncompressed) n g'. The hash is hard-coded as a plain SHA-1 hash of the content.

Parameters:
args - Command-line arguments.
Throws:
org.apache.commons.cli.ParseException - Failed parse of the command line.
java.io.IOException
java.text.ParseException
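
As an illustration of the "plain SHA-1 hash of the content" mentioned above, here is a minimal sketch that uses only the JDK. The class name ContentSha1 and the lowercase hexadecimal rendering are assumptions: this documentation does not say how WARCReader encodes the digest in its output, so the real tool may print it differently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of org.archive.io.warc.
final class ContentSha1 {

    /** Returns the SHA-1 digest of the content as lowercase hex (encoding is an assumption). */
    static String sha1Hex(byte[] content) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform implementation.
            throw new IllegalStateException(e);
        }
        byte[] digest = md.digest(content);
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b & 0xff)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] content = "abc".getBytes(StandardCharsets.US_ASCII);
        // Prints a9993e364706816aba3e25717850c26c9cd0d89d (the standard SHA-1 test vector for "abc").
        System.out.println(sha1Hex(content));
    }
}
```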


Copyright © 2003-2011 Internet Archive. All Rights Reserved.