org.archive.io
Class ArchiveReader

java.lang.Object
  extended by org.archive.io.ArchiveReader
All Implemented Interfaces:
ArchiveFileConstants
Direct Known Subclasses:
ARCReader, WARCReader

public abstract class ArchiveReader
extends java.lang.Object
implements ArchiveFileConstants

Reader for an Archive file of ArchiveRecords.

Version:
$Date: 2010-04-26 21:49:27 +0000 (Mon, 26 Apr 2010) $ $Version$
Author:
stack

Nested Class Summary
protected  class ArchiveReader.ArchiveRecordIterator
          Inner ArchiveRecord Iterator class.
protected  class ArchiveReader.RandomAccessBufferedInputStream
          Add buffering to RandomAccessInputStream.
 
Field Summary
static int MAX_ALLOWED_RECOVERABLES
          Maximum number of recoverable exceptions in a row.
 
Fields inherited from interface org.archive.io.ArchiveFileConstants
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, COMPRESSED_FILE_EXTENSION, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DOT_COMPRESSED_FILE_EXTENSION, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY
 
Constructor Summary
protected ArchiveReader()
           
 
Method Summary
protected  void cdxOutput(boolean toFile)
           
protected  void cleanupCurrentRecord()
          Clean out the current record, if there is one.
 void close()
           
protected abstract  ArchiveRecord createArchiveRecord(java.io.InputStream is, long offset)
          Return an Archive Record positioned at offset in the stream is.
protected  ArchiveRecord currentRecord(ArchiveRecord currentRecord)
           
abstract  void dump(boolean compress)
          Dump this file to STDOUT.
 ArchiveRecord get()
           
 ArchiveRecord get(long offset)
          Get record at passed offset.
protected  ArchiveRecord getCurrentRecord()
           
abstract  ArchiveReader getDeleteFileOnCloseReader(java.io.File f)
           
abstract  java.lang.String getDotFileExtension()
           
abstract  java.lang.String getFileExtension()
           
 java.lang.String getFileName()
           
protected  java.io.InputStream getIn()
           
protected  java.io.InputStream getInputStream()
           
protected  java.io.InputStream getInputStream(java.io.File f, long offset)
          Convenience method for constructors.
protected  java.util.logging.Logger getLogger()
           
protected static org.apache.commons.cli.Options getOptions()
           
 java.lang.String getReaderIdentifier()
           
 java.lang.String getStrippedFileName()
           
static java.lang.String getStrippedFileName(java.lang.String name, java.lang.String dotFileExtension)
           
protected static boolean getTrueOrFalse(java.lang.String value)
           
 java.lang.String getVersion()
           
protected abstract  void gotoEOR(ArchiveRecord record)
          Skip over any trailing newlines at the end of the record so that we're lined up, ready to read the next.
protected  void initialize(java.lang.String i)
          Convenience method used by subclass constructors.
 boolean isCompressed()
           
 boolean isDigest()
           
 boolean isStrict()
           
 boolean isValid()
          Test Archive file is valid.
 java.util.Iterator<ArchiveRecord> iterator()
          Returns an ArchiveRecord iterator.
 void logStdErr(java.util.logging.Level level, java.lang.String message)
          Log on stderr.
protected  boolean output(java.lang.String format)
           
protected static void outputRecord(ArchiveReader r, java.lang.String format)
          Output passed record using passed format specifier.
 boolean outputRecord(java.lang.String format)
          Output passed record using passed format specifier.
protected  void rewind()
          Rewinds stream to start of the Archive file.
protected  void setCompressed(boolean compressed)
           
 void setDigest(boolean d)
           
protected  void setIn(java.io.InputStream in)
           
protected  void setReaderIdentifier(java.lang.String i)
           
 void setStrict(boolean s)
           
protected  void setVersion(java.lang.String version)
           
protected static java.lang.String stripExtension(java.lang.String name, java.lang.String ext)
           
 java.util.List<ArchiveRecordHeader> validate()
          Validate the Archive file.
 java.util.List<ArchiveRecordHeader> validate(int numRecords)
          Validate the Archive file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MAX_ALLOWED_RECOVERABLES

public static final int MAX_ALLOWED_RECOVERABLES
Maximum number of recoverable exceptions in a row. If more than this number occur in a row, the exception is let out rather than going back in for yet another retry.
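
The cap described above can be illustrated with a small, self-contained sketch. It is not ArchiveReader's code: RecoverableCapSketch, RecordSource, readAll, of, and the cap value of 10 are all hypothetical stand-ins (the real value is listed under Constant Field Values). It only shows the "count failures in a row, reset on success, give up past the cap" pattern.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a cap on consecutive recoverable failures. */
final class RecoverableCapSketch {
    /** Assumed value, for illustration only. */
    static final int MAX_ALLOWED_RECOVERABLES = 10;

    /** Hypothetical stand-in for a source of records whose reads may fail. */
    interface RecordSource {
        boolean hasMore();
        String read() throws IOException;
    }

    /** Reads every record, skipping failures until too many occur in a row. */
    static List<String> readAll(RecordSource source) throws IOException {
        List<String> records = new ArrayList<>();
        int recoverablesInARow = 0;
        while (source.hasMore()) {
            try {
                records.add(source.read());
                recoverablesInARow = 0; // a successful read resets the streak
            } catch (IOException e) {
                if (++recoverablesInARow > MAX_ALLOWED_RECOVERABLES) {
                    throw e; // let the exception out rather than retry again
                }
            }
        }
        return records;
    }

    /** Builds a source from fixed outcomes; a null entry simulates a failed read. */
    static RecordSource of(String... outcomes) {
        return new RecordSource() {
            private int next = 0;

            public boolean hasMore() {
                return next < outcomes.length;
            }

            public String read() throws IOException {
                String outcome = outcomes[next++];
                if (outcome == null) {
                    throw new IOException("simulated bad record");
                }
                return outcome;
            }
        };
    }
}
```

With these assumptions, a single bad record between two good ones is skipped, while eleven bad records in a row cause the eleventh IOException to propagate to the caller.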

See Also:
Constant Field Values
Constructor Detail

ArchiveReader

protected ArchiveReader()
Method Detail

initialize

protected void initialize(java.lang.String i)
Convenience method used by subclass constructors.

Parameters:
i - Identifier for the Archive file this reader reads.

getInputStream

protected java.io.InputStream getInputStream(java.io.File f,
                                             long offset)
                                      throws java.io.IOException
Convenience method for constructors.

Parameters:
f - File to read.
offset - Offset at which to start reading.
Returns:
InputStream to read from.
Throws:
java.io.IOException - If the file cannot be opened, or if a memory-mapped byte buffer cannot be obtained on the file.

isCompressed

public boolean isCompressed()

get

public ArchiveRecord get(long offset)
                  throws java.io.IOException
Get record at passed offset.

Parameters:
offset - Byte index into file at which a record starts.
Returns:
An Archive Record reference.
Throws:
java.io.IOException
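
As a rough illustration of reading a record at a byte offset, the sketch below seeks into a file and reads from there. It is not how ArchiveReader is implemented: OffsetReadSketch and recordAt are hypothetical names, and a "record" is assumed to be a single newline-terminated line, which is far simpler than a real ARC or WARC record.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical sketch: fetch one newline-terminated "record" at a byte offset. */
final class OffsetReadSketch {
    /** Seeks to offset, which must be the first byte of a record, and reads it. */
    static String recordAt(File f, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            if (offset < 0 || offset >= raf.length()) {
                throw new IOException("Offset " + offset + " is outside the file");
            }
            raf.seek(offset);
            return raf.readLine(); // reads up to, but not including, the line terminator
        }
    }
}
```

The caller is responsible for passing an offset at which a record actually starts; an offset in the middle of a record would yield a partial one.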

get

public ArchiveRecord get()
                  throws java.io.IOException
Returns:
Archive Record created against the current offset.
Throws:
java.io.IOException

close

public void close()
           throws java.io.IOException
Throws:
java.io.IOException

rewind

protected void rewind()
               throws java.io.IOException
Rewinds stream to start of the Archive file.

Throws:
java.io.IOException - if stream is not resettable.

cleanupCurrentRecord

protected void cleanupCurrentRecord()
                             throws java.io.IOException
Clean out the current record, if there is one.

Throws:
java.io.IOException

createArchiveRecord

protected abstract ArchiveRecord createArchiveRecord(java.io.InputStream is,
                                                     long offset)
                                              throws java.io.IOException
Return an Archive Record positioned at offset in the stream is.

Parameters:
is - Stream to read Record from.
offset - Offset to find Record at.
Returns:
ArchiveRecord instance.
Throws:
java.io.IOException

gotoEOR

protected abstract void gotoEOR(ArchiveRecord record)
                         throws java.io.IOException
Skip over any trailing newlines at the end of the record so that we're lined up, ready to read the next.

Parameters:
record - Record whose end to skip to.
Throws:
java.io.IOException

getFileExtension

public abstract java.lang.String getFileExtension()

getDotFileExtension

public abstract java.lang.String getDotFileExtension()

getVersion

public java.lang.String getVersion()
Returns:
Version of this Archive file.

validate

public java.util.List<ArchiveRecordHeader> validate()
                                             throws java.io.IOException
Validate the Archive file. This method iterates over the file, throwing an exception if it fails to parse any record.

Assumes the stream is at the start of the file.

Returns:
List of all read Archive Headers.
Throws:
java.io.IOException

validate

public java.util.List<ArchiveRecordHeader> validate(int numRecords)
                                             throws java.io.IOException
Validate the Archive file. This method iterates over the file, throwing an exception if it fails to parse a record.

We start validation from wherever we are in the stream.

Parameters:
numRecords - Number of records expected. Pass -1 if number is unknown.
Returns:
List of all metadata read. As records are validated, a reference to each record's metadata is added to the list.
Throws:
java.io.IOException
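
The contract above (collect each header while iterating, then check the count unless -1 was passed) can be sketched as follows. This is an illustration under stated assumptions, not the class's implementation: ValidateSketch and UNKNOWN_COUNT are hypothetical names, headers are modelled as plain strings, and parse failures are assumed to surface from the iterator itself.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of "validate and collect headers" with an optional count check. */
final class ValidateSketch {
    /** Mirrors the documented "pass -1 if number is unknown" convention. */
    static final int UNKNOWN_COUNT = -1;

    /** Walks the headers from the current position and checks the expected count. */
    static List<String> validate(Iterator<String> headers, int numRecords) throws IOException {
        List<String> read = new ArrayList<>();
        while (headers.hasNext()) {
            read.add(headers.next()); // keep a reference to each header as it is read
        }
        if (numRecords != UNKNOWN_COUNT && read.size() != numRecords) {
            throw new IOException(
                "Expected " + numRecords + " records but read " + read.size());
        }
        return read;
    }
}
```

Passing UNKNOWN_COUNT skips the count check entirely, so only a failure while iterating would be reported.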

isValid

public boolean isValid()
Test Archive file is valid. Assumes the stream is at the start of the file. Be aware that this method makes a pass over the whole file.

Returns:
True if file can be successfully parsed.

isStrict

public boolean isStrict()
Returns:
True if strict mode is set.

setStrict

public void setStrict(boolean s)
Parameters:
s - The strict setting to apply.

setDigest

public void setDigest(boolean d)
Parameters:
d - True if we're to digest.

isDigest

public boolean isDigest()
Returns:
True if we're digesting as we read.

getLogger

protected java.util.logging.Logger getLogger()

getInputStream

protected java.io.InputStream getInputStream()

iterator

public java.util.Iterator<ArchiveRecord> iterator()
Returns an ArchiveRecord iterator. Note that on an IOException, especially a ZipException while reading compressed ARCs, the iterator tries to move on to the next record rather than fail the iteration. If strict is not set, this will usually succeed.

Returns:
An iterator over Archive records.
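
The "skip a bad record unless strict" behaviour can be sketched with a generic wrapper iterator. Everything here is hypothetical and simplified: LenientIteratorSketch is not part of this package, raw records are modelled as strings, the parser is any function that may throw a RuntimeException, and a parser that returns null is treated as having produced no record.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Function;

/** Hypothetical sketch: iterate parsed records, skipping unparseable ones unless strict. */
final class LenientIteratorSketch<T> implements Iterator<T> {
    private final Iterator<String> raw;
    private final Function<String, T> parser;
    private final boolean strict;
    private T lookahead; // next successfully parsed record, or null if none is buffered

    LenientIteratorSketch(Iterator<String> raw, Function<String, T> parser, boolean strict) {
        this.raw = raw;
        this.parser = parser;
        this.strict = strict;
    }

    @Override
    public boolean hasNext() {
        while (lookahead == null && raw.hasNext()) {
            String chunk = raw.next();
            try {
                lookahead = parser.apply(chunk);
            } catch (RuntimeException e) {
                if (strict) {
                    throw e; // strict: fail the iteration
                }
                // not strict: drop the bad record and try the next one
            }
        }
        return lookahead != null;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = lookahead;
        lookahead = null;
        return result;
    }
}
```

For example, wrapping the inputs "1", "x", "3" with Integer::parseInt as the parser yields 1 and 3 when strict is false, and throws a NumberFormatException at "x" when strict is true.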

setCompressed

protected void setCompressed(boolean compressed)

getCurrentRecord

protected ArchiveRecord getCurrentRecord()
Returns:
The current ARC record, or null if there is none. Immediately after construction, this is the arcfile header record.
See Also:
get()

currentRecord

protected ArchiveRecord currentRecord(ArchiveRecord currentRecord)

getIn

protected java.io.InputStream getIn()

setIn

protected void setIn(java.io.InputStream in)

setVersion

protected void setVersion(java.lang.String version)

getReaderIdentifier

public java.lang.String getReaderIdentifier()

setReaderIdentifier

protected void setReaderIdentifier(java.lang.String i)

logStdErr

public void logStdErr(java.util.logging.Level level,
                      java.lang.String message)
Log on stderr. Logging should normally go via the logging system; this method bypasses it and writes directly to stderr, so it should not generally be used. It is used for the rare ERROR and WARNING messages that arise from command-line usage of ARCReader. Override it if using ARCReader in a context where there is no stderr, or where you'd like to redirect output to somewhere other than System.err.

Parameters:
level - Level to log message at.
message - Message to log.

stripExtension

protected static java.lang.String stripExtension(java.lang.String name,
                                                 java.lang.String ext)

getFileName

public java.lang.String getFileName()
Returns:
short name of Archive file.

getStrippedFileName

public java.lang.String getStrippedFileName()
Returns:
short name of Archive file.

getStrippedFileName

public static java.lang.String getStrippedFileName(java.lang.String name,
                                                   java.lang.String dotFileExtension)
Parameters:
name - Name of Archive file.
dotFileExtension - '.arc' or '.warc', etc.
Returns:
short name of Archive file.
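
A minimal sketch of what stripping a file name might look like is given below. It rests on assumptions the page does not state: that the compressed-file suffix is ".gz", that it is removed before the format extension, and that a name without the given extension is returned unchanged. StripNameSketch, strippedFileName, and DOT_COMPRESSED are hypothetical names.

```java
/** Hypothetical sketch of reducing an Archive file name to its short form. */
final class StripNameSketch {
    /** Assumed compressed-file suffix; the real constant lives in ArchiveFileConstants. */
    static final String DOT_COMPRESSED = ".gz";

    /** Removes ext from the end of name, or returns name unchanged if it is absent. */
    static String stripExtension(String name, String ext) {
        return name.endsWith(ext)
            ? name.substring(0, name.length() - ext.length())
            : name;
    }

    /** Strips the assumed compression suffix first, then the format extension. */
    static String strippedFileName(String name, String dotFileExtension) {
        return stripExtension(stripExtension(name, DOT_COMPRESSED), dotFileExtension);
    }
}
```

Under these assumptions, strippedFileName("crawl-001.arc.gz", ".arc") and strippedFileName("crawl-001.warc", ".warc") both give "crawl-001".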

getTrueOrFalse

protected static boolean getTrueOrFalse(java.lang.String value)
Parameters:
value - Value to test.
Returns:
True if value is 'true', else false.
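
One way such a test could be written is sketched below. The page does not say how null or differently cased input is treated, so the null-safe, case-insensitive behaviour of Boolean.parseBoolean is an assumption here, and TrueOrFalseSketch is a hypothetical name.

```java
/** Hypothetical sketch of a "true only if the value is 'true'" option test. */
final class TrueOrFalseSketch {
    /** Returns true only for a non-null value equal to "true", ignoring case. */
    static boolean trueOrFalse(String value) {
        return Boolean.parseBoolean(value);
    }
}
```

Any other value, such as "yes" or "1", is treated as false rather than rejected.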

output

protected boolean output(java.lang.String format)
                  throws java.io.IOException,
                         java.text.ParseException
Parameters:
format - Format to use outputting.
Returns:
True if handled.
Throws:
java.io.IOException
java.text.ParseException

cdxOutput

protected void cdxOutput(boolean toFile)
                  throws java.io.IOException
Throws:
java.io.IOException

outputRecord

public boolean outputRecord(java.lang.String format)
                     throws java.io.IOException
Output passed record using passed format specifier.

Parameters:
format - What format to use outputting.
Returns:
True if handled.
Throws:
java.io.IOException

dump

public abstract void dump(boolean compress)
                   throws java.io.IOException,
                          java.text.ParseException
Dump this file to STDOUT.

Parameters:
compress - True if dumped output is compressed.
Throws:
java.io.IOException
java.text.ParseException

getDeleteFileOnCloseReader

public abstract ArchiveReader getDeleteFileOnCloseReader(java.io.File f)
Returns:
an ArchiveReader that will delete a local file on close. Used when we bring Archive files local and need to clean up afterward.
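
The delete-on-close idea can be sketched with a small wrapper around a local file. This is only an illustration of the pattern, not the reader that subclasses return: DeleteOnCloseSketch is a hypothetical class, and it reads raw bytes rather than Archive records.

```java
import java.io.Closeable;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;

/** Hypothetical sketch: a reader over a local copy that removes the copy on close. */
final class DeleteOnCloseSketch implements Closeable {
    private final File file;
    private final InputStream in;

    DeleteOnCloseSketch(File file) throws IOException {
        this.file = file;
        this.in = new FileInputStream(file);
    }

    /** Reads the next byte, or returns -1 at end of file. */
    int read() throws IOException {
        return in.read();
    }

    @Override
    public void close() throws IOException {
        try {
            in.close();
        } finally {
            // Clean up the local copy even if closing the stream failed.
            Files.deleteIfExists(file.toPath());
        }
    }
}
```

Because the class implements Closeable, it can be used in a try-with-resources statement so that the local file is removed as soon as reading finishes.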

outputRecord

protected static void outputRecord(ArchiveReader r,
                                   java.lang.String format)
                            throws java.io.IOException
Output passed record using passed format specifier.

Parameters:
r - ArchiveReader instance to output.
format - What format to use outputting.
Throws:
java.io.IOException

getOptions

protected static org.apache.commons.cli.Options getOptions()
Returns:
Base Options object filled out with help, digest, strict, etc. options.


Copyright © 2003-2011 Internet Archive. All Rights Reserved.