java.lang.Object
  org.archive.io.ArchiveReader
    org.archive.io.arc.ARCReader
public abstract class ARCReader
Get an iterator on an ARC file or get a record by absolute position. ARC files are described here: Arc File Format.
This class knows how to parse an ARC file. Pass it a file path or a URL to an ARC. It can parse ARC versions 1 and 2.
The iterator returns ARCRecord instances, although Iterator.next() is declared to return java.lang.Object, so cast the return value.
Profiling java.io against a memory-mapped ByteBufferInputStream shows the latter to be slightly slower, but not by much. TODO: Test more. To switch implementations, just change ArchiveReader.getInputStream(File, long).
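The cast described above can be sketched as follows. This is a minimal, self-contained illustration, not the library's code: RecordStub is a hypothetical stand-in for ARCRecord, and collectUrls is a hypothetical helper. It only shows the pattern of walking an iterator whose next() is typed as java.lang.Object and casting each element.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for ARCRecord; only a URL is modeled here.
final class RecordStub {
    final String url;

    RecordStub(String url) {
        this.url = url;
    }
}

final class IteratorCastSketch {
    // Walks an iterator whose next() is typed as Object and casts each element
    // to the concrete record type before using it.
    static List<String> collectUrls(Iterator<Object> it) {
        List<String> urls = new ArrayList<>();
        while (it.hasNext()) {
            RecordStub record = (RecordStub) it.next(); // cast the return
            urls.add(record.url);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<Object> records = new ArrayList<>();
        records.add(new RecordStub("http://example.com/"));
        records.add(new RecordStub("http://example.org/"));
        System.out.println(collectUrls(records.iterator()));
    }
}
```

With a real reader, the element type to cast to would be ARCRecord rather than the stub used here.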
Nested Class Summary |
---|
Nested classes/interfaces inherited from class org.archive.io.ArchiveReader |
---|
ArchiveReader.ArchiveRecordIterator, ArchiveReader.RandomAccessBufferedInputStream |
Field Summary | |
---|---|
static java.lang.String[] | HEADER_FIELD_NAME_KEYS: An array of the header field names found in the ARC file header on the 3rd line. |
(package private) java.util.logging.Logger | logger |
Fields inherited from class org.archive.io.ArchiveReader |
---|
MAX_ALLOWED_RECOVERABLES |
Fields inherited from interface org.archive.io.ArchiveFileConstants |
---|
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, COMPRESSED_FILE_EXTENSION, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY |
Constructor Summary | |
---|---|
ARCReader() |
Method Summary | |
---|---|
protected ARCRecord | createArchiveRecord(java.io.InputStream is, long offset): Create new arc record. |
static void | createCDXIndexFile(java.lang.String urlOrPath): Generate a CDX index file for an ARC file. |
void | dump(boolean compress): Dump this file on STDOUT. |
protected java.util.List<java.lang.String> | fixSpaceInURL(java.util.List<java.lang.String> values, int requiredSize): Fix space in URLs. |
ARCReader | getDeleteFileOnCloseReader(java.io.File f) |
java.lang.String | getDotFileExtension() |
java.lang.String | getFileExtension() |
java.lang.String | getVersion(): Returns version of this ARC file. |
protected void | gotoEOR(ArchiveRecord record): Skip over any trailing new lines at end of the record so we're lined up ready to read the next. |
protected boolean | isAlignedOnFirstRecord() |
protected boolean | isDate(java.lang.String date) |
protected boolean | isLegitimateIPValue(java.lang.String ip) |
protected boolean | isNumber(java.lang.String n) |
boolean | isParseHttpHeaders() |
static void | main(java.lang.String[] args): Command-line interface to ARCReader. |
protected static void | output(ARCReader reader, java.lang.String format): Write out the arcfile. |
protected boolean | output(java.lang.String format) |
boolean | outputRecord(java.lang.String format): Output passed record using passed format specifier. |
protected void | setAlignedOnFirstRecord(boolean alignedOnFirstRecord) |
void | setParseHttpHeaders(boolean parse) |
Methods inherited from class org.archive.io.ArchiveReader |
---|
cdxOutput, cleanupCurrentRecord, close, currentRecord, get, get, getCurrentRecord, getFileName, getIn, getInputStream, getInputStream, getLogger, getOptions, getReaderIdentifier, getStrippedFileName, getStrippedFileName, getTrueOrFalse, initialize, isCompressed, isDigest, isStrict, isValid, iterator, logStdErr, outputRecord, rewind, setCompressed, setDigest, setIn, setReaderIdentifier, setStrict, setVersion, stripExtension, validate, validate |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
java.util.logging.Logger logger
public static final java.lang.String[] HEADER_FIELD_NAME_KEYS
Constructor Detail |
---|
ARCReader()
Method Detail |
---|
protected void gotoEOR(ArchiveRecord record) throws java.io.IOException
gotoEOR in class ArchiveReader
Parameters: record
Throws: java.io.IOException
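The method summary describes gotoEOR as skipping trailing new lines so that the reader is lined up on the next record. A minimal sketch of that idea follows. It assumes a PushbackInputStream and treats only '\n' as a record-trailing newline; the class and method names are hypothetical, and the real implementation may behave differently.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class SkipNewlinesSketch {
    // Consumes consecutive '\n' bytes and returns how many were skipped.
    // The first byte that is not a newline is pushed back onto the stream.
    static int skipTrailingNewlines(PushbackInputStream in) throws IOException {
        int skipped = 0;
        int c;
        while ((c = in.read()) != -1) {
            if (c != '\n') {
                in.unread(c);
                break;
            }
            skipped++;
        }
        return skipped;
    }

    // Convenience wrapper for in-memory data.
    // Returns {newlines skipped, next byte or -1 at end of data}.
    static int[] skipAndPeek(byte[] data) {
        try (PushbackInputStream in =
                 new PushbackInputStream(new ByteArrayInputStream(data))) {
            int skipped = skipTrailingNewlines(in);
            return new int[] { skipped, in.read() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "\n\nhttp://example.com/".getBytes(StandardCharsets.US_ASCII);
        int[] result = skipAndPeek(data);
        System.out.println(result[0] + " newline(s) skipped, next byte: " + (char) result[1]);
    }
}
```

Pushing the first non-newline byte back keeps the stream positioned exactly at the start of the next record header.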
protected ARCRecord createArchiveRecord(java.io.InputStream is, long offset) throws java.io.IOException
Call this method at the end of the constructor to read in the arcfile header. There will be problems reading subsequent ARC records if you don't, since the arcfile header has the list of metadata fields for all records that follow.
When parsing through ARCs writing out CDX info, we spend about 38% of CPU in here, about 30% of which is in getTokenizedHeaderLine, of which 16% is reading.
createArchiveRecord in class ArchiveReader
Parameters:
is - InputStream to use.
offset - Absolute offset into arc file.
Throws: java.io.IOException
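To illustrate why the arcfile header must be read first, here is a minimal sketch of pairing the field names from the file header with the tokens of one record header line. The field-name line is an assumption based on the ARC version 1 format, and the class and method names are hypothetical; this is not the parsing code used by ARCReader.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ArcHeaderSketch {
    // Assumed field-name line from an ARC version 1 file header (third line).
    static final String V1_FIELD_NAMES =
        "URL IP-address Archive-date Content-type Archive-length";

    // Pairs each field name with the token at the same position in the
    // record header line. Insertion order of the field names is kept.
    static Map<String, String> parseHeaderLine(String fieldNamesLine,
                                               String recordHeaderLine) {
        String[] names = fieldNamesLine.trim().split(" ");
        String[] values = recordHeaderLine.trim().split(" ");
        if (names.length != values.length) {
            throw new IllegalArgumentException(
                "Expected " + names.length + " fields but found " + values.length);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            fields.put(names[i], values[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = parseHeaderLine(V1_FIELD_NAMES,
            "http://example.com/ 192.0.2.1 20040102030405 text/html 1234");
        System.out.println(fields);
    }
}
```

Without the field-name line there is no way to know which token is which, which is the problem the description above warns about.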
public java.lang.String getVersion()
getVersion in class ArchiveReader
protected boolean isDate(java.lang.String date)
protected boolean isNumber(java.lang.String n)
protected boolean isLegitimateIPValue(java.lang.String ip)
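This page gives no description for the three checks above. The sketch below is one plausible reading of their names, under stated assumptions: a date is a 14-digit timestamp, a number is a non-empty run of ASCII digits, and a legitimate IP value is a dotted-quad IPv4 address. The real methods may accept other values.

```java
final class ArcFieldChecksSketch {
    // Assumption: a number is a non-empty string of ASCII digits.
    static boolean isNumber(String n) {
        if (n == null || n.isEmpty()) {
            return false;
        }
        for (int i = 0; i < n.length(); i++) {
            char c = n.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    // Assumption: a date is a 14-digit timestamp such as 20040102030405.
    static boolean isDate(String date) {
        return date != null && date.length() == 14 && isNumber(date);
    }

    // Assumption: a legitimate IP value is four dot-separated numbers in 0..255.
    static boolean isLegitimateIPValue(String ip) {
        if (ip == null) {
            return false;
        }
        String[] parts = ip.split("\\.", -1);
        if (parts.length != 4) {
            return false;
        }
        for (String part : parts) {
            if (!isNumber(part) || part.length() > 3 || Integer.parseInt(part) > 255) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isDate("20040102030405"));
        System.out.println(isLegitimateIPValue("192.0.2.1"));
    }
}
```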
protected java.util.List<java.lang.String> fixSpaceInURL(java.util.List<java.lang.String> values, int requiredSize)
Parameters:
values - List of metadata values.
requiredSize - Expected size of resultant values list.
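A minimal sketch of what fixing a space in a URL could look like follows. It assumes the URL is the first metadata field, that any surplus tokens belong to the URL, and that the pieces are rejoined with "%20". These are illustrative assumptions, and the actual method may apply extra validity checks before merging.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FixSpaceInUrlSketch {
    // If splitting on spaces produced more tokens than expected, merge the
    // surplus leading tokens back into a single URL value.
    static List<String> fixSpaceInURL(List<String> values, int requiredSize) {
        if (requiredSize < 1 || values.size() <= requiredSize) {
            return values; // nothing to fix
        }
        int trailing = requiredSize - 1;           // fields that follow the URL
        int urlTokens = values.size() - trailing;  // tokens that make up the URL
        List<String> fixed = new ArrayList<>(requiredSize);
        fixed.add(String.join("%20", values.subList(0, urlTokens)));
        fixed.addAll(values.subList(urlTokens, values.size()));
        return fixed;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "http://example.com/a", "b.html",
            "192.0.2.1", "20040102030405", "text/html", "1234");
        System.out.println(fixSpaceInURL(tokens, 5));
    }
}
```

Counting from the end of the list works because only the URL field is expected to contain stray spaces in this sketch.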
protected boolean isAlignedOnFirstRecord()
protected void setAlignedOnFirstRecord(boolean alignedOnFirstRecord)
public boolean isParseHttpHeaders()
public void setParseHttpHeaders(boolean parse)
Parameters: parse - The parseHttpHeaders to set.
public java.lang.String getFileExtension()
getFileExtension in class ArchiveReader
public java.lang.String getDotFileExtension()
getDotFileExtension in class ArchiveReader
protected boolean output(java.lang.String format) throws java.io.IOException, java.text.ParseException
output in class ArchiveReader
Parameters: format - Format to use outputting.
Throws:
java.io.IOException
java.text.ParseException
public boolean outputRecord(java.lang.String format) throws java.io.IOException
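Description copied from class ArchiveReader: Output passed record using passed format specifier.
outputRecord in class ArchiveReader
Parameters: format - What format to use outputting.
Throws: java.io.IOException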
public void dump(boolean compress) throws java.io.IOException, java.text.ParseException
Description copied from class ArchiveReader: Dump this file on STDOUT.
dump in class ArchiveReader
Throws:
java.io.IOException
java.text.ParseException
public ARCReader getDeleteFileOnCloseReader(java.io.File f)
getDeleteFileOnCloseReader in class ArchiveReader
protected static void output(ARCReader reader, java.lang.String format) throws java.io.IOException, java.text.ParseException
Parameters:
reader
format - Format to use outputting.
Throws:
java.io.IOException
java.text.ParseException
public static void createCDXIndexFile(java.lang.String urlOrPath) throws java.io.IOException, java.text.ParseException
Parameters: urlOrPath - The ARC file to generate a CDX index for.
Throws:
java.io.IOException
java.text.ParseException
public static void main(java.lang.String[] args) throws org.apache.commons.cli.ParseException, java.io.IOException, java.text.ParseException
usage: java org.archive.io.arc.ARCReader [--offset=#] ARCFILE
 -h,--help    Prints this message and exits.
 -o,--offset  Outputs record at this offset into arc file.
See $HERITRIX_HOME/bin/arcreader for a script that takes care of the classpath and of calling ARCReader.
Outputs using a pseudo-CDX format as described here: CDX Legend and here: Example. The legend used below is: 'CDX b e a m s c V (or v if uncompressed) n g'. The hash is a hard-coded straight SHA-1 hash of the content.
Parameters: args - Command-line arguments.
Throws:
org.apache.commons.cli.ParseException - Failed parse of the command line.
java.io.IOException
java.text.ParseException
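A minimal sketch of assembling one line in the field order of the legend 'CDX b e a m s c V n g' follows. The mapping of letters to fields (b date, e IP, a URL, m mimetype, s response code, c checksum, V offset, n length, g file name) is my reading of the CDX legend and should be treated as an assumption, as should the hexadecimal encoding of the SHA-1 digest. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class PseudoCdxSketch {
    static final String LEGEND = "CDX b e a m s c V n g";

    // Straight SHA-1 of the content, rendered as lowercase hex (encoding assumed).
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Joins the fields with single spaces in the order given by LEGEND.
    static String cdxLine(String date, String ip, String url, String mimetype,
                          String responseCode, String checksum,
                          long offset, long length, String fileName) {
        return String.join(" ", date, ip, url, mimetype, responseCode, checksum,
            Long.toString(offset), Long.toString(length), fileName);
    }

    public static void main(String[] args) {
        byte[] content = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(" " + LEGEND);
        System.out.println(cdxLine("20040102030405", "192.0.2.1",
            "http://example.com/", "text/html", "200",
            sha1Hex(content), 1024L, content.length, "example.arc.gz"));
    }
}
```

The file name example.arc.gz and all field values in main are made-up sample data.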