java.lang.Object
  org.archive.io.ArchiveReader
    org.archive.io.warc.WARCReader
public class WARCReader
WARCReader. Go via WARCReaderFactory to get an instance.
Nested Class Summary |
---|
Nested classes/interfaces inherited from class org.archive.io.ArchiveReader |
---|
ArchiveReader.ArchiveRecordIterator, ArchiveReader.RandomAccessBufferedInputStream |
Field Summary |
---|
Fields inherited from class org.archive.io.ArchiveReader |
---|
MAX_ALLOWED_RECOVERABLES |
Fields inherited from interface org.archive.io.ArchiveFileConstants |
---|
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, COMPRESSED_FILE_EXTENSION, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY |
Constructor Summary | |
---|---|
WARCReader() |
Method Summary | |
---|---|
protected WARCRecord | createArchiveRecord(java.io.InputStream is, long offset) Create new WARC record. |
static void | createCDXIndexFile(java.lang.String urlOrPath) Generate a CDX index file for a WARC file. |
void | dump(boolean compress) Dump this file on STDOUT. |
ArchiveReader | getDeleteFileOnCloseReader(java.io.File f) |
java.lang.String | getDotFileExtension() |
java.lang.String | getFileExtension() |
protected void | gotoEOR(ArchiveRecord record) Skip over any trailing new lines at end of the record so we're lined up ready to read the next. |
protected void | initialize(java.lang.String i) Convenience method used by subclass constructors. |
static void | main(java.lang.String[] args) Command-line interface to WARCReader. |
protected static void | output(WARCReader reader, java.lang.String format) Write out the WARC file. |
protected void | readExpectedChar(java.io.InputStream is, int expected) |
Methods inherited from class org.archive.io.ArchiveReader |
---|
cdxOutput, cleanupCurrentRecord, close, currentRecord, get, get, getCurrentRecord, getFileName, getIn, getInputStream, getInputStream, getLogger, getOptions, getReaderIdentifier, getStrippedFileName, getStrippedFileName, getTrueOrFalse, getVersion, isCompressed, isDigest, isStrict, isValid, iterator, logStdErr, output, outputRecord, outputRecord, rewind, setCompressed, setDigest, setIn, setReaderIdentifier, setStrict, setVersion, stripExtension, validate, validate |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
WARCReader()
Method Detail |
---|
protected void initialize(java.lang.String i)
Description copied from class: ArchiveReader
Convenience method used by subclass constructors.
initialize in class ArchiveReader
Parameters:
i - Identifier for Archive file this reader goes against.
protected void gotoEOR(ArchiveRecord record) throws java.io.IOException
gotoEOR in class ArchiveReader
Parameters:
record -
Throws:
java.io.IOException
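
The summary for gotoEOR describes skipping trailing new lines so the stream is lined up on the next record. The following is a minimal, self-contained sketch of that idea only; it is not the WARCReader implementation. The class `TrailingNewlineSkipper`, the method `skipTrailingNewlines`, and the use of `PushbackInputStream` are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of org.archive.io: illustrates skipping
// trailing line terminators so the stream sits at the start of the next record.
class TrailingNewlineSkipper {

    /**
     * Consumes consecutive CR and LF bytes. Stops at end of stream or at the
     * first other byte, which is pushed back so the next reader sees it.
     * Returns the number of bytes skipped.
     */
    static int skipTrailingNewlines(PushbackInputStream in) throws IOException {
        int skipped = 0;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\r' || c == '\n') {
                skipped++;
            } else {
                in.unread(c); // not a line terminator: leave it in the stream
                break;
            }
        }
        return skipped;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "\r\n\r\nWARC/1.0\r\n".getBytes(StandardCharsets.US_ASCII);
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(data));
        int skipped = skipTrailingNewlines(in);
        System.out.println("skipped=" + skipped + ", next=" + (char) in.read());
    }
}
```

Pushing the first non-terminator byte back keeps the sketch independent of whether the underlying stream supports mark/reset.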
protected void readExpectedChar(java.io.InputStream is, int expected) throws java.io.IOException
Throws:
java.io.IOException
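
readExpectedChar carries no description here. Judging only from its name and signature, it presumably reads one character and fails if it differs from the expected one. The sketch below shows that pattern under this assumption; `ExpectedCharReader` and `readExpected` are hypothetical names, and the error messages are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of org.archive.io: reads a single byte and
// raises an IOException when it does not match the expected value.
class ExpectedCharReader {

    static void readExpected(InputStream is, int expected) throws IOException {
        int c = is.read();
        if (c == -1) {
            throw new IOException("Unexpected end of stream, expected 0x"
                + Integer.toHexString(expected));
        }
        if (c != expected) {
            throw new IOException("Unexpected character 0x" + Integer.toHexString(c)
                + ", expected 0x" + Integer.toHexString(expected));
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream is = new ByteArrayInputStream(new byte[] {'\r', '\n'});
        readExpected(is, '\r'); // matches, returns normally
        readExpected(is, '\n'); // matches, returns normally
        System.out.println("CRLF consumed");
    }
}
```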
protected WARCRecord createArchiveRecord(java.io.InputStream is, long offset) throws java.io.IOException
createArchiveRecord in class ArchiveReader
Parameters:
is - InputStream to use.
offset - Absolute offset into WARC file.
Throws:
java.io.IOException
public void dump(boolean compress) throws java.io.IOException, java.text.ParseException
Description copied from class: ArchiveReader
Dump this file on STDOUT.
dump in class ArchiveReader
Throws:
java.io.IOException
java.text.ParseException
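
The `compress` flag on dump is not explained on this page. Assuming it means the dumped output is gzip-compressed (the inherited GZIP_DUMP constant suggests as much), the general pattern can be sketched as below. `DumpSketch` and its `dump(InputStream, OutputStream, boolean)` method are hypothetical and copy raw bytes; they do not reproduce how WARCReader walks records.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of org.archive.io: copies a stream to an
// output, optionally wrapping the output in gzip compression.
class DumpSketch {

    static void dump(InputStream in, OutputStream out, boolean compress) throws IOException {
        GZIPOutputStream gzip = compress ? new GZIPOutputStream(out) : null;
        OutputStream target = compress ? gzip : out;
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            target.write(buffer, 0, n);
        }
        if (gzip != null) {
            gzip.finish(); // write the gzip trailer without closing the caller's stream
        }
        target.flush();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "WARC/1.0\r\n".getBytes(StandardCharsets.US_ASCII);
        // Uncompressed dump to STDOUT; pass true to emit gzip bytes instead.
        dump(new ByteArrayInputStream(data), System.out, false);
    }
}
```

Calling `finish()` rather than `close()` leaves STDOUT open for the caller, which matters when the dump is one step in a longer command-line run.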
public ArchiveReader getDeleteFileOnCloseReader(java.io.File f)
getDeleteFileOnCloseReader in class ArchiveReader
public java.lang.String getDotFileExtension()
getDotFileExtension in class ArchiveReader
public java.lang.String getFileExtension()
getFileExtension in class ArchiveReader
protected static void output(WARCReader reader, java.lang.String format) throws java.io.IOException, java.text.ParseException
Parameters:
reader -
format - Format to use outputting.
Throws:
java.io.IOException
java.text.ParseException
public static void createCDXIndexFile(java.lang.String urlOrPath) throws java.io.IOException, java.text.ParseException
Parameters:
urlOrPath - The WARC file to generate a CDX index for
Throws:
java.io.IOException
java.text.ParseException
public static void main(java.lang.String[] args) throws org.apache.commons.cli.ParseException, java.io.IOException, java.text.ParseException
usage: java org.archive.io.arc.WARCReader [--offset=#] ARCFILE
 -h,--help    Prints this message and exits.
 -o,--offset  Outputs record at this offset into arc file.
Outputs using a pseudo-CDX format as described in the CDX Legend and the accompanying Example. The legend used is: 'CDX b e a m s c V (or v if uncompressed) n g'. Hash is hard-coded straight SHA-1 hash of content.
Parameters:
args - Command-line arguments.
Throws:
org.apache.commons.cli.ParseException - Failed parse of the command line.
java.io.IOException
java.text.ParseException
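
The description of main mentions two details that a short sketch can make concrete: the legend switches between `V` and `v` depending on compression, and the hash is a plain SHA-1 of the content. The class `CdxSketch` below is hypothetical and uses only the JDK. Rendering the digest as lowercase hex is an assumption, since the page does not say how the hash is encoded.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of org.archive.io.
class CdxSketch {

    /** Builds the legend line quoted above: 'V' for compressed input, 'v' otherwise. */
    static String legend(boolean compressed) {
        return "CDX b e a m s c " + (compressed ? "V" : "v") + " n g";
    }

    /** Returns the SHA-1 digest of the content as lowercase hex (encoding is an assumption). */
    static String sha1Hex(byte[] content) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException("SHA-1 unavailable", e);
        }
        byte[] digest = md.digest(content);
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(legend(true));
        System.out.println(sha1Hex("abc".getBytes(StandardCharsets.US_ASCII)));
    }
}
```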