org.archive.io.warc
Class WARCReaderFactory

java.lang.Object
  extended by org.archive.io.ArchiveReaderFactory
      extended by org.archive.io.warc.WARCReaderFactory
All Implemented Interfaces:
ArchiveFileConstants, WARCConstants

public class WARCReaderFactory
extends ArchiveReaderFactory
implements WARCConstants

Factory for WARC readers. Determines whether to return a reader for a compressed file or a reader for an uncompressed file.

Version:
$Date: 2006-08-23 17:59:04 -0700 (Wed, 23 Aug 2006) $ $Version$
Author:
stack
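
The choice between a compressed and an uncompressed reader described above can be illustrated with standard-library streams only. This is a minimal sketch, not the code of WARCReaderFactory: the class name ReaderDispatchSketch and the method open are hypothetical, and it assumes "compressed" means GZIP, as the testCompressedWARCFile description suggests.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper for illustration; not part of org.archive.io. */
final class ReaderDispatchSketch {
    private ReaderDispatchSketch() {}

    /**
     * Returns a stream of uncompressed bytes. The input is wrapped in a
     * GZIPInputStream only when its first two bytes are the GZIP magic.
     */
    static InputStream open(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);              // remember the start so the peek can be undone
        int b0 = in.read();
        int b1 = in.read();
        in.reset();              // rewind to the first byte
        boolean gzipped = (b0 == 0x1f && b1 == 0x8b);
        return gzipped ? new GZIPInputStream(in) : in;
    }
}
```

Peeking with mark and reset leaves the stream positioned at its first byte, so the same input can be handed to either branch without reopening it.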

Nested Class Summary
 class WARCReaderFactory.CompressedWARCReader
          Compressed WARC file reader.
 class WARCReaderFactory.UncompressedWARCReader
          Uncompressed WARC file reader.
 
Field Summary
 
Fields inherited from interface org.archive.io.warc.WARCConstants
COLON_SPACE, COMPRESSED_WARC_FILE_EXTENSION, CONTENT_DESCRIPTION, CONTENT_LENGTH, CONTENT_TYPE, CONTINUATION, CONTINUATION_INDEX, CONVERSION, CONVERSION_INDEX, DEFAULT_ENCODING, DEFAULT_MAX_WARC_FILE_SIZE, DOT_COMPRESSED_FILE_EXTENSION, DOT_COMPRESSED_WARC_FILE_EXTENSION, DOT_WARC_FILE_EXTENSION, FTP_CONTROL_CONVERSATION_MIMETYPE, HEADER_FIELD_KEYS, HEADER_FIELD_SEPARATOR, HEADER_KEY_BLOCK_DIGEST, HEADER_KEY_CONCURRENT_TO, HEADER_KEY_DATE, HEADER_KEY_ETAG, HEADER_KEY_FILENAME, HEADER_KEY_ID, HEADER_KEY_IP, HEADER_KEY_LAST_MODIFIED, HEADER_KEY_PAYLOAD_DIGEST, HEADER_KEY_PROFILE, HEADER_KEY_TRUNCATED, HEADER_KEY_TYPE, HEADER_KEY_URI, HEADER_LINE_ENCODING, HTTP_REQUEST_MIMETYPE, HTTP_RESPONSE_MIMETYPE, MAX_LINE_LENGTH, MAX_WARC_HEADER_LINE_LENGTH, METADATA, METADATA_INDEX, NAMED_FIELD_CHECKSUM_LABEL, NAMED_FIELD_DESCRIPTION, NAMED_FIELD_FILEDESC, NAMED_FIELD_IP_LABEL, NAMED_FIELD_RELATED_LABEL, NAMED_FIELD_TRUNCATED, NAMED_FIELD_TRUNCATED_VALUE_HEAD, NAMED_FIELD_TRUNCATED_VALUE_LENGTH, NAMED_FIELD_TRUNCATED_VALUE_TIME, NAMED_FIELD_TRUNCATED_VALUE_UNSPECIFIED, NAMED_FIELD_WARCFILENAME, PLACEHOLDER_RECORD_LENGTH_STRING, PROFILE_REVISIT_IDENTICAL_DIGEST, PROFILE_REVISIT_NOT_MODIFIED, REQUEST, REQUEST_INDEX, RESOURCE, RESOURCE_INDEX, RESPONSE, RESPONSE_INDEX, REVISIT, REVISIT_INDEX, TRUNCATED_VALUE_UNSPECIFIED, TYPE, TYPES, TYPES_LIST, WARC_010_ID, WARC_010_MAGIC, WARC_FILE_EXTENSION, WARC_HEADER_ENCODING, WARC_ID, WARC_MAGIC, WARC_VERSION, WARCINFO, WARCINFO_INDEX, WSP
 
Fields inherited from interface org.archive.io.ArchiveFileConstants
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, COMPRESSED_FILE_EXTENSION, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY
 
Method Summary
static WARCReader get(java.io.File f)
           
static WARCReader get(java.io.File f, long offset)
           
static WARCReader get(java.lang.String arcFileOrUrl)
           
static ArchiveReader get(java.lang.String s, java.io.InputStream is, boolean atFirstRecord)
           
static WARCReader get(java.net.URL arcUrl)
          Get a WARCReader.
static WARCReader get(java.net.URL arcUrl, long offset)
           
protected  ArchiveReader getArchiveReader(java.io.File f, long offset)
           
protected  ArchiveReader getArchiveReader(java.lang.String f, java.io.InputStream is, boolean atFirstRecord)
           
static boolean isWARCSuffix(java.lang.String f)
           
static boolean testCompressedWARCFile(java.io.File f)
          Check whether the file is a compressed WARC.
 
Methods inherited from class org.archive.io.ArchiveReaderFactory
addUserAgent, asRepositionable, getArchiveReader, getArchiveReader, getArchiveReader, getArchiveReader, getArchiveReader, isCompressed, makeARCLocal
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

get

public static WARCReader get(java.lang.String arcFileOrUrl)
                      throws java.net.MalformedURLException,
                             java.io.IOException
Throws:
java.net.MalformedURLException
java.io.IOException

get

public static WARCReader get(java.io.File f)
                      throws java.io.IOException
Throws:
java.io.IOException

get

public static WARCReader get(java.io.File f,
                             long offset)
                      throws java.io.IOException
Parameters:
f - A WARC file to read.
offset - Have returned Reader set to start reading at this offset.
Returns:
A WARCReader.
Throws:
java.io.IOException

getArchiveReader

protected ArchiveReader getArchiveReader(java.io.File f,
                                         long offset)
                                  throws java.io.IOException
Overrides:
getArchiveReader in class ArchiveReaderFactory
Throws:
java.io.IOException

get

public static ArchiveReader get(java.lang.String s,
                                java.io.InputStream is,
                                boolean atFirstRecord)
                         throws java.io.IOException
Throws:
java.io.IOException

getArchiveReader

protected ArchiveReader getArchiveReader(java.lang.String f,
                                         java.io.InputStream is,
                                         boolean atFirstRecord)
                                  throws java.io.IOException
Overrides:
getArchiveReader in class ArchiveReaderFactory
Throws:
java.io.IOException

get

public static WARCReader get(java.net.URL arcUrl,
                             long offset)
                      throws java.io.IOException
Throws:
java.io.IOException

get

public static WARCReader get(java.net.URL arcUrl)
                      throws java.io.IOException
Get a WARCReader. Copies the WARC to a local file in the directory named by the system property java.io.tmpdir, then returns a WARCReader that points at this local copy. Closing the returned WARCReader removes the local copy.

Parameters:
arcUrl - A URL that points at a WARC.
Returns:
A WARCReader.
Throws:
java.io.IOException
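
The copy-then-clean-up behaviour described for this method can be sketched with standard-library classes. The sketch below is an illustration under stated assumptions, not the factory's implementation: LocalCopySketch, open and localFile are hypothetical names, and the temporary file prefix and suffix are arbitrary.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for illustration; not part of org.archive.io. */
final class LocalCopySketch extends FilterInputStream {
    private final File local;

    private LocalCopySketch(File local) throws IOException {
        super(new FileInputStream(local));
        this.local = local;
    }

    /** Copies the resource at url into java.io.tmpdir and opens the copy. */
    static LocalCopySketch open(URL url) throws IOException {
        File tmpDir = new File(System.getProperty("java.io.tmpdir"));
        File local = File.createTempFile("warc-", ".tmp", tmpDir);
        try (InputStream remote = url.openStream()) {
            Files.copy(remote, local.toPath(), StandardCopyOption.REPLACE_EXISTING);
            return new LocalCopySketch(local);
        } catch (IOException e) {
            local.delete();      // do not leave a partial copy behind
            throw e;
        }
    }

    /** The local copy that backs this stream. */
    File localFile() {
        return local;
    }

    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            local.delete();      // closing the stream removes the local copy
        }
    }
}
```

Because the local copy only disappears on close, a caller of this kind of API would normally use try-with-resources or a finally block so that temporary files do not accumulate.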

testCompressedWARCFile

public static boolean testCompressedWARCFile(java.io.File f)
                                      throws java.io.IOException
Check whether the file is a compressed WARC.

Parameters:
f - File to test.
Returns:
True if this is a compressed WARC. (TODO: currently only tests whether the file is gzipped, that is, whether it begins with the GZIP magic bytes.)
Throws:
java.io.IOException - If file does not exist or is not readable.
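
The GZIP magic test mentioned in the TODO can be sketched as follows. This is an assumption about what such a check might look like, not the library's code: GzipMagicSketch and its method names are hypothetical. The two magic bytes, 0x1f followed by 0x8b, come from the GZIP file format.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration; not part of org.archive.io. */
final class GzipMagicSketch {
    private GzipMagicSketch() {}

    /** True if the first len bytes of head begin with the GZIP magic 0x1f 0x8b. */
    static boolean hasGzipMagic(byte[] head, int len) {
        return len >= 2
                && (head[0] & 0xff) == 0x1f
                && (head[1] & 0xff) == 0x8b;
    }

    /** Reads the first two bytes of f and tests them against the GZIP magic. */
    static boolean startsWithGzipMagic(File f) throws IOException {
        if (!f.canRead()) {
            throw new IOException(f + " does not exist or is not readable");
        }
        byte[] head = new byte[2];
        try (InputStream in = new FileInputStream(f)) {
            int n = 0;
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;       // file is shorter than two bytes
                }
                n += r;
            }
            return hasGzipMagic(head, n);
        }
    }
}
```

As the TODO notes, a check of this kind only shows that the file is gzipped; it does not confirm that the decompressed content is a WARC.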

isWARCSuffix

public static boolean isWARCSuffix(java.lang.String f)
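
This page gives no description for isWARCSuffix, so the following is only a guess at the kind of check the name implies. WarcSuffixSketch and looksLikeWarcName are hypothetical, and the extension values ".warc" and ".warc.gz" are assumptions standing in for the DOT_WARC_FILE_EXTENSION and DOT_COMPRESSED_WARC_FILE_EXTENSION constants, whose values are not shown here. Case-insensitive matching is also an assumption.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of org.archive.io. */
final class WarcSuffixSketch {
    // Assumed values; the real constants live in WARCConstants.
    static final String DOT_WARC = ".warc";
    static final String DOT_WARC_GZ = ".warc.gz";

    private WarcSuffixSketch() {}

    /** True if name ends with one of the assumed WARC extensions. */
    static boolean looksLikeWarcName(String name) {
        if (name == null) {
            return false;
        }
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.endsWith(DOT_WARC) || lower.endsWith(DOT_WARC_GZ);
    }
}
```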


Copyright © 2003-2011 Internet Archive. All Rights Reserved.