org.archive.io.arc
Class ARCWriter

java.lang.Object
  extended by org.archive.io.WriterPoolMember
      extended by org.archive.io.arc.ARCWriter
All Implemented Interfaces:
ARCConstants, ArchiveFileConstants

public class ARCWriter
extends WriterPoolMember
implements ARCConstants

Write ARC files. The assumption is that the caller manages access to this ARCWriter, ensuring that only one thread of control accesses this ARC file instance at any one time.

ARC files are described here: Arc File Format. This class writes version 1 of the ARC file format. It also writes version 1.1, which is version 1 with data stuffed into the body of the first ARC record in the file, the ARC file meta record itself.

An ARC file is three lines of metadata followed by an optional 'body', then a couple of '\n' characters, and then: record, '\n', record, '\n', record, etc. If we are writing compressed ARC files, each ARC record is individually gzipped and the results are concatenated together to make up a single ARC file. In GZIP terms, each ARC record is a GZIP member of the total gzipped file.

The GZIPping of the ARC file metadata record is exceptional: it is gzipped with an extra GZIP header carrying a special Internet Archive (IA) extra header field (i.e. FEXTRA is set in the GZIP header FLG field and an extra field is appended to the GZIP header). The extra field has little in it, but its presence denotes this GZIP as an Internet Archive gzipped ARC. See RFC 1952 to learn about the GZIP header structure.

This class does its GZIPping in the following fashion: each GZIP member is written with a new instance of GZIPOutputStream (actually ARCWriterGZIPOutputStream, so we can get access to the underlying stream). The underlying stream stays open across GZIPOutputStream instantiations. For the 'special' GZIPping of the ARC file metadata record, we cheat by capturing the GZIPOutputStream output in a byte array and adding the IA GZIP header to it before writing to the stream.
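As a minimal sketch of the per-member approach (illustrative Java only, not the actual ARCWriterGZIPOutputStream; class and file names here are hypothetical):

 import java.io.FileOutputStream;
 import java.io.IOException;
 import java.io.OutputStream;
 import java.util.zip.GZIPOutputStream;

 // Each record becomes its own GZIP member; the underlying stream stays open.
 public class GzipMemberSketch {
     static void writeMember(OutputStream underlying, byte[] record)
             throws IOException {
         GZIPOutputStream gz = new GZIPOutputStream(underlying);
         gz.write(record);
         // finish() completes this member's trailer without closing 'underlying'.
         // (The Deflater is left for GC here to keep the sketch short.)
         gz.finish();
     }

     public static void main(String[] args) throws IOException {
         OutputStream out = new FileOutputStream("members.arc.gz");
         try {
             writeMember(out, "record one\n".getBytes("UTF-8"));
             writeMember(out, "record two\n".getBytes("UTF-8"));
         } finally {
             out.close();
         }
         // 'gzip -t members.arc.gz' accepts the concatenated members as one file.
     }
 }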

I tried writing a resettable GZIPOutputStream and could make it work with the Sun JDK, but the IBM JDK threw an NPE inside deflate.reset() -- its zlib native call doesn't seem to like the notion of resetting -- so I gave up on it.

Because of issues such as the above, and troubles with GZIPInputStream, we should write our own GZIP*Streams, ones that are resettable and conscious of gzip members.

This class will write until we hit >= maxSize. The check is done at record boundaries; records do not span ARC files. We then close the current file, open another, and continue writing.
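Schematically, the rotation amounts to the following check between records (a simplified sketch with hypothetical names; the real logic lives in WriterPoolMember.checkSize() and createFile()):

 import java.io.File;
 import java.io.FileOutputStream;
 import java.io.IOException;
 import java.io.OutputStream;

 // Simplified record-boundary rotation: the size check happens only between
 // records, so a record never spans two files.
 class RotationSketch {
     private final long maxSize;
     private long position = 0;
     private int serial = 0;
     private OutputStream out;

     RotationSketch(long maxSize) throws IOException {
         this.maxSize = maxSize;
         this.out = nextFile();
     }

     private OutputStream nextFile() throws IOException {
         // Real writers use timestamp+serial names and a '.open' suffix.
         return new FileOutputStream(new File("sketch-" + (serial++) + ".arc.open"));
     }

     void writeRecord(byte[] record) throws IOException {
         if (position >= maxSize) { // checked at record boundary only
             out.close();
             out = nextFile();
             position = 0;
         }
         out.write(record);
         position += record.length;
     }
 }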

TESTING: Here is how to test that produced ARC files are good using the Alexa ARC c-tools:

 % av_procarc hx20040109230030-0.arc.gz | av_ziparc > \
     /tmp/hx20040109230030-0.dat.gz
 % av_ripdat /tmp/hx20040109230030-0.dat.gz > /tmp/hx20040109230030-0.cdx
 
Examine the produced cdx file to make sure it makes sense. Search for 'no-type 0'; if found, we're opening a gzip record without data to write, which is bad.

You can also run gzip -t FILENAME; it will tell you whether the ARC makes sense as GZIP.

While being written, ARCs have a '.open' suffix appended.

Author:
stack

Field Summary
 
Fields inherited from class org.archive.io.WriterPoolMember
DEFAULT_PREFIX, DEFAULT_SUFFIX, HOSTNAME_ADMINPORT_VARIABLE, HOSTNAME_VARIABLE, UTF8
 
Fields inherited from interface org.archive.io.arc.ARCConstants
ARC_FILE_EXTENSION, ARC_GZIP_EXTRA_FIELD, ARC_MAGIC_NUMBER, CHECKSUM_FIELD_KEY, CHECKSUM_HEADER_FIELD_KEY, CODE_HEADER_FIELD_KEY, COMPRESSED_ARC_FILE_EXTENSION, DEFAULT_ENCODING, DEFAULT_GZIP_HEADER_LENGTH, DEFAULT_MAX_ARC_FILE_SIZE, DOT_ARC_FILE_EXTENSION, DOT_COMPRESSED_ARC_FILE_EXTENSION, DOT_COMPRESSED_FILE_EXTENSION, FILENAME_FIELD_KEY, FILENAME_HEADER_FIELD_KEY, GZIP_HEADER_BEGIN, HEADER_FIELD_SEPARATOR, IP_HEADER_FIELD_KEY, LINE_SEPARATOR, LOCATION_HEADER_FIELD_KEY, MAX_METADATA_LINE_LENGTH, MINIMUM_RECORD_LENGTH, OFFSET_FIELD_KEY, OFFSET_HEADER_FIELD_KEY, REQUIRED_VERSION_1_HEADER_FIELDS, STATUSCODE_FIELD_KEY, TOKENIZED_PREFIX
 
Fields inherited from interface org.archive.io.ArchiveFileConstants
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, COMPRESSED_FILE_EXTENSION, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY
 
Constructor Summary
ARCWriter(java.util.concurrent.atomic.AtomicInteger serialNo, java.util.List<java.io.File> dirs, java.lang.String prefix, boolean cmprs, long maxSize)
          Constructor.
ARCWriter(java.util.concurrent.atomic.AtomicInteger serialNo, java.util.List<java.io.File> dirs, java.lang.String prefix, java.lang.String suffix, boolean cmprs, long maxSize, java.util.List meta)
          Constructor.
ARCWriter(java.util.concurrent.atomic.AtomicInteger serialNo, java.io.PrintStream out, java.io.File arc, boolean cmprs, java.lang.String a14DigitDate, java.util.List metadata)
          Constructor.
 
Method Summary
protected  java.lang.String createFile()
          Create a new file.
 java.lang.String createMetaline(java.lang.String uri, java.lang.String hostIP, java.lang.String timeStamp, java.lang.String mimetype, java.lang.String recordLength)
           
 java.lang.String getMetadataHeaderLinesTwoAndThree(java.lang.String version)
           
protected  java.lang.String getMetaLine(java.lang.String uri, java.lang.String contentType, java.lang.String hostIP, long fetchBeginTimeStamp, long recordLength)
           
protected  java.lang.String validateMetaLine(java.lang.String metaLineStr)
          Test that the metadata line is valid before writing.
 void write(java.lang.String uri, java.lang.String contentType, java.lang.String hostIP, long fetchBeginTimeStamp, long recordLength, java.io.ByteArrayOutputStream baos)
          Deprecated. Use the input-stream version directly instead
 void write(java.lang.String uri, java.lang.String contentType, java.lang.String hostIP, long fetchBeginTimeStamp, long recordLength, java.io.InputStream in)
           
 void write(java.lang.String uri, java.lang.String contentType, java.lang.String hostIP, long fetchBeginTimeStamp, long recordLength, java.io.InputStream in, boolean enforceLength)
          Write a record with the given metadata/content.
 
Methods inherited from class org.archive.io.WriterPoolMember
checkSize, checkWriteable, close, copyFrom, createFile, flush, getBaseFilename, getCreateTimestamp, getFile, getNextDirectory, getOutputStream, getPosition, getTimestampSerialNo, getTimestampSerialNo, isCompressed, postWriteRecordTasks, preWriteRecordTasks, readFullyFrom, readToLimitFrom, write, write, write
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ARCWriter

public ARCWriter(java.util.concurrent.atomic.AtomicInteger serialNo,
                 java.io.PrintStream out,
                 java.io.File arc,
                 boolean cmprs,
                 java.lang.String a14DigitDate,
                 java.util.List metadata)
          throws java.io.IOException
Constructor. Takes a stream. Use with caution: there is no upper-bound check on size; it will just keep writing.

Parameters:
serialNo - used to generate unique file name sequences
out - Where to write.
arc - File the out is connected to.
cmprs - Compress the content written.
a14DigitDate - If null, we'll write the current time.
metadata - File metadata. Can be null. Is a list of File and/or String objects.
Throws:
java.io.IOException
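
A usage sketch for this stream-based constructor (the path and settings are illustrative assumptions, not from the source):

 import java.io.File;
 import java.io.FileOutputStream;
 import java.io.IOException;
 import java.io.PrintStream;
 import java.util.Collections;
 import java.util.concurrent.atomic.AtomicInteger;
 import org.archive.io.arc.ARCWriter;

 public class StreamConstructorExample {
     public static void main(String[] args) throws IOException {
         File arc = new File("/tmp/test.arc.gz");           // hypothetical path
         PrintStream out = new PrintStream(new FileOutputStream(arc));
         ARCWriter writer = new ARCWriter(new AtomicInteger(), out, arc,
             true,                    // cmprs: gzip each record
             null,                    // a14DigitDate: null writes current time
             Collections.EMPTY_LIST); // metadata: Files and/or Strings, may be null
         writer.close();
     }
 }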

ARCWriter

public ARCWriter(java.util.concurrent.atomic.AtomicInteger serialNo,
                 java.util.List<java.io.File> dirs,
                 java.lang.String prefix,
                 boolean cmprs,
                 long maxSize)
Constructor.

Parameters:
serialNo - used to generate unique file name sequences
dirs - Where to drop the ARC files.
prefix - ARC file prefix to use. If null, we use DEFAULT_ARC_FILE_PREFIX.
cmprs - Compress the ARC files written. The compression is done by individually gzipping each record added to the ARC file: i.e. the ARC file is a bunch of gzipped records concatenated together.
maxSize - Maximum size for ARC files written.
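
A usage sketch for this constructor (directory, prefix, and size are illustrative assumptions):

 import java.io.File;
 import java.io.IOException;
 import java.util.Arrays;
 import java.util.concurrent.atomic.AtomicInteger;
 import org.archive.io.arc.ARCWriter;

 public class DirConstructorExample {
     public static void main(String[] args) throws IOException {
         ARCWriter writer = new ARCWriter(
             new AtomicInteger(),
             Arrays.asList(new File("/tmp/arcs")), // hypothetical drop directory
             "TEST",                               // ARC file prefix
             true,                                 // compress each record
             100L * 1024L * 1024L);                // rotate at ~100MB
         writer.close();
     }
 }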

ARCWriter

public ARCWriter(java.util.concurrent.atomic.AtomicInteger serialNo,
                 java.util.List<java.io.File> dirs,
                 java.lang.String prefix,
                 java.lang.String suffix,
                 boolean cmprs,
                 long maxSize,
                 java.util.List meta)
Constructor.

Parameters:
serialNo - used to generate unique file name sequences
dirs - Where to drop files.
prefix - File prefix to use.
suffix - File tail to use. If null, unused.
cmprs - Compress the records written.
maxSize - Maximum size for ARC files written.
meta - File metadata. Can be null. Is a list of File and/or String objects.
Method Detail

createFile

protected java.lang.String createFile()
                               throws java.io.IOException
Description copied from class: WriterPoolMember
Create a new file. Rotates off the current Writer and creates a new one in its place to take subsequent writes. Usually called from WriterPoolMember.checkSize().

Overrides:
createFile in class WriterPoolMember
Returns:
Name of file created.
Throws:
java.io.IOException

getMetadataHeaderLinesTwoAndThree

public java.lang.String getMetadataHeaderLinesTwoAndThree(java.lang.String version)

write

public void write(java.lang.String uri,
                  java.lang.String contentType,
                  java.lang.String hostIP,
                  long fetchBeginTimeStamp,
                  long recordLength,
                  java.io.ByteArrayOutputStream baos)
           throws java.io.IOException
Deprecated. Use the input-stream version directly instead

Throws:
java.io.IOException

write

public void write(java.lang.String uri,
                  java.lang.String contentType,
                  java.lang.String hostIP,
                  long fetchBeginTimeStamp,
                  long recordLength,
                  java.io.InputStream in)
           throws java.io.IOException
Throws:
java.io.IOException

write

public void write(java.lang.String uri,
                  java.lang.String contentType,
                  java.lang.String hostIP,
                  long fetchBeginTimeStamp,
                  long recordLength,
                  java.io.InputStream in,
                  boolean enforceLength)
           throws java.io.IOException
Write a record with the given metadata/content.

Parameters:
uri - URI for metadata-line
contentType - MIME content-type for metadata-line
hostIP - IP for metadata-line
fetchBeginTimeStamp - timestamp for metadata-line
recordLength - length for metadata-line; also may be enforced
in - source InputStream for record content
enforceLength - whether to enforce the declared length; should be true unless intentionally writing bad records for testing
Throws:
java.io.IOException
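
A sketch of writing one record (content and metadata values are made up; it assumes a constructed writer as in the constructor examples above):

 import java.io.ByteArrayInputStream;
 import java.io.IOException;
 import org.archive.io.arc.ARCWriter;

 public class WriteExample {
     static void writeOne(ARCWriter writer) throws IOException {
         byte[] content = "HTTP/1.1 200 OK\r\n\r\nhello".getBytes("ISO-8859-1");
         writer.write(
             "http://www.example.com/",         // uri for the metadata line
             "text/html",                       // MIME content-type
             "192.0.2.1",                       // host IP
             System.currentTimeMillis(),        // fetch-begin timestamp
             content.length,                    // declared record length
             new ByteArrayInputStream(content), // record content
             true);                             // enforce the declared length
     }
 }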

getMetaLine

protected java.lang.String getMetaLine(java.lang.String uri,
                                       java.lang.String contentType,
                                       java.lang.String hostIP,
                                       long fetchBeginTimeStamp,
                                       long recordLength)
                                throws java.io.IOException
Parameters:
uri - URI for the metadata line
contentType - MIME content-type for the metadata line
hostIP - IP for the metadata line
fetchBeginTimeStamp - timestamp for the metadata line
recordLength - length for the metadata line
Returns:
Metadata line for an ARCRecord made of passed components.
Throws:
java.io.IOException
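
For orientation, a version-1 URL-record metadata line is five space-separated fields: URL, IP address, 14-digit date, content-type, and length. For example (values illustrative):

 http://www.example.com/index.html 192.0.2.1 20040109230030 text/html 1234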

createMetaline

public java.lang.String createMetaline(java.lang.String uri,
                                       java.lang.String hostIP,
                                       java.lang.String timeStamp,
                                       java.lang.String mimetype,
                                       java.lang.String recordLength)

validateMetaLine

protected java.lang.String validateMetaLine(java.lang.String metaLineStr)
                                     throws java.io.IOException
Test that the metadata line is valid before writing.

Parameters:
metaLineStr - Metadata line to validate.
Returns:
The passed in metaline.
Throws:
java.io.IOException


Copyright © 2003-2011 Internet Archive. All Rights Reserved.