org.archive.io.arc
Class ARCWriter

java.lang.Object
  extended by org.archive.io.WriterPoolMember
      extended by org.archive.io.arc.ARCWriter
All Implemented Interfaces:
ARCConstants, ArchiveFileConstants

public class ARCWriter
extends WriterPoolMember
implements ARCConstants

Write ARC files. The assumption is that the caller manages access to this ARCWriter, ensuring that only one thread of control accesses this ARC file instance at any one time.

ARC files are described here: Arc File Format. This class writes version 1 of the ARC file format. It also writes version 1.1, which is version 1 with data stuffed into the body of the first ARC record in the file, the ARC file meta record itself.

An ARC file is three lines of metadata followed by an optional 'body', then a couple of '\n' characters, and then: record, '\n', record, '\n', record, etc. If we are writing compressed ARC files, each ARC record is individually gzipped and the results are concatenated together to make up a single ARC file. In GZIP terms, each ARC record is a GZIP member of the total gzipped file.

The GZIPping of the ARC file metadata record is exceptional: it is gzipped with an extra GZIP header carrying a special Internet Archive (IA) extra header field (i.e. FEXTRA is set in the GZIP header FLG field and an extra field is appended to the GZIP header). The extra field has little in it, but its presence denotes this GZIP as an Internet Archive gzipped ARC. See RFC 1952 to learn about the GZIP header structure.

This class does its GZIPping in the following fashion: each GZIP member is written with a new instance of GZIPOutputStream (actually ARCWriterGZIPOutputStream, so we can get access to the underlying stream). The underlying stream stays open across GZIPOutputStream instantiations. For the 'special' GZIPping of the ARC file metadata record, we cheat by capturing the GZIPOutputStream output in a byte array and adding the IA GZIP header to it before writing to the stream.
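As a minimal sketch of the per-member approach (illustrative Java only, not the actual ARCWriterGZIPOutputStream; class and file names here are hypothetical):

 import java.io.FileOutputStream;
 import java.io.IOException;
 import java.io.OutputStream;
 import java.util.zip.GZIPOutputStream;

 // Each record becomes its own GZIP member; the underlying stream stays open.
 public class GzipMemberSketch {
     static void writeMember(OutputStream underlying, byte[] record)
             throws IOException {
         GZIPOutputStream gz = new GZIPOutputStream(underlying);
         gz.write(record);
         // finish() completes this member's trailer without closing 'underlying'.
         // (The Deflater is left for GC here to keep the sketch short.)
         gz.finish();
     }

     public static void main(String[] args) throws IOException {
         OutputStream out = new FileOutputStream("members.arc.gz");
         try {
             writeMember(out, "record one\n".getBytes("UTF-8"));
             writeMember(out, "record two\n".getBytes("UTF-8"));
         } finally {
             out.close();
         }
         // 'gzip -t members.arc.gz' accepts the concatenated members as one file.
     }
 }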

I tried writing a resettable GZIPOutputStream and could make it work with the Sun JDK, but the IBM JDK threw an NPE inside deflate.reset() -- its zlib native call doesn't seem to like the notion of resetting -- so I gave up on it.

Because of issues such as the above, and troubles with GZIPInputStream, we should write our own GZIP*Streams, ones that are resettable and conscious of gzip members.

This class will write until we hit >= maxSize. The check is done at record boundaries; records do not span ARC files. We then close the current file, open another, and continue writing.
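Schematically, the rotation amounts to the following check between records (a simplified sketch with hypothetical names; the real logic lives in WriterPoolMember.checkSize() and createFile()):

 import java.io.File;
 import java.io.FileOutputStream;
 import java.io.IOException;
 import java.io.OutputStream;

 // Simplified record-boundary rotation: the size check happens only between
 // records, so a record never spans two files.
 class RotationSketch {
     private final long maxSize;
     private long position = 0;
     private int serial = 0;
     private OutputStream out;

     RotationSketch(long maxSize) throws IOException {
         this.maxSize = maxSize;
         this.out = nextFile();
     }

     private OutputStream nextFile() throws IOException {
         // Real writers use timestamp+serial names and a '.open' suffix.
         return new FileOutputStream(new File("sketch-" + (serial++) + ".arc.open"));
     }

     void writeRecord(byte[] record) throws IOException {
         if (position >= maxSize) { // checked at record boundary only
             out.close();
             out = nextFile();
             position = 0;
         }
         out.write(record);
         position += record.length;
     }
 }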

TESTING: Here is how to test that produced ARC files are good using the Alexa ARC c-tools:

 % av_procarc hx20040109230030-0.arc.gz | av_ziparc > \
     /tmp/hx20040109230030-0.dat.gz
 % av_ripdat /tmp/hx20040109230030-0.dat.gz > /tmp/hx20040109230030-0.cdx
 
Examine the produced cdx file to make sure it makes sense. Search for 'no-type 0'; if found, we're opening a gzip record without data to write, which is bad.

You can also run gzip -t FILENAME; it will tell you whether the ARC makes sense as GZIP.

While being written, ARCs have a '.open' suffix appended.

Author:
stack

Field Summary
 
Fields inherited from class org.archive.io.WriterPoolMember
DEFAULT_PREFIX, DEFAULT_SUFFIX, HOSTNAME_ADMINPORT_VARIABLE, HOSTNAME_VARIABLE, UTF8
 
Fields inherited from interface org.archive.io.arc.ARCConstants
ARC_FILE_EXTENSION, ARC_GZIP_EXTRA_FIELD, ARC_MAGIC_NUMBER, CHECKSUM_FIELD_KEY, CHECKSUM_HEADER_FIELD_KEY, CODE_HEADER_FIELD_KEY, COMPRESSED_ARC_FILE_EXTENSION, DEFAULT_ENCODING, DEFAULT_GZIP_HEADER_LENGTH, DEFAULT_MAX_ARC_FILE_SIZE, DOT_ARC_FILE_EXTENSION, DOT_COMPRESSED_ARC_FILE_EXTENSION, DOT_COMPRESSED_FILE_EXTENSION, FILENAME_FIELD_KEY, FILENAME_HEADER_FIELD_KEY, GZIP_HEADER_BEGIN, HEADER_FIELD_SEPARATOR, IP_HEADER_FIELD_KEY, LINE_SEPARATOR, LOCATION_HEADER_FIELD_KEY, MAX_METADATA_LINE_LENGTH, MINIMUM_RECORD_LENGTH, OFFSET_FIELD_KEY, OFFSET_HEADER_FIELD_KEY, REQUIRED_VERSION_1_HEADER_FIELDS, STATUSCODE_FIELD_KEY, TOKENIZED_PREFIX
 
Fields inherited from interface org.archive.io.ArchiveFileConstants
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, COMPRESSED_FILE_EXTENSION, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY
 
Constructor Summary
ARCWriter(java.util.concurrent.atomic.AtomicInteger serialNo, java.util.List<java.io.File> dirs, java.lang.String prefix, boolean cmprs, long maxSize)
          Constructor.
ARCWriter(java.util.concurrent.atomic.AtomicInteger serialNo, java.util.List<java.io.File> dirs, java.lang.String prefix, java.lang.String suffix, boolean cmprs, long maxSize, java.util.List meta)
          Constructor.
ARCWriter(java.util.concurrent.atomic.AtomicInteger serialNo, java.io.PrintStream out, java.io.File arc, boolean cmprs, java.lang.String a14DigitDate, java.util.List metadata)
          Constructor.
 
Method Summary
protected  java.lang.String createFile()
          Create a new file.
 java.lang.String createMetaline(java.lang.String uri, java.lang.String hostIP, java.lang.String timeStamp, java.lang.String mimetype, java.lang.String recordLength)
           
 java.lang.String getMetadataHeaderLinesTwoAndThree(java.lang.String version)
           
protected  java.lang.String getMetaLine(java.lang.String uri, java.lang.String contentType, java.lang.String hostIP, long fetchBeginTimeStamp, long recordLength)
           
protected  java.lang.String validateMetaLine(java.lang.String metaLineStr)
          Test that the metadata line is valid before writing.
 void write(java.lang.String uri, java.lang.String contentType, java.lang.String hostIP, long fetchBeginTimeStamp, long recordLength, java.io.ByteArrayOutputStream baos)
          Deprecated. Use the input-stream version directly instead
 void write(java.lang.String uri, java.lang.String contentType, java.lang.String hostIP, long fetchBeginTimeStamp, long recordLength, java.io.InputStream in)
           
 void write(java.lang.String uri, java.lang.String contentType, java.lang.String hostIP, long fetchBeginTimeStamp, long recordLength, java.io.InputStream in, boolean enforceLength)
          Write a record with the given metadata/content.
 
Methods inherited from class org.archive.io.WriterPoolMember
checkSize, checkWriteable, close, copyFrom, createFile, flush, getBaseFilename, getCreateTimestamp, getFile, getNextDirectory, getOutputStream, getPosition, getTimestampSerialNo, getTimestampSerialNo, isCompressed, postWriteRecordTasks, preWriteRecordTasks, readFullyFrom, readToLimitFrom, write, write, write
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ARCWriter

public ARCWriter(java.util.concurrent.atomic.AtomicInteger serialNo,
                 java.io.PrintStream out,
                 java.io.File arc,
                 boolean cmprs,
                 java.lang.String a14DigitDate,
                 java.util.List metadata)
          throws java.io.IOException
Constructor. Takes a stream. Use with caution: there is no upper-bound check on size; it will just keep writing.

Parameters:
serialNo - used to generate unique file name sequences
out - Where to write.
arc - File the out is connected to.
cmprs - Compress the content written.
a14DigitDate - If null, we'll write the current time.
metadata - File metadata. Can be null. Is a list of File and/or String objects.
Throws:
java.io.IOException
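
A usage sketch for this stream-based constructor (the path and settings are illustrative assumptions, not from the source):

 import java.io.File;
 import java.io.FileOutputStream;
 import java.io.IOException;
 import java.io.PrintStream;
 import java.util.Collections;
 import java.util.concurrent.atomic.AtomicInteger;
 import org.archive.io.arc.ARCWriter;

 public class StreamConstructorExample {
     public static void main(String[] args) throws IOException {
         File arc = new File("/tmp/test.arc.gz");           // hypothetical path
         PrintStream out = new PrintStream(new FileOutputStream(arc));
         ARCWriter writer = new ARCWriter(new AtomicInteger(), out, arc,
             true,                    // cmprs: gzip each record
             null,                    // a14DigitDate: null writes current time
             Collections.EMPTY_LIST); // metadata: Files and/or Strings, may be null
         writer.close();
     }
 }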

ARCWriter

public ARCWriter(java.util.concurrent.atomic.AtomicInteger serialNo,
                 java.util.List<java.io.File> dirs,
                 java.lang.String prefix,
                 boolean cmprs,
                 long maxSize)
Constructor.

Parameters:
serialNo - used to generate unique file name sequences
dirs - Where to drop the ARC files.
prefix - ARC file prefix to use. If null, we use DEFAULT_ARC_FILE_PREFIX.
cmprs - Compress the ARC files written. The compression is done by individually gzipping each record added to the ARC file: i.e. the ARC file is a bunch of gzipped records concatenated together.
maxSize - Maximum size for ARC files written.
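
A usage sketch for this constructor (directory, prefix, and size are illustrative assumptions):

 import java.io.File;
 import java.io.IOException;
 import java.util.Arrays;
 import java.util.concurrent.atomic.AtomicInteger;
 import org.archive.io.arc.ARCWriter;

 public class DirConstructorExample {
     public static void main(String[] args) throws IOException {
         ARCWriter writer = new ARCWriter(
             new AtomicInteger(),
             Arrays.asList(new File("/tmp/arcs")), // hypothetical drop directory
             "TEST",                               // ARC file prefix
             true,                                 // compress each record
             100L * 1024L * 1024L);                // rotate at ~100MB
         writer.close();
     }
 }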

ARCWriter

public ARCWriter(java.util.concurrent.atomic.AtomicInteger serialNo,
                 java.util.List<java.io.File> dirs,
                 java.lang.String prefix,
                 java.lang.String suffix,
                 boolean cmprs,
                 long maxSize,
                 java.util.List meta)
Constructor.

Parameters:
serialNo - used to generate unique file name sequences
dirs - Where to drop files.
prefix - File prefix to use.
suffix - File tail to use. If null, unused.
cmprs - Compress the records written.
maxSize - Maximum size for ARC files written.
meta - File metadata. Can be null. Is a list of File and/or String objects.
Method Detail

createFile

protected java.lang.String createFile()
                               throws java.io.IOException
Description copied from class: WriterPoolMember
Create a new file. Rotates off the current Writer and creates a new one in its place to take subsequent writes. Usually called from WriterPoolMember.checkSize().

Overrides:
createFile in class WriterPoolMember
Returns:
Name of file created.
Throws:
java.io.IOException

getMetadataHeaderLinesTwoAndThree

public java.lang.String getMetadataHeaderLinesTwoAndThree(java.lang.String version)

write

public void write(java.lang.String uri,
                  java.lang.String contentType,
                  java.lang.String hostIP,
                  long fetchBeginTimeStamp,
                  long recordLength,
                  java.io.ByteArrayOutputStream baos)
           throws java.io.IOException
Deprecated. Use the input-stream version directly instead

Throws:
java.io.IOException

write

public void write(java.lang.String uri,
                  java.lang.String contentType,
                  java.lang.String hostIP,
                  long fetchBeginTimeStamp,
                  long recordLength,
                  java.io.InputStream in)
           throws java.io.IOException
Throws:
java.io.IOException

write

public void write(java.lang.String uri,
                  java.lang.String contentType,
                  java.lang.String hostIP,
                  long fetchBeginTimeStamp,
                  long recordLength,
                  java.io.InputStream in,
                  boolean enforceLength)
           throws java.io.IOException
Write a record with the given metadata/content.

Parameters:
uri - URI for metadata-line
contentType - MIME content-type for metadata-line
hostIP - IP for metadata-line
fetchBeginTimeStamp - timestamp for metadata-line
recordLength - length for metadata-line; also may be enforced
in - source InputStream for record content
enforceLength - whether to enforce the declared length; should be true unless intentionally writing bad records for testing
Throws:
java.io.IOException
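
A sketch of writing one record (content and metadata values are made up; it assumes a constructed writer as in the constructor examples above):

 import java.io.ByteArrayInputStream;
 import java.io.IOException;
 import org.archive.io.arc.ARCWriter;

 public class WriteExample {
     static void writeOne(ARCWriter writer) throws IOException {
         byte[] content = "HTTP/1.1 200 OK\r\n\r\nhello".getBytes("ISO-8859-1");
         writer.write(
             "http://www.example.com/",         // uri for the metadata line
             "text/html",                       // MIME content-type
             "192.0.2.1",                       // host IP
             System.currentTimeMillis(),        // fetch-begin timestamp
             content.length,                    // declared record length
             new ByteArrayInputStream(content), // record content
             true);                             // enforce the declared length
     }
 }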

getMetaLine

protected java.lang.String getMetaLine(java.lang.String uri,
                                       java.lang.String contentType,
                                       java.lang.String hostIP,
                                       long fetchBeginTimeStamp,
                                       long recordLength)
                                throws java.io.IOException
Parameters:
uri - URI for the metadata line
contentType - MIME content-type for the metadata line
hostIP - IP for the metadata line
fetchBeginTimeStamp - timestamp for the metadata line
recordLength - length for the metadata line
Returns:
Metadata line for an ARCRecord made of passed components.
Throws:
java.io.IOException
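
For orientation, a version-1 URL-record metadata line is five space-separated fields: URL, IP address, 14-digit date, content-type, and length. For example (values illustrative):

 http://www.example.com/index.html 192.0.2.1 20040109230030 text/html 1234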

createMetaline

public java.lang.String createMetaline(java.lang.String uri,
                                       java.lang.String hostIP,
                                       java.lang.String timeStamp,
                                       java.lang.String mimetype,
                                       java.lang.String recordLength)

validateMetaLine

protected java.lang.String validateMetaLine(java.lang.String metaLineStr)
                                     throws java.io.IOException
Test that the metadata line is valid before writing.

Parameters:
metaLineStr - Metadata line to validate.
Returns:
The passed in metaline.
Throws:
java.io.IOException


Copyright © 2003-2011 Internet Archive. All Rights Reserved.