java.lang.Object
  org.archive.io.WriterPoolMember
      org.archive.io.arc.ARCWriter
public class ARCWriter
Write ARC files. The assumption is that the caller is managing access to this ARCWriter, ensuring that only one thread of control accesses this ARC file instance at any one time.
ARC files are described here: Arc File Format. This class writes version 1 of the ARC file format. It also writes version 1.1, which is version 1 with data stuffed into the body of the first ARC record in the file, the ARC file meta record itself.
An ARC file is three lines of meta data followed by an optional 'body', then a couple of '\n', and then: record, '\n', record, '\n', record, etc. If we are writing compressed ARC files, each of the ARC file records is individually gzipped and the results are concatenated together to make up a single ARC file. In GZIP terms, each ARC record is a GZIP member of the total gzipped file.
The GZIPping of the ARC file meta data is exceptional: it is GZIPped with an extra GZIP header carrying a special Internet Archive (IA) extra header field (i.e. FEXTRA is set in the GZIP header FLG field and an extra field is appended to the GZIP header). The extra field has little in it, but its presence denotes this GZIP as an Internet Archive gzipped ARC. See RFC 1952 to learn about the GZIP header structure.
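The shape of such a header can be sketched with plain byte-pushing, following RFC 1952. The subfield id bytes ('L', 'X') and the empty payload below are illustrative assumptions, not necessarily the exact bytes ARCWriter emits:

```java
import java.io.ByteArrayOutputStream;

// Sketch of a GZIP member header with FEXTRA set, per RFC 1952.
// Layout: magic, CM, FLG, MTIME(4), XFL, OS, then XLEN and the extra field.
public class GzipExtraHeader {
    static final int FEXTRA = 1 << 2; // FLG bit 2 per RFC 1952

    public static byte[] header(byte si1, byte si2, byte[] subfield) {
        ByteArrayOutputStream b = new ByteArrayOutputStream();
        b.write(0x1f); b.write(0x8b);           // gzip magic
        b.write(8);                             // CM = deflate
        b.write(FEXTRA);                        // FLG with FEXTRA set
        for (int i = 0; i < 4; i++) b.write(0); // MTIME = 0 (unknown)
        b.write(0);                             // XFL
        b.write(0xff);                          // OS = unknown
        int xlen = 4 + subfield.length;         // SI1 + SI2 + LEN(2) + data
        b.write(xlen & 0xff); b.write((xlen >> 8) & 0xff); // XLEN, little-endian
        b.write(si1); b.write(si2);             // subfield id
        b.write(subfield.length & 0xff); b.write((subfield.length >> 8) & 0xff);
        b.write(subfield, 0, subfield.length);  // subfield payload
        return b.toByteArray();
    }

    public static void main(String[] args) {
        // Hypothetical IA subfield with an empty payload.
        byte[] h = header((byte) 'L', (byte) 'X', new byte[0]);
        System.out.println(h.length);           // 10 fixed bytes + XLEN(2) + subfield header(4)
        System.out.println(h[3] & FEXTRA);      // FEXTRA bit is set
    }
}
```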
This class then does its GZIPping in the following fashion: each GZIP member is written with a new instance of GZIPOutputStream -- actually ARCWriterGZIPOutputStream, so we can get access to the underlying stream. The underlying stream stays open across GZIPOutputStream instantiations. For the 'special' GZIPping of the ARC file meta data, we cheat by catching the GZIPOutputStream output in a byte array and adding the IA GZIP header to it before writing to the stream.
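The member-per-record scheme can be demonstrated with stock java.util.zip classes: a fresh GZIPOutputStream per record over a shared underlying stream, members simply concatenated. This is a sketch of the technique, not ARCWriter's actual code:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Each record is compressed as its own GZIP member; members are
// concatenated on the shared underlying stream, as in a compressed ARC.
public class GzipMembers {
    public static byte[] writeMembers(byte[][] records) throws IOException {
        ByteArrayOutputStream underlying = new ByteArrayOutputStream();
        for (byte[] rec : records) {
            // New GZIPOutputStream per member; finish() writes the member's
            // trailer but leaves the underlying stream open for the next one.
            GZIPOutputStream gz = new GZIPOutputStream(underlying);
            gz.write(rec);
            gz.finish();
        }
        return underlying.toByteArray();
    }

    public static byte[] readAll(byte[] gzipped) throws IOException {
        // java.util.zip.GZIPInputStream reads across concatenated members.
        GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (int n; (n = in.read(buf)) != -1; ) out.write(buf, 0, n);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] file = writeMembers(new byte[][] { "rec1\n".getBytes(), "rec2\n".getBytes() });
        System.out.print(new String(readAll(file))); // both records round-trip
    }
}
```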
I tried writing a resettable GZIPOutputStream and could make it work with the Sun JDK, but the IBM JDK threw an NPE inside deflate.reset -- its zlib native call doesn't seem to like the notion of resetting -- so I gave up on it.
Because of issues such as the above, and troubles with GZIPInputStream, we should write our own GZIP*Streams: ones that are resettable and conscious of gzip members.
This class will write until we hit >= maxSize. The check is done at a record boundary; records do not span ARC files. We then close the current file, open another, and continue writing.
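The record-boundary rotation described above can be sketched as follows. In the real class the size check, serial numbering, and file handling live in WriterPoolMember; the names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of record-boundary rotation: before each record we
// check the current file's size; records never span files, so a single
// record may push a file past maxSize, but the next record starts a new one.
public class RotationSketch {
    final long maxSize;
    final List<Long> fileSizes = new ArrayList<>(); // bytes written per "file"
    long current = 0;
    boolean open = false;

    RotationSketch(long maxSize) { this.maxSize = maxSize; }

    void writeRecord(long recordLength) {
        if (!open || current >= maxSize) { // check only at record boundary
            if (open) fileSizes.add(current);
            current = 0;
            open = true;
        }
        current += recordLength;
    }

    List<Long> close() {
        if (open) { fileSizes.add(current); open = false; }
        return fileSizes;
    }

    public static void main(String[] args) {
        RotationSketch w = new RotationSketch(100);
        for (int i = 0; i < 5; i++) w.writeRecord(60); // five 60-byte records
        System.out.println(w.close()); // [120, 120, 60]
    }
}
```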
TESTING: Here is how to test that produced ARC files are good using the alexa ARC c-tools:
% av_procarc hx20040109230030-0.arc.gz | av_ziparc > /tmp/hx20040109230030-0.dat.gz
% av_ripdat /tmp/hx20040109230030-0.dat.gz > /tmp/hx20040109230030-0.cdx

Examine the produced cdx file to make sure it makes sense. Search for 'no-type 0'. If found, then we're opening a gzip record w/o data to write. This is bad.
You can also do 'gzip -t FILENAME' and it will tell you if the ARC makes sense to GZIP.
While being written, ARCs have a '.open' suffix appended.
Field Summary

Fields inherited from class org.archive.io.WriterPoolMember:
    DEFAULT_PREFIX, DEFAULT_SUFFIX, HOSTNAME_ADMINPORT_VARIABLE, HOSTNAME_VARIABLE, UTF8

Fields inherited from interface org.archive.io.ArchiveFileConstants:
    ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, COMPRESSED_FILE_EXTENSION, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY
Constructor Summary

ARCWriter(java.util.concurrent.atomic.AtomicInteger serialNo,
          java.util.List<java.io.File> dirs,
          java.lang.String prefix,
          boolean cmprs,
          long maxSize)
    Constructor.

ARCWriter(java.util.concurrent.atomic.AtomicInteger serialNo,
          java.util.List<java.io.File> dirs,
          java.lang.String prefix,
          java.lang.String suffix,
          boolean cmprs,
          long maxSize,
          java.util.List meta)
    Constructor.

ARCWriter(java.util.concurrent.atomic.AtomicInteger serialNo,
          java.io.PrintStream out,
          java.io.File arc,
          boolean cmprs,
          java.lang.String a14DigitDate,
          java.util.List metadata)
    Constructor.
Method Summary

protected java.lang.String createFile()
    Create a new file.

java.lang.String createMetaline(java.lang.String uri,
                                java.lang.String hostIP,
                                java.lang.String timeStamp,
                                java.lang.String mimetype,
                                java.lang.String recordLength)

java.lang.String getMetadataHeaderLinesTwoAndThree(java.lang.String version)

protected java.lang.String getMetaLine(java.lang.String uri,
                                       java.lang.String contentType,
                                       java.lang.String hostIP,
                                       long fetchBeginTimeStamp,
                                       long recordLength)

protected java.lang.String validateMetaLine(java.lang.String metaLineStr)
    Test that the metadata line is valid before writing.

void write(java.lang.String uri,
           java.lang.String contentType,
           java.lang.String hostIP,
           long fetchBeginTimeStamp,
           long recordLength,
           java.io.ByteArrayOutputStream baos)
    Deprecated. Use the input-stream version directly instead.

void write(java.lang.String uri,
           java.lang.String contentType,
           java.lang.String hostIP,
           long fetchBeginTimeStamp,
           long recordLength,
           java.io.InputStream in)

void write(java.lang.String uri,
           java.lang.String contentType,
           java.lang.String hostIP,
           long fetchBeginTimeStamp,
           long recordLength,
           java.io.InputStream in,
           boolean enforceLength)
    Write a record with the given metadata/content.
Methods inherited from class org.archive.io.WriterPoolMember:
    checkSize, checkWriteable, close, copyFrom, createFile, flush, getBaseFilename, getCreateTimestamp, getFile, getNextDirectory, getOutputStream, getPosition, getTimestampSerialNo, getTimestampSerialNo, isCompressed, postWriteRecordTasks, preWriteRecordTasks, readFullyFrom, readToLimitFrom, write, write, write

Methods inherited from class java.lang.Object:
    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail

public ARCWriter(java.util.concurrent.atomic.AtomicInteger serialNo,
                 java.io.PrintStream out,
                 java.io.File arc,
                 boolean cmprs,
                 java.lang.String a14DigitDate,
                 java.util.List metadata)
          throws java.io.IOException

    Parameters:
        serialNo - used to generate unique file name sequences
        out - Where to write.
        arc - File the out is connected to.
        cmprs - Compress the content written.
        metadata - File meta data. Can be null. Is list of File and/or String objects.
        a14DigitDate - If null, we'll write current time.
    Throws:
        java.io.IOException
public ARCWriter(java.util.concurrent.atomic.AtomicInteger serialNo,
                 java.util.List<java.io.File> dirs,
                 java.lang.String prefix,
                 boolean cmprs,
                 long maxSize)

    Parameters:
        serialNo - used to generate unique file name sequences
        dirs - Where to drop the ARC files.
        prefix - ARC file prefix to use. If null, we use DEFAULT_ARC_FILE_PREFIX.
        cmprs - Compress the ARC files written. The compression is done by individually gzipping each record added to the ARC file: i.e. the ARC file is a bunch of gzipped records concatenated together.
        maxSize - Maximum size for ARC files written.

public ARCWriter(java.util.concurrent.atomic.AtomicInteger serialNo,
                 java.util.List<java.io.File> dirs,
                 java.lang.String prefix,
                 java.lang.String suffix,
                 boolean cmprs,
                 long maxSize,
                 java.util.List meta)

    Parameters:
        serialNo - used to generate unique file name sequences
        dirs - Where to drop files.
        prefix - File prefix to use.
        suffix - File tail to use. If null, unused.
        cmprs - Compress the records written.
        maxSize - Maximum size for ARC files written.
        meta - File meta data. Can be null. Is list of File and/or String objects.

Method Detail
protected java.lang.String createFile()
                               throws java.io.IOException

    Create a new file. See WriterPoolMember.checkSize().
    Overrides:
        createFile in class WriterPoolMember
    Throws:
        java.io.IOException
public java.lang.String getMetadataHeaderLinesTwoAndThree(java.lang.String version)
public void write(java.lang.String uri,
                  java.lang.String contentType,
                  java.lang.String hostIP,
                  long fetchBeginTimeStamp,
                  long recordLength,
                  java.io.ByteArrayOutputStream baos)
           throws java.io.IOException

    Deprecated. Use the input-stream version directly instead.
    Throws:
        java.io.IOException
public void write(java.lang.String uri,
                  java.lang.String contentType,
                  java.lang.String hostIP,
                  long fetchBeginTimeStamp,
                  long recordLength,
                  java.io.InputStream in)
           throws java.io.IOException

    Throws:
        java.io.IOException
public void write(java.lang.String uri,
                  java.lang.String contentType,
                  java.lang.String hostIP,
                  long fetchBeginTimeStamp,
                  long recordLength,
                  java.io.InputStream in,
                  boolean enforceLength)
           throws java.io.IOException

    Write a record with the given metadata/content.
    Parameters:
        uri - URI for metadata-line
        contentType - MIME content-type for metadata-line
        hostIP - IP for metadata-line
        fetchBeginTimeStamp - timestamp for metadata-line
        recordLength - length for metadata-line; also may be enforced
        in - source InputStream for record content
        enforceLength - whether to enforce the declared length; should be true unless intentionally writing bad records for testing
    Throws:
        java.io.IOException
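The length-enforcement idea can be understood as a strict copy of exactly recordLength bytes from the source stream. This sketch shows the failure mode when the stream runs short; it is an illustration, not ARCWriter's actual implementation:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Sketch of length enforcement: copy exactly recordLength bytes from the
// source stream, throwing if it ends early. Any trailing bytes left in
// the stream are simply not consumed.
public class LengthEnforcingCopy {
    public static void copyExactly(InputStream in, OutputStream out, long recordLength)
            throws IOException {
        byte[] buf = new byte[4096];
        long remaining = recordLength;
        while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n == -1) {
                throw new IOException("Record content ended after "
                        + (recordLength - remaining) + " of " + recordLength + " bytes");
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
    }
}
```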
protected java.lang.String getMetaLine(java.lang.String uri,
                                       java.lang.String contentType,
                                       java.lang.String hostIP,
                                       long fetchBeginTimeStamp,
                                       long recordLength)
                                throws java.io.IOException

    Parameters:
        uri -
        contentType -
        hostIP -
        fetchBeginTimeStamp -
        recordLength -
    Throws:
        java.io.IOException
public java.lang.String createMetaline(java.lang.String uri, java.lang.String hostIP, java.lang.String timeStamp, java.lang.String mimetype, java.lang.String recordLength)
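An ARC version-1 metadata line is the five space-separated fields the createMetaline parameters suggest (URL, IP address, 14-digit archive date, content type, record length) terminated by '\n'. A minimal sketch of that line format, not ARCWriter's actual implementation (which also validates the fields):

```java
// Sketch of an ARC v1 URL-record metadata line: five space-separated
// fields terminated by '\n', in the order the ARC file format defines
// (URL, IP-address, Archive-date, Content-type, Archive-length).
public class MetalineSketch {
    public static String createMetaline(String uri, String hostIP, String timeStamp,
                                        String mimetype, String recordLength) {
        return uri + ' ' + hostIP + ' ' + timeStamp + ' '
                + mimetype + ' ' + recordLength + '\n';
    }

    public static void main(String[] args) {
        System.out.print(createMetaline("http://example.com/", "93.184.216.34",
                "20040109230030", "text/html", "1234"));
    }
}
```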
protected java.lang.String validateMetaLine(java.lang.String metaLineStr)
                                     throws java.io.IOException

    Test that the metadata line is valid before writing.
    Parameters:
        metaLineStr -
    Throws:
        java.io.IOException