org.archive.crawler.writer
Class ARCWriterProcessor
java.lang.Object
javax.management.Attribute
org.archive.crawler.settings.Type
org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ModuleType
org.archive.crawler.framework.Processor
org.archive.crawler.framework.WriterPoolProcessor
org.archive.crawler.writer.ARCWriterProcessor
- All Implemented Interfaces:
- java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants, FetchStatusCodes, CrawlStatusListener, ARCConstants, ArchiveFileConstants, WriterPoolSettings
public class ARCWriterProcessor
- extends WriterPoolProcessor
- implements CoreAttributeConstants, ARCConstants, CrawlStatusListener, WriterPoolSettings, FetchStatusCodes
Processor module that writes the results of successful fetches (and
perhaps someday, certain kinds of network failures) in the Internet Archive
ARC file format.
This class assumes that there is only one ARCWriterProcessor per
Heritrix instance.
- Author:
- Parker Thompson
- See Also:
- Serialized Form
Fields inherited from class org.archive.crawler.framework.WriterPoolProcessor
ANNOTATION_UNWRITTEN, ATTR_COMPRESS, ATTR_MAX_BYTES_WRITTEN, ATTR_MAX_SIZE_BYTES, ATTR_PATH, ATTR_POOL_MAX_ACTIVE, ATTR_POOL_MAX_WAIT, ATTR_PREFIX, ATTR_SKIP_IDENTICAL_DIGESTS, ATTR_SUFFIX, DEFAULT_COMPRESS
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
Fields inherited from interface org.archive.io.arc.ARCConstants
ARC_FILE_EXTENSION, ARC_GZIP_EXTRA_FIELD, ARC_MAGIC_NUMBER, CHECKSUM_FIELD_KEY, CHECKSUM_HEADER_FIELD_KEY, CODE_HEADER_FIELD_KEY, COMPRESSED_ARC_FILE_EXTENSION, DEFAULT_ENCODING, DEFAULT_GZIP_HEADER_LENGTH, DEFAULT_MAX_ARC_FILE_SIZE, DOT_ARC_FILE_EXTENSION, DOT_COMPRESSED_ARC_FILE_EXTENSION, DOT_COMPRESSED_FILE_EXTENSION, FILENAME_FIELD_KEY, FILENAME_HEADER_FIELD_KEY, GZIP_HEADER_BEGIN, HEADER_FIELD_SEPARATOR, IP_HEADER_FIELD_KEY, LINE_SEPARATOR, LOCATION_HEADER_FIELD_KEY, MAX_METADATA_LINE_LENGTH, MINIMUM_RECORD_LENGTH, OFFSET_FIELD_KEY, OFFSET_HEADER_FIELD_KEY, REQUIRED_VERSION_1_HEADER_FIELDS, STATUSCODE_FIELD_KEY, TOKENIZED_PREFIX
Fields inherited from interface org.archive.io.ArchiveFileConstants
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, COMPRESSED_FILE_EXTENSION, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
Methods inherited from class org.archive.crawler.framework.WriterPoolProcessor
cacheMetadata, checkBytesWritten, checkpointRecover, crawlCheckpoint, crawlEnded, crawlEnding, crawlPaused, crawlPausing, crawlResuming, crawlStarted, getAttributeUnchecked, getCheckpointStateFile, getFirstrecordBody, getHostAddress, getMaxSize, getMaxToWrite, getMetadata, getOutputDirs, getPool, getPoolMaximumActive, getPoolMaximumWait, getPrefix, getSerialNo, getSuffix, getTotalBytesWritten, initialTasks, isCompressed, loadCheckpointSerialNumber, saveCheckpointSerialNumber, setPool, setTotalBytesWritten, shouldWrite
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
Methods inherited from class javax.management.Attribute
getName, hashCode
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
ARCWriterProcessor
public ARCWriterProcessor(java.lang.String name)
- Parameters:
name
- Name of this writer.
getDefaultMaxFileSize
public long getDefaultMaxFileSize()
- Description copied from class:
WriterPoolProcessor
- Default maximum file size.
- Specified by:
getDefaultMaxFileSize
in class WriterPoolProcessor
getDefaultPath
protected java.lang.String[] getDefaultPath()
- Overrides:
getDefaultPath
in class WriterPoolProcessor
setupPool
protected void setupPool(java.util.concurrent.atomic.AtomicInteger serialNo)
- Description copied from class:
WriterPoolProcessor
- Set up pool of files.
- Specified by:
setupPool
in class WriterPoolProcessor
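A minimal sketch of how a shared java.util.concurrent.atomic.AtomicInteger, such as the serialNo argument of setupPool, could give each pooled file a distinct serial number. The class SerialNameSketch, the method nextFileName, and the name layout (prefix, dash, five-digit serial, suffix) are assumptions made for illustration; this page does not state how Heritrix actually names its files.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration only; not part of Heritrix.
class SerialNameSketch {
    /**
     * Builds a file name from a prefix, the next serial number, and a suffix.
     * getAndIncrement() is atomic, so concurrent callers never receive the
     * same serial number. The name layout is an assumption.
     */
    static String nextFileName(String prefix, AtomicInteger serialNo, String suffix) {
        int serial = serialNo.getAndIncrement();
        return String.format(Locale.ROOT, "%s-%05d%s", prefix, serial, suffix);
    }
}
```

Because the counter is passed in rather than owned by the helper, the same counter can be shared by every writer in a pool, which is presumably why setupPool receives it as a parameter.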
innerProcess
protected void innerProcess(CrawlURI curi)
- Writes a CrawlURI and its associated data to a store file.
Currently this method understands the following URI schemes: dns, http,
and https.
- Specified by:
innerProcess
in class WriterPoolProcessor
- Parameters:
curi
- CrawlURI to process.
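The scheme restriction described for innerProcess can be pictured with a small standalone check. This is only a sketch: SchemeCheckSketch and isSupported are hypothetical names, plain strings stand in for CrawlURI, and the real method may decide differently (for example, by also consulting fetch status).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Heritrix.
class SchemeCheckSketch {
    // The three schemes named in the innerProcess description.
    private static final Set<String> SUPPORTED =
            new HashSet<String>(Arrays.asList("dns", "http", "https"));

    /** Returns true if the URI parses and its scheme is dns, http, or https. */
    static boolean isSupported(String uri) {
        try {
            String scheme = new URI(uri).getScheme();
            // Schemes are case-insensitive, so compare in lower case.
            return scheme != null
                    && SUPPORTED.contains(scheme.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // An unparseable URI is treated as unsupported.
            return false;
        }
    }
}
```

Under this sketch, "dns:example.com" and "https://example.com/" are accepted, while "ftp://example.com/file" is skipped.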
write
protected void write(CrawlURI curi,
long recordLength,
java.io.InputStream in,
java.lang.String ip)
throws java.io.IOException
- Throws:
java.io.IOException
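The parameters of write (a record length, an input stream, and an IP address) match the shape of an ARC version 1 URL record, whose header line is commonly described as "URL IP-address archive-date content-type archive-length" followed by the record body. The sketch below writes one such record to an arbitrary output stream. ArcRecordSketch and writeRecord are hypothetical names, the UTF-8 header encoding and the trailing newline are assumptions, and the code does not reproduce what ARCWriterProcessor itself does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Heritrix.
class ArcRecordSketch {
    /**
     * Writes a space-separated header line, then exactly recordLength bytes
     * from the input stream, then a newline that separates records.
     */
    static void writeRecord(OutputStream out, String url, String ip,
                            String date14, String mimeType,
                            long recordLength, InputStream in)
            throws IOException {
        String header = url + " " + ip + " " + date14 + " " + mimeType
                + " " + recordLength + "\n";
        out.write(header.getBytes(StandardCharsets.UTF_8));

        byte[] buf = new byte[4096];
        long remaining = recordLength;
        while (remaining > 0) {
            // Never read past the declared record length.
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) {
                throw new IOException("Stream ended with " + remaining
                        + " bytes still expected");
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        out.write('\n');
    }
}
```

Copying exactly recordLength bytes, rather than reading to end of stream, keeps the length in the header consistent with the bytes that follow it; a short stream is reported as an IOException instead of producing a malformed record.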
getFirstrecordStylesheet
protected java.lang.String getFirstrecordStylesheet()
- Overrides:
getFirstrecordStylesheet
in class WriterPoolProcessor
Copyright © 2003-2011 Internet Archive. All Rights Reserved.