org.archive.crawler.writer
Class Kw3WriterProcessor

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.writer.Kw3WriterProcessor
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants, Kw3Constants

public class Kw3WriterProcessor
extends Processor
implements CoreAttributeConstants, Kw3Constants

Processor module that writes the results of successful fetches to files on disk. These files are MIME-files of the type used by the Swedish National Library's Kulturarw3 web harvesting [http://www.kb.se/kw3/]. Each URI gets written to its own file and has a path consisting of:

Example: '/53/www.kb.se/current/6879ad79c0ccf886ee8ca55d80e5d6a1.1169211837'

The MIME-file itself consists of three parts:

Author:
oskar
See Also:
Serialized Form
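
The example path in the class description can be illustrated with a short sketch. The list of path components did not survive on this page, so the layout below is inferred from that single example and is an assumption: a two-character directory (here taken from the MD5 of the host name), the host name, a 'current' directory, and a file named with a 32-character hex digest plus a timestamp in seconds. The class name Kw3PathSketch and its methods are hypothetical and are not part of Kw3WriterProcessor.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; not part of the Heritrix API.
final class Kw3PathSketch {

    /** Returns the lowercase hex MD5 digest of the UTF-8 bytes of s. */
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /**
     * Builds a path shaped like the documented example.
     * Assumption: the leading directory is the first two hex
     * characters of the host's MD5 digest.
     */
    static String buildPath(String host, String url, long fetchTimeSeconds) {
        String hostPrefix = md5Hex(host).substring(0, 2);
        return "/" + hostPrefix + "/" + host + "/current/"
                + md5Hex(url) + "." + fetchTimeSeconds;
    }

    public static void main(String[] args) {
        System.out.println(buildPath("www.kb.se", "http://www.kb.se/", 1169211837L));
    }
}
```

The sketch only reproduces the shape of the path; the digests it produces need not match those written by the real processor.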

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_CHMOD
          Key to use asking settings if chmod should be executed.
static java.lang.String ATTR_CHMOD_VALUE
          Key to use asking settings for the new chmod value.
static java.lang.String ATTR_COLLECTION
          Key for the collection attribute.
static java.lang.String ATTR_HARVESTER
          Key for the harvester attribute.
static java.lang.String ATTR_MAX_BYTES_WRITTEN
          Key for the maximum ARC bytes to write attribute.
static java.lang.String ATTR_MAX_SIZE_BYTES
          Key to use asking settings for max size value.
static java.lang.String ATTR_PATH
          Key to use asking settings for arc path value.
static java.lang.String DEFAULT_CHMOD_VALUE
          Default value for permissions.
static java.lang.String DEFAULT_COLLECTION_VALUE
          Default value for collection.
static java.lang.String DEFAULT_HARVESTER_VALUE
          Default value for harvester.
static int DEFAULT_MAX_FILE_SIZE
          Default max file size.
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Fields inherited from interface org.archive.crawler.writer.Kw3Constants
ARCHIVE_TIME_KEY, COLLECTION_KEY, CONTENT_LENGTH_KEY, CONTENT_MD5_KEY, CONTENT_TYPE_KEY, HARVESTER_KEY, HEADER_LENGTH_KEY, HEADER_MD5_KEY, IP_ADDRESS_KEY, STATUS_CODE_KEY, URL_KEY
 
Constructor Summary
Kw3WriterProcessor(java.lang.String name)
           
 
Method Summary
protected  void initialTasks()
          Classes subclassing this one should override this method to perform processor specific actions.
protected  java.io.OutputStream initOutputStream(CrawlURI curi)
           
protected  void innerProcess(CrawlURI curi)
          Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.
protected  void writeArchiveInfoPart(java.lang.String boundary, CrawlURI curi, ReplayInputStream ris, java.io.OutputStream out)
           
protected  void writeContentPart(java.lang.String boundary, CrawlURI curi, ReplayInputStream ris, java.io.OutputStream out)
           
protected  void writeHeaderPart(java.lang.String boundary, ReplayInputStream ris, java.io.OutputStream out)
           
protected  void writeMimeFile(CrawlURI curi)
           
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

ATTR_PATH

public static final java.lang.String ATTR_PATH
Key to use asking settings for arc path value.

See Also:
Constant Field Values

ATTR_MAX_SIZE_BYTES

public static final java.lang.String ATTR_MAX_SIZE_BYTES
Key to use asking settings for max size value.

See Also:
Constant Field Values

DEFAULT_MAX_FILE_SIZE

public static final int DEFAULT_MAX_FILE_SIZE
Default max file size.

See Also:
Constant Field Values

ATTR_CHMOD

public static final java.lang.String ATTR_CHMOD
Key to use asking settings if chmod should be executed.

See Also:
Constant Field Values

ATTR_CHMOD_VALUE

public static final java.lang.String ATTR_CHMOD_VALUE
Key to use asking settings for the new chmod value.

See Also:
Constant Field Values

DEFAULT_CHMOD_VALUE

public static final java.lang.String DEFAULT_CHMOD_VALUE
Default value for permissions.

See Also:
Constant Field Values
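
As an illustration of how a string-valued chmod setting might be interpreted, the following sketch converts a three-digit octal string into a set of POSIX permissions. It assumes the value is written in the usual octal form (for example "644"); this page does not state the format or the default, and the class ChmodSketch is hypothetical. How Kw3WriterProcessor itself applies the value is not shown here.

```java
import java.nio.file.attribute.PosixFilePermission;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper; not part of the Heritrix API.
final class ChmodSketch {

    // Permission bits from most significant (owner read) to least (others execute).
    private static final PosixFilePermission[] ORDER = {
        PosixFilePermission.OWNER_READ,
        PosixFilePermission.OWNER_WRITE,
        PosixFilePermission.OWNER_EXECUTE,
        PosixFilePermission.GROUP_READ,
        PosixFilePermission.GROUP_WRITE,
        PosixFilePermission.GROUP_EXECUTE,
        PosixFilePermission.OTHERS_READ,
        PosixFilePermission.OTHERS_WRITE,
        PosixFilePermission.OTHERS_EXECUTE
    };

    /** Converts a value such as "644" into the matching permission set. */
    static Set<PosixFilePermission> fromOctal(String octal) {
        if (octal == null || !octal.matches("[0-7]{3}")) {
            throw new IllegalArgumentException(
                    "Expected three octal digits, got: " + octal);
        }
        int mode = Integer.parseInt(octal, 8);
        Set<PosixFilePermission> perms = EnumSet.noneOf(PosixFilePermission.class);
        for (int i = 0; i < ORDER.length; i++) {
            // Bit 8 corresponds to ORDER[0], bit 0 to ORDER[8].
            if ((mode & (1 << (8 - i))) != 0) {
                perms.add(ORDER[i]);
            }
        }
        return perms;
    }
}
```

On a file system that supports POSIX attributes, the resulting set can be passed to java.nio.file.Files.setPosixFilePermissions to apply it to a written file.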

ATTR_MAX_BYTES_WRITTEN

public static final java.lang.String ATTR_MAX_BYTES_WRITTEN
Key for the maximum ARC bytes to write attribute.

See Also:
Constant Field Values

ATTR_COLLECTION

public static final java.lang.String ATTR_COLLECTION
Key for the collection attribute.

See Also:
Constant Field Values

DEFAULT_COLLECTION_VALUE

public static final java.lang.String DEFAULT_COLLECTION_VALUE
Default value for collection.

See Also:
Constant Field Values

ATTR_HARVESTER

public static final java.lang.String ATTR_HARVESTER
Key for the harvester attribute.

See Also:
Constant Field Values

DEFAULT_HARVESTER_VALUE

public static final java.lang.String DEFAULT_HARVESTER_VALUE
Default value for harvester.

See Also:
Constant Field Values
Constructor Detail

Kw3WriterProcessor

public Kw3WriterProcessor(java.lang.String name)
Parameters:
name - Name of this processor.
Method Detail

initialTasks

protected void initialTasks()
Description copied from class: Processor
Classes subclassing this one should override this method to perform processor specific actions.

This method is guaranteed to be called after the crawl is set up, but before any URI-processing has occurred.

Overrides:
initialTasks in class Processor

innerProcess

protected void innerProcess(CrawlURI curi)
Description copied from class: Processor
Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.

Overrides:
innerProcess in class Processor
Parameters:
curi - The CrawlURI being processed.
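
The ordering described for initialTasks and innerProcess can be pictured with a simplified stand-in. The classes below are hypothetical and deliberately do not use the real Processor or CrawlURI types (a plain String stands in for the URI); they only show the contract that setup runs once before any per-URI work.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not the Heritrix Processor class.
final class LifecycleSketch {

    abstract static class SketchProcessor {
        /** One-time setup hook; the default does nothing. */
        protected void initialTasks() {
        }

        /** Per-URI hook that subclasses must implement. */
        protected abstract void innerProcess(String uri);
    }

    /** Records the order in which its hooks are invoked. */
    static final class RecordingWriter extends SketchProcessor {
        final List<String> events = new ArrayList<>();

        @Override
        protected void initialTasks() {
            events.add("init");
        }

        @Override
        protected void innerProcess(String uri) {
            events.add("write " + uri);
        }
    }

    /** Plays the role of the framework: setup once, then one call per URI. */
    static List<String> run(List<String> uris) {
        RecordingWriter writer = new RecordingWriter();
        writer.initialTasks();
        for (String uri : uris) {
            writer.innerProcess(uri);
        }
        return writer.events;
    }
}
```

In the real crawler the framework, not the subclass, decides when these hooks are called; the run method here only imitates that role.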

writeMimeFile

protected void writeMimeFile(CrawlURI curi)
                      throws java.io.IOException
Throws:
java.io.IOException

initOutputStream

protected java.io.OutputStream initOutputStream(CrawlURI curi)
                                         throws java.io.IOException
Throws:
java.io.IOException

writeArchiveInfoPart

protected void writeArchiveInfoPart(java.lang.String boundary,
                                    CrawlURI curi,
                                    ReplayInputStream ris,
                                    java.io.OutputStream out)
                             throws java.io.IOException
Throws:
java.io.IOException

writeHeaderPart

protected void writeHeaderPart(java.lang.String boundary,
                               ReplayInputStream ris,
                               java.io.OutputStream out)
                        throws java.io.IOException
Throws:
java.io.IOException

writeContentPart

protected void writeContentPart(java.lang.String boundary,
                                CrawlURI curi,
                                ReplayInputStream ris,
                                java.io.OutputStream out)
                         throws java.io.IOException
Throws:
java.io.IOException
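
The three write*Part methods share a boundary string and an output stream, which suggests a conventional MIME multipart layout. The sketch below shows that general technique only. The part order (archive info, header, content) follows the method listing above and is an assumption, as are the Content-Type values; byte arrays stand in for CrawlURI and ReplayInputStream, and the class MultipartSketch is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of multipart writing; not the Kw3 file format.
final class MultipartSketch {

    private static final String CRLF = "\r\n";

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    /** Writes one part: boundary line, part headers, blank line, body. */
    static void writePart(String boundary, String partHeaders, byte[] body,
            OutputStream out) throws IOException {
        out.write(ascii("--" + boundary + CRLF));
        out.write(ascii(partHeaders + CRLF + CRLF));
        out.write(body);
        out.write(ascii(CRLF));
    }

    /** Writes three parts followed by the closing boundary. */
    static byte[] writeThreeParts(String boundary, byte[] archiveInfo,
            byte[] header, byte[] content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            // The Content-Type values are placeholders chosen for this sketch.
            writePart(boundary, "Content-Type: text/plain", archiveInfo, out);
            writePart(boundary, "Content-Type: text/plain", header, out);
            writePart(boundary, "Content-Type: application/octet-stream", content, out);
            out.write(ascii("--" + boundary + "--" + CRLF));
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Passing the boundary to each method keeps the parts consistent without shared state, which matches the shape of the signatures documented above.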


Copyright © 2003-2011 Internet Archive. All Rights Reserved.