org.archive.crawler.writer
Class MirrorWriterProcessor

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.writer.MirrorWriterProcessor
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants

public class MirrorWriterProcessor
extends Processor
implements CoreAttributeConstants

Processor module that writes the results of successful fetches to files on disk. Writes contents of one URI to one file on disk. The files are arranged in a directory hierarchy based on the URI paths. In that sense they mirror the file hierarchy that might exist on the servers.
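As an illustration of the mirroring idea described above, here is a minimal sketch of mapping a URI to a relative file path. It is not the Heritrix implementation. The class `MirrorPathSketch`, the method `toRelativePath`, and the `directoryFile` parameter are hypothetical names, and the sketch assumes that the host becomes the top-level directory and that a path ending in a slash receives a placeholder file name.

```java
import java.net.URI;

/** Hypothetical, simplified illustration; not part of Heritrix. */
class MirrorPathSketch {

    /**
     * Maps a URI to a relative path of the form host/path.
     * An empty path or a path ending in "/" has no file name of its own,
     * so the caller-supplied directoryFile name is appended (an assumption
     * made for this sketch).
     */
    static String toRelativePath(String uri, String directoryFile) {
        URI u = URI.create(uri);
        String host = u.getHost();
        String path = u.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (path.endsWith("/")) {
            path = path + directoryFile;
        }
        return host + path;
    }

    public static void main(String[] args) {
        // Prints example.com/docs/a.html
        System.out.println(toRelativePath("http://example.com/docs/a.html", "index.html"));
        // Prints example.com/docs/index.html
        System.out.println(toRelativePath("http://example.com/docs/", "index.html"));
    }
}
```

The sketch ignores ports, query strings, character restrictions, and length limits, all of which a real mapping would have to handle.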

Mapping URIs to files raises a number of issues. This class tries very hard to map each URI into a file system path that obeys all file system constraints and yet reasonably represents the original URI.
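To make the idea of obeying file system constraints concrete, the following sketch shows one possible way to adapt a single URI path segment. It is an assumption-laden illustration, not the algorithm this class uses: `SegmentSketch`, `sanitize`, `illegalChars`, and `maxLen` are hypothetical names, and the strategy (underscore replacement followed by truncation) is chosen only for simplicity.

```java
/** Hypothetical, simplified illustration; not part of Heritrix. */
class SegmentSketch {

    /**
     * Replaces every character listed in illegalChars with '_' and then
     * truncates the result to at most maxLen characters.
     */
    static String sanitize(String segment, String illegalChars, int maxLen) {
        StringBuilder sb = new StringBuilder(segment.length());
        for (int i = 0; i < segment.length(); i++) {
            char c = segment.charAt(i);
            // Substitute characters the target file system would reject.
            sb.append(illegalChars.indexOf(c) >= 0 ? '_' : c);
        }
        // Enforce the maximum segment length by plain truncation.
        if (sb.length() > maxLen) {
            sb.setLength(maxLen);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints a_b_c
        System.out.println(sanitize("a:b*c", ":*", 255));
        // Prints abcde
        System.out.println(sanitize("abcdefgh", "", 5));
    }
}
```

Plain replacement and truncation can map two different URIs to the same file name. The sketch does not address that collision problem.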

There would normally be a single instance of this class per Heritrix instance. This class is thread-safe; any number of threads can be in its innerProcess method at once. However, conflicts can still arise in the file system. For example, if several threads try to create the same directory at the same time, only one can succeed. Therefore, there should be at most one access to a given server at any time.
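The directory-creation conflict mentioned above can be sketched as follows. This is a hypothetical helper (`DirCreateSketch.ensureDirectory` is not a Heritrix method) that treats "another thread created the directory first" as success. It relies on the documented behavior of `java.io.File.mkdirs()`, which returns false when the directory already exists.

```java
import java.io.File;

/** Hypothetical, simplified illustration; not part of Heritrix. */
class DirCreateSketch {

    /**
     * Returns true if dir exists as a directory when the call completes,
     * whether this thread or some other thread created it.
     */
    static boolean ensureDirectory(File dir) {
        // mkdirs() returns false both on real failure and when the
        // directory is already present, so re-check with isDirectory().
        return dir.mkdirs() || dir.isDirectory();
    }

    public static void main(String[] args) {
        File base = new File(System.getProperty("java.io.tmpdir"),
                "mirror-sketch-" + System.nanoTime());
        File nested = new File(base, "example.com/docs");
        System.out.println(ensureDirectory(nested)); // creates the directories
        System.out.println(ensureDirectory(nested)); // already present, still succeeds
    }
}
```

A check of this kind only tolerates the race for directories. It does not protect two threads writing the same file, which is one reason to limit concurrent access to a single server.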

Author:
Howard Lee Gayle
See Also:
Serialized Form

Nested Class Summary
(package private)  class MirrorWriterProcessor.DirSegment
          This class represents one directory segment (component) of a URI path.
(package private)  class MirrorWriterProcessor.EndSegment
          This class represents the last segment (component) of a URI path.
(package private)  class MirrorWriterProcessor.LumpyString
          This class represents a dynamically growable string consisting of substrings ("lumps") that are treated atomically.
(package private)  class MirrorWriterProcessor.PathSegment
          This class represents one segment (component) of a URI path.
(package private)  class MirrorWriterProcessor.URIToFileReturn
          This class is returned by uriToFile.
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_CASE_SENSITIVE
          Key to use asking settings for case sensitive option.
static java.lang.String ATTR_CHAR_MAP
          Key to use asking settings for character map.
static java.lang.String ATTR_CONTENT_TYPE_MAP
          Key to use asking settings for content type map.
static java.lang.String ATTR_DIRECTORY_FILE
          Key to use asking settings for directory file.
static java.lang.String ATTR_DOT_BEGIN
          Key to use asking settings for dot begin replacement.
static java.lang.String ATTR_DOT_END
          Key to use asking settings for dot end replacement.
static java.lang.String ATTR_HOST_DIRECTORY
          Key to use asking settings for host directory option.
static java.lang.String ATTR_HOST_MAP
          Key to use asking settings for host map.
static java.lang.String ATTR_MAX_PATH_LEN
          Key to use asking settings for maximum file system path length.
static java.lang.String ATTR_MAX_SEG_LEN
          Key to use asking settings for maximum file system path segment length.
static java.lang.String ATTR_PATH
          Key to use asking settings for base directory path value.
static java.lang.String ATTR_PORT_DIRECTORY
          Key to use asking settings for port directory option.
static java.lang.String ATTR_SUFFIX_AT_END
          Key to use asking settings for suffix at end option.
static java.lang.String ATTR_TOO_LONG_DIRECTORY
          Key to use asking settings for too-long directory.
static java.lang.String ATTR_UNDERSCORE_SET
          Key to use asking settings for underscore set.
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Constructor Summary
MirrorWriterProcessor(java.lang.String name)
           
 
Method Summary
protected  void innerProcess(CrawlURI curi)
          Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

ATTR_CASE_SENSITIVE

public static final java.lang.String ATTR_CASE_SENSITIVE
Key to use asking settings for case sensitive option.

See Also:
Constant Field Values

ATTR_CHAR_MAP

public static final java.lang.String ATTR_CHAR_MAP
Key to use asking settings for character map.

See Also:
Constant Field Values

ATTR_CONTENT_TYPE_MAP

public static final java.lang.String ATTR_CONTENT_TYPE_MAP
Key to use asking settings for content type map.

See Also:
Constant Field Values

ATTR_DOT_BEGIN

public static final java.lang.String ATTR_DOT_BEGIN
Key to use asking settings for dot begin replacement.

See Also:
Constant Field Values

ATTR_DOT_END

public static final java.lang.String ATTR_DOT_END
Key to use asking settings for dot end replacement.

See Also:
Constant Field Values

ATTR_DIRECTORY_FILE

public static final java.lang.String ATTR_DIRECTORY_FILE
Key to use asking settings for directory file.

See Also:
Constant Field Values

ATTR_HOST_DIRECTORY

public static final java.lang.String ATTR_HOST_DIRECTORY
Key to use asking settings for host directory option.

See Also:
Constant Field Values

ATTR_HOST_MAP

public static final java.lang.String ATTR_HOST_MAP
Key to use asking settings for host map.

See Also:
Constant Field Values

ATTR_MAX_PATH_LEN

public static final java.lang.String ATTR_MAX_PATH_LEN
Key to use asking settings for maximum file system path length.

See Also:
Constant Field Values

ATTR_MAX_SEG_LEN

public static final java.lang.String ATTR_MAX_SEG_LEN
Key to use asking settings for maximum file system path segment length.

See Also:
Constant Field Values

ATTR_PATH

public static final java.lang.String ATTR_PATH
Key to use asking settings for base directory path value.

See Also:
Constant Field Values

ATTR_PORT_DIRECTORY

public static final java.lang.String ATTR_PORT_DIRECTORY
Key to use asking settings for port directory option.

See Also:
Constant Field Values

ATTR_SUFFIX_AT_END

public static final java.lang.String ATTR_SUFFIX_AT_END
Key to use asking settings for suffix at end option.

See Also:
Constant Field Values

ATTR_TOO_LONG_DIRECTORY

public static final java.lang.String ATTR_TOO_LONG_DIRECTORY
Key to use asking settings for too-long directory.

See Also:
Constant Field Values

ATTR_UNDERSCORE_SET

public static final java.lang.String ATTR_UNDERSCORE_SET
Key to use asking settings for underscore set.

See Also:
Constant Field Values
Constructor Detail

MirrorWriterProcessor

public MirrorWriterProcessor(java.lang.String name)
Parameters:
name - Name of this processor.
Method Detail

innerProcess

protected void innerProcess(CrawlURI curi)
Description copied from class: Processor
Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.

Overrides:
innerProcess in class Processor
Parameters:
curi - The CrawlURI being processed.


Copyright © 2003-2011 Internet Archive. All Rights Reserved.