org.archive.crawler.extractor
Class HTTPContentDigest
java.lang.Object
javax.management.Attribute
org.archive.crawler.settings.Type
org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ModuleType
org.archive.crawler.framework.Processor
org.archive.crawler.extractor.HTTPContentDigest
- All Implemented Interfaces:
- java.io.Serializable, javax.management.DynamicMBean
public class HTTPContentDigest
- extends Processor
A processor for calculating custom HTTP content digests in place of the
default (if any) computed by the HTTP fetcher processors.
This processor allows the user to specify a regular expression called
strip-reg-expr. Any segment of a document (text only; binary files will
be skipped) that matches this regular expression will be rewritten with
the blank character (character 32 in the ANSI character set) for the
purpose of the digest. This has no effect on the document for subsequent
processing or archiving.
NOTE: Content digest only accounts for the document body, not headers.
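The stripping step described above can be illustrated with a minimal, standalone sketch. It is not the Heritrix implementation: the class name StripDigestSketch, the method digestWithStrip, the SHA-1 algorithm, the UTF-8 encoding, and the sample regular expression are all assumptions chosen for illustration. It only shows why replacing matches with a blank lets two fetches that differ in an automatically changing segment produce the same digest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Heritrix.
public class StripDigestSketch {

    // Replaces every match of stripRegExpr with a single blank (character 32)
    // in a copy of the body, then digests the copy. The original body string
    // is left untouched, mirroring "no effect on the document".
    static byte[] digestWithStrip(String body, String stripRegExpr) {
        String forDigest = body;
        if (stripRegExpr != null && !stripRegExpr.isEmpty()) {
            forDigest = Pattern.compile(stripRegExpr).matcher(body).replaceAll(" ");
        }
        try {
            // SHA-1 and UTF-8 are assumptions made for this sketch.
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            return md.digest(forDigest.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed example: a timestamp comment that changes on every fetch.
        String regex = "<!-- generated: .*? -->";
        String first = "<html><!-- generated: 10:01:07 -->Hello</html>";
        String second = "<html><!-- generated: 10:02:45 -->Hello</html>";

        // With stripping, both fetches yield the same digest.
        System.out.println(Arrays.equals(
                digestWithStrip(first, regex), digestWithStrip(second, regex)));
        // Without stripping, the digests differ.
        System.out.println(Arrays.equals(
                digestWithStrip(first, null), digestWithStrip(second, null)));
    }
}
```

Only the document body is passed in here, in line with the note that headers are not part of the content digest.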
The operator can also specify a maximum length for documents
being evaluated by this processor. Documents exceeding that length will be
ignored.
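The size limit can be read as a simple guard, sketched below. The class and method names are hypothetical and the check is an assumption about how such a limit would typically be applied, combining the "documents exceeding that length will be ignored" rule with the "-1 = unlimited" convention stated for ATTR_MAX_SIZE_BYTES.

```java
// Hypothetical illustration only; not part of Heritrix.
public class MaxSizeSketch {

    // Returns true when a document of contentLength bytes should have its
    // digest recalculated. A negative limit (such as -1) means unlimited;
    // otherwise documents strictly longer than the limit are skipped.
    static boolean shouldRecalculate(long contentLength, long maxSizeBytes) {
        return maxSizeBytes < 0 || contentLength <= maxSizeBytes;
    }

    public static void main(String[] args) {
        long limit = 1000L; // assumed example value
        System.out.println(shouldRecalculate(999L, limit));
        System.out.println(shouldRecalculate(1001L, limit));
        System.out.println(shouldRecalculate(5_000_000L, -1L));
    }
}
```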
To further discriminate by file type or URL, an operator should use the
override and refinement options.
It is generally recommended that this recalculation only be performed when
absolutely needed (because of stripping data that changes automatically each
time the URL is fetched) as this is an expensive operation.
- Author:
- Kristinn Sigurdsson
- See Also:
- Serialized Form
Method Summary
protected void
innerProcess(CrawlURI curi)
Classes subclassing this one should override this method to perform
their custom actions on the CrawlURI.
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
Methods inherited from class javax.management.Attribute
getName, hashCode
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
ATTR_STRIP_REG_EXPR
public static final java.lang.String ATTR_STRIP_REG_EXPR
- A regular expression describing the elements to strip before computing the digest.
- See Also:
- Constant Field Values
DEFAULT_STRIP_REG_EXPR
protected static final java.lang.String DEFAULT_STRIP_REG_EXPR
- See Also:
- Constant Field Values
ATTR_MAX_SIZE_BYTES
public static final java.lang.String ATTR_MAX_SIZE_BYTES
- Maximum file size to process; longer files will be ignored. -1 = unlimited
- See Also:
- Constant Field Values
DEFAULT_MAX_SIZE_BYTES
protected static final java.lang.Long DEFAULT_MAX_SIZE_BYTES
HTTPContentDigest
public HTTPContentDigest(java.lang.String name)
- Constructor
- Parameters:
name
- Processor name
innerProcess
protected void innerProcess(CrawlURI curi)
throws java.lang.InterruptedException
- Description copied from class:
Processor
- Classes subclassing this one should override this method to perform
their custom actions on the CrawlURI.
- Overrides:
innerProcess
in class Processor
- Parameters:
curi
- The CrawlURI being processed.
- Throws:
java.lang.InterruptedException
Copyright © 2003-2011 Internet Archive. All Rights Reserved.