org.archive.crawler.extractor
Class HTTPContentDigest

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.extractor.HTTPContentDigest
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean

public class HTTPContentDigest
extends Processor

A processor for calculating custom HTTP content digests in place of the default (if any) computed by the HTTP fetcher processors.

This processor allows the user to specify a regular expression called strip-reg-expr. Any segment of a document that matches this regular expression will be rewritten with the blank character (character 32 in the ANSI character set) for the purpose of the digest. Only text documents are processed; binary files are skipped. The rewriting has no effect on the document for subsequent processing or archiving.
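
The stripping step described above can be illustrated with a minimal sketch. It is not the Heritrix implementation: the class and method names are hypothetical, SHA-1 and UTF-8 are assumed choices, and it assumes each match is replaced by a single space.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of Heritrix.
class StripDigestSketch {

    // Replace each match of stripRegExpr with a single space (character 32).
    // An empty expression leaves the body untouched.
    static String blankOut(String body, String stripRegExpr) {
        if (stripRegExpr.isEmpty()) {
            return body;
        }
        return Pattern.compile(stripRegExpr).matcher(body).replaceAll(" ");
    }

    // Hex-encoded digest of the stripped body. SHA-1 and UTF-8 are assumptions.
    static String strippedDigestHex(String body, String stripRegExpr) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] stripped = blankOut(body, stripRegExpr).getBytes(StandardCharsets.UTF_8);
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(stripped)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Two fetches of the same page that differ only in a generated timestamp.
        String re = "<!-- generated [0-9:]+ -->";
        String first = "<p>Hello</p><!-- generated 12:00:01 -->";
        String second = "<p>Hello</p><!-- generated 12:00:02 -->";
        System.out.println(strippedDigestHex(first, re).equals(strippedDigestHex(second, re)));
    }
}
```

In this sketch the two bodies produce the same digest once the timestamp comment is blanked out, which is the situation the strip-reg-expr setting is meant to handle. The original strings are never modified, mirroring the statement that stripping affects only the digest.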

NOTE: Content digest only accounts for the document body, not headers.

The operator can also specify a maximum length for documents evaluated by this processor. Documents exceeding that length will be ignored.
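
As a rough sketch of the length check, assuming the limit is compared against the document's content length in bytes and that -1 means unlimited (as stated for ATTR_MAX_SIZE_BYTES). The class and method names are hypothetical.

```java
// Hypothetical helper class for illustration; not part of Heritrix.
class MaxSizeSketch {

    // Sentinel meaning "no limit", per the ATTR_MAX_SIZE_BYTES description.
    static final long UNLIMITED = -1L;

    // True when a document of contentLength bytes should be evaluated,
    // false when it exceeds the configured maximum and should be ignored.
    static boolean withinLimit(long contentLength, long maxSizeBytes) {
        return maxSizeBytes == UNLIMITED || contentLength <= maxSizeBytes;
    }

    public static void main(String[] args) {
        System.out.println(withinLimit(2048L, 1024L));
        System.out.println(withinLimit(2048L, UNLIMITED));
    }
}
```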

To further discriminate by file type or URL, an operator should use the override and refinement options.

Because this recalculation is an expensive operation, it is generally recommended that it be performed only when absolutely needed, that is, when the document contains data that changes automatically each time the URL is fetched and must be stripped.

Author:
Kristinn Sigurdsson
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_MAX_SIZE_BYTES
          Maximum file size; longer files will be ignored.
static java.lang.String ATTR_STRIP_REG_EXPR
          A regular expression describing the elements to strip before computing the digest.
protected static java.lang.Long DEFAULT_MAX_SIZE_BYTES
           
protected static java.lang.String DEFAULT_STRIP_REG_EXPR
           
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Constructor Summary
HTTPContentDigest(java.lang.String name)
          Constructor
 
Method Summary
protected  void innerProcess(CrawlURI curi)
          Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

ATTR_STRIP_REG_EXPR

public static final java.lang.String ATTR_STRIP_REG_EXPR
A regular expression describing the elements to strip before computing the digest.

See Also:
Constant Field Values

DEFAULT_STRIP_REG_EXPR

protected static final java.lang.String DEFAULT_STRIP_REG_EXPR
See Also:
Constant Field Values

ATTR_MAX_SIZE_BYTES

public static final java.lang.String ATTR_MAX_SIZE_BYTES
Maximum file size; longer files will be ignored. -1 = unlimited

See Also:
Constant Field Values

DEFAULT_MAX_SIZE_BYTES

protected static final java.lang.Long DEFAULT_MAX_SIZE_BYTES
Constructor Detail

HTTPContentDigest

public HTTPContentDigest(java.lang.String name)
Constructor

Parameters:
name - Processor name
Method Detail

innerProcess

protected void innerProcess(CrawlURI curi)
                     throws java.lang.InterruptedException
Description copied from class: Processor
Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.

Overrides:
innerProcess in class Processor
Parameters:
curi - The CrawlURI being processed.
Throws:
java.lang.InterruptedException


Copyright © 2003-2011 Internet Archive. All Rights Reserved.