org.archive.crawler.postprocessor
Class ContentBasedWaitEvaluator

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.postprocessor.WaitEvaluator
                          extended by org.archive.crawler.postprocessor.ContentBasedWaitEvaluator
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants, AdaptiveRevisitAttributeConstants
Direct Known Subclasses:
ImageWaitEvaluator, TextWaitEvaluator

public class ContentBasedWaitEvaluator
extends WaitEvaluator

A WaitEvaluator that compares the CrawlURI's content type to a configurable regular expression. If the content type matches, the wait evaluation is performed. Otherwise the processor passes the CrawlURI on without doing anything.
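
The content-type gating described above can be sketched roughly as follows. This is a standalone illustration, not the Heritrix source: the class ContentTypeGate, the method shouldEvaluate, and the sample pattern are hypothetical names chosen for this example. It also assumes the regular expression must match the whole content type string; the actual evaluator may match differently.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix: decides whether a content type
// passes a configurable regular expression.
class ContentTypeGate {
    private final Pattern contentPattern;

    ContentTypeGate(String contentRegExpr) {
        // Compile once; the expression plays the role of the configured setting.
        this.contentPattern = Pattern.compile(contentRegExpr);
    }

    // Returns true when the wait evaluation should run for this content type.
    // Assumption: a full match is required; a missing content type never matches.
    boolean shouldEvaluate(String contentType) {
        if (contentType == null) {
            return false;
        }
        return contentPattern.matcher(contentType).matches();
    }

    public static void main(String[] args) {
        ContentTypeGate gate = new ContentTypeGate("^text/.*$");
        System.out.println(gate.shouldEvaluate("text/html"));
        System.out.println(gate.shouldEvaluate("image/png"));
    }
}
```

In this sketch a non-matching content type simply yields false, which corresponds to the processor passing the CrawlURI on untouched.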

Author:
Kristinn Sigurdsson
See Also:
WaitEvaluator, Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_CONTENT_REGEXPR
          The regular expression that restricts which content types this evaluator applies to.
protected static java.lang.String DEFAULT_CONTENT_REGEXPR
           
 
Fields inherited from class org.archive.crawler.postprocessor.WaitEvaluator
ATTR_CHANGED_FACTOR, ATTR_DEFAULT_WAIT_INTERVAL, ATTR_INITIAL_WAIT_INTERVAL, ATTR_MAX_WAIT_INTERVAL, ATTR_MIN_WAIT_INTERVAL, ATTR_UNCHANGED_FACTOR, ATTR_USE_OVERDUE_TIME, DEFAULT_CHANGED_FACTOR, DEFAULT_DEFAULT_WAIT_INTERVAL, DEFAULT_INITIAL_WAIT_INTERVAL, DEFAULT_MAX_WAIT_INTERVAL, DEFAULT_MIN_WAIT_INTERVAL, DEFAULT_UNCHANGED_FACTOR, DEFAULT_USE_OVERDUE_TIME, logger
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.frontier.AdaptiveRevisitAttributeConstants
A_CONTENT_STATE_KEY, A_DISCARD_REVISIT, A_FETCH_OVERDUE, A_LAST_CONTENT_DIGEST, A_LAST_DATESTAMP, A_LAST_ETAG, A_NUMBER_OF_VERSIONS, A_NUMBER_OF_VISITS, A_TIME_OF_NEXT_PROCESSING, A_WAIT_INTERVAL, A_WAIT_REEVALUATED, CONTENT_CHANGED, CONTENT_UNCHANGED, CONTENT_UNKNOWN
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Constructor Summary
ContentBasedWaitEvaluator(java.lang.String name)
          Constructor
ContentBasedWaitEvaluator(java.lang.String name, java.lang.String description, java.lang.String defaultRegExpr, java.lang.Long default_inital_wait_interval, java.lang.Long default_max_wait_interval, java.lang.Long default_min_wait_interval, java.lang.Double default_unchanged_factor, java.lang.Double default_changed_factor)
          Constructor
 
Method Summary
protected  void innerProcess(CrawlURI curi)
          Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

ATTR_CONTENT_REGEXPR

public static final java.lang.String ATTR_CONTENT_REGEXPR
The regular expression that restricts which content types this evaluator applies to.

See Also:
Constant Field Values

DEFAULT_CONTENT_REGEXPR

protected static final java.lang.String DEFAULT_CONTENT_REGEXPR
See Also:
Constant Field Values

Constructor Detail

ContentBasedWaitEvaluator

public ContentBasedWaitEvaluator(java.lang.String name)
Constructor

Parameters:
name - The name of the module

ContentBasedWaitEvaluator

public ContentBasedWaitEvaluator(java.lang.String name,
                                 java.lang.String description,
                                 java.lang.String defaultRegExpr,
                                 java.lang.Long default_inital_wait_interval,
                                 java.lang.Long default_max_wait_interval,
                                 java.lang.Long default_min_wait_interval,
                                 java.lang.Double default_unchanged_factor,
                                 java.lang.Double default_changed_factor)
Constructor

Parameters:
name - The name of the module
description - Description of the module
defaultRegExpr - The default regular expression to compare content types against
default_inital_wait_interval - The default value for initial wait time
default_max_wait_interval - The maximum value for wait time
default_min_wait_interval - The minimum value for wait time
default_unchanged_factor - The factor applied to the wait time of unchanged documents (the wait time is multiplied by this value)
default_changed_factor - The factor applied to the wait time of changed documents (the wait time is divided by this value)
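
As a rough illustration of how the two factors and the interval bounds might interact, the sketch below multiplies the wait time for unchanged documents, divides it for changed documents, and then clamps the result. It is not the WaitEvaluator implementation: the class WaitIntervalSketch, the method nextWaitInterval, the sample numbers, and the assumption that the result is clamped to the minimum and maximum are all hypothetical.

```java
// Hypothetical sketch, not part of Heritrix: adjusts a wait interval based on
// whether the document changed since the last visit.
class WaitIntervalSketch {

    static long nextWaitInterval(long currentWait, boolean contentChanged,
                                 double unchangedFactor, double changedFactor,
                                 long minWait, long maxWait) {
        // Changed documents are revisited sooner (divide); unchanged ones later (multiply).
        double adjusted = contentChanged
                ? currentWait / changedFactor
                : currentWait * unchangedFactor;
        // Assumption: the adjusted value is kept within [minWait, maxWait].
        long truncated = (long) adjusted;
        return Math.max(minWait, Math.min(maxWait, truncated));
    }

    public static void main(String[] args) {
        // Unchanged document: the interval grows.
        System.out.println(nextWaitInterval(100, false, 1.5, 2.0, 60, 1000));
        // Changed document: the interval shrinks, but not below the minimum.
        System.out.println(nextWaitInterval(100, true, 1.5, 2.0, 60, 1000));
    }
}
```

The numbers are unitless here on purpose; the units of the real wait interval settings are defined by WaitEvaluator.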

Method Detail

innerProcess

protected void innerProcess(CrawlURI curi)
                     throws java.lang.InterruptedException
Description copied from class: Processor
Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.

Overrides:
innerProcess in class WaitEvaluator
Parameters:
curi - The CrawlURI being processed.
Throws:
java.lang.InterruptedException


Copyright © 2003-2011 Internet Archive. All Rights Reserved.