org.archive.crawler.prefetch
Class PreconditionEnforcer

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.prefetch.PreconditionEnforcer
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants, FetchStatusCodes

public class PreconditionEnforcer
extends Processor
implements CoreAttributeConstants, FetchStatusCodes

Ensures the preconditions for a fetch -- such as DNS lookup or acquiring and respecting a robots.txt policy -- are satisfied before a URI is passed to subsequent stages.
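The gating idea described above can be sketched without any Heritrix classes. This is a minimal illustration, not the class's actual logic: `PreconditionGateSketch`, `Outcome`, and `check` are hypothetical names, and checking DNS before robots.txt is an assumption.

```java
/** Hypothetical sketch of a precondition gate; not part of Heritrix. */
final class PreconditionGateSketch {

    /** Illustrative outcomes; these are not Heritrix fetch status codes. */
    enum Outcome { NEEDS_DNS, NEEDS_ROBOTS, PROCEED }

    /**
     * Decides whether a URI may move on to later stages.
     *
     * @param ipFresh     true if a usable, unexpired IP is known for the host
     * @param robotsFresh true if an unexpired robots.txt policy is known
     */
    static Outcome check(boolean ipFresh, boolean robotsFresh) {
        if (!ipFresh) {
            return Outcome.NEEDS_DNS;      // assumed order: resolve the host first
        }
        if (!robotsFresh) {
            return Outcome.NEEDS_ROBOTS;   // then obtain the robots.txt policy
        }
        return Outcome.PROCEED;            // all preconditions satisfied
    }

    public static void main(String[] args) {
        System.out.println(check(false, false));
        System.out.println(check(true, false));
        System.out.println(check(true, true));
    }
}
```

In this sketch a URI that fails a check is held back until the missing prerequisite is available, which mirrors the "satisfied before a URI is passed to subsequent stages" wording.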

Author:
gojomo
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_CALCULATE_ROBOTS_ONLY
           
static java.lang.String ATTR_IP_VALIDITY_DURATION
          Number of seconds to keep IP information.
static java.lang.String ATTR_ROBOTS_VALIDITY_DURATION
          Number of seconds to cache robots.txt information.
static java.lang.Boolean DEFAULT_CALCULATE_ROBOTS_ONLY
          Whether to calculate robots exclusion without applying it.
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
 
Constructor Summary
PreconditionEnforcer(java.lang.String name)
           
 
Method Summary
 long getIPValidityDuration(CrawlURI curi)
          Get the maximum time a DNS record is valid.
 long getRobotsValidityDuration(CrawlURI curi)
          Get the maximum time a robots.txt is valid.
protected  void innerProcess(CrawlURI curi)
          Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.
 boolean isIpExpired(CrawlURI curi)
          Returns true if the IP should be looked up.
 boolean isRobotsExpired(CrawlURI curi)
          Returns true if the robots policy has expired.
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

ATTR_IP_VALIDITY_DURATION

public static final java.lang.String ATTR_IP_VALIDITY_DURATION
Number of seconds to keep IP information.

See Also:
Constant Field Values

ATTR_ROBOTS_VALIDITY_DURATION

public static final java.lang.String ATTR_ROBOTS_VALIDITY_DURATION
Number of seconds to cache robots.txt information.

See Also:
Constant Field Values

DEFAULT_CALCULATE_ROBOTS_ONLY

public static final java.lang.Boolean DEFAULT_CALCULATE_ROBOTS_ONLY
Whether to calculate robots exclusion without applying it.
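One reading of this flag, offered as an assumption and not as documented behaviour, is that the exclusion decision is still computed but no longer blocks the fetch. The sketch below illustrates that reading; `CalculateRobotsOnlySketch` and `shouldBlock` are hypothetical names.

```java
/** Hypothetical sketch of "calculate only" semantics; not part of Heritrix. */
final class CalculateRobotsOnlySketch {

    /**
     * @param excludedByRobots result of evaluating the robots.txt policy
     * @param calculateOnly    assumed meaning: compute the result but do not enforce it
     * @return true if the fetch should be blocked
     */
    static boolean shouldBlock(boolean excludedByRobots, boolean calculateOnly) {
        // The exclusion is only enforced when calculate-only mode is off.
        return excludedByRobots && !calculateOnly;
    }

    public static void main(String[] args) {
        System.out.println(shouldBlock(true, false));
        System.out.println(shouldBlock(true, true));
    }
}
```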


ATTR_CALCULATE_ROBOTS_ONLY

public static final java.lang.String ATTR_CALCULATE_ROBOTS_ONLY
See Also:
Constant Field Values
Constructor Detail

PreconditionEnforcer

public PreconditionEnforcer(java.lang.String name)
Method Detail

innerProcess

protected void innerProcess(CrawlURI curi)
Description copied from class: Processor
Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.

Overrides:
innerProcess in class Processor
Parameters:
curi - The CrawlURI being processed.

getIPValidityDuration

public long getIPValidityDuration(CrawlURI curi)
Get the maximum time a DNS record is valid.

Parameters:
curi - the URI this duration applies to.
Returns:
the maximum time, in seconds, that a DNS record is valid, or a negative value if the record's TTL should be used.
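The return convention can be illustrated with a small standalone sketch. It follows only the convention stated above (a negative value defers to the record's TTL); the real `isIpExpired` may apply additional rules. `IpValiditySketch`, `effectiveValiditySeconds`, and `isExpired` are hypothetical names, and timestamps are assumed to be in milliseconds.

```java
/** Hypothetical sketch of the documented return convention; not part of Heritrix. */
final class IpValiditySketch {

    /** Uses the configured duration unless it is negative, in which case the record's TTL applies. */
    static long effectiveValiditySeconds(long configuredSeconds, long recordTtlSeconds) {
        return configuredSeconds >= 0 ? configuredSeconds : recordTtlSeconds;
    }

    /** Compares a millisecond timestamp against a validity window given in seconds. */
    static boolean isExpired(long fetchedAtMillis, long validitySeconds, long nowMillis) {
        long expiresAtMillis = fetchedAtMillis + validitySeconds * 1000L; // seconds to millis
        return nowMillis >= expiresAtMillis;
    }

    public static void main(String[] args) {
        long validity = effectiveValiditySeconds(-1, 300); // negative: fall back to a 300 s TTL
        long fetchedAt = System.currentTimeMillis();
        System.out.println("validity seconds: " + validity);
        System.out.println("expired now: " + isExpired(fetchedAt, validity, fetchedAt));
    }
}
```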

isIpExpired

public boolean isIpExpired(CrawlURI curi)
Returns true if the IP should be looked up.

Parameters:
curi - the URI to check.
Returns:
true if the IP should be looked up.

getRobotsValidityDuration

public long getRobotsValidityDuration(CrawlURI curi)
Get the maximum time a robots.txt is valid.

Parameters:
curi - the URI this duration applies to.
Returns:
the time, in milliseconds, that a robots.txt is valid.

isRobotsExpired

public boolean isRobotsExpired(CrawlURI curi)
Returns true if the robots policy has expired. This method also returns true if no attempt has yet been made to fetch the robots.txt for this server.

Parameters:
curi - the URI whose server's robots policy is checked.
Returns:
true if the robots policy is expired.
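A minimal sketch of the two cases described above, under stated assumptions: the validity setting is held in seconds and converted to milliseconds, and a sentinel marks a server whose robots.txt has never been requested. `RobotsExpirySketch`, `NEVER_FETCHED`, and the method names are hypothetical, and the real method may handle further cases.

```java
/** Hypothetical sketch of robots-policy expiry; not part of Heritrix. */
final class RobotsExpirySketch {

    /** Assumed sentinel meaning "robots.txt has never been requested for this server". */
    static final long NEVER_FETCHED = -1L;

    /** Converts a setting held in seconds to milliseconds. */
    static long toMillis(long seconds) {
        return seconds * 1000L;
    }

    /**
     * @param fetchedAtMillis when the policy was last fetched, or NEVER_FETCHED
     * @param validityMillis  how long a fetched policy stays valid
     * @param nowMillis       the current time
     */
    static boolean isRobotsExpired(long fetchedAtMillis, long validityMillis, long nowMillis) {
        if (fetchedAtMillis == NEVER_FETCHED) {
            return true; // never tried: treat as expired so a fetch is scheduled
        }
        return nowMillis > fetchedAtMillis + validityMillis;
    }

    public static void main(String[] args) {
        long validity = toMillis(86_400); // one day, expressed in seconds
        long now = System.currentTimeMillis();
        System.out.println(isRobotsExpired(NEVER_FETCHED, validity, now));
        System.out.println(isRobotsExpired(now, validity, now));
    }
}
```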


Copyright © 2003-2011 Internet Archive. All Rights Reserved.