org.archive.crawler.datamodel
Class RobotsHonoringPolicy

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.datamodel.RobotsHonoringPolicy
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean

public class RobotsHonoringPolicy
extends ModuleType

RobotsHonoringPolicy represents the strategy the crawler uses to determine how robots.txt files are honored. Five kinds of policies exist:

classic:
obey the first set of robots.txt directives that apply to your current user-agent
ignore:
ignore robots.txt directives entirely
custom:
obey a specific operator-entered set of robots.txt directives for a given host
most-favored:
obey the most liberal restrictions offered (if *any* crawler is allowed to get a page, get it)
most-favored-set:
given some set of user-agent patterns, obey the most liberal restriction offered to any
The last two policies can optionally adopt a different user-agent to reflect the restrictions the crawler has opted to use.
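
The five policies described above can be compared with a minimal sketch. This is not the Heritrix implementation: the class PolicySketch, its Policy enum, the mayFetch and firstMatch helpers, and the simplified rule map (user-agent token mapped to a single allow/deny flag) are all assumptions made for illustration, and real robots.txt handling works on path prefixes rather than one flag per agent.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration only; not part of Heritrix.
final class PolicySketch {
    enum Policy { CLASSIC, IGNORE, CUSTOM, MOST_FAVORED, MOST_FAVORED_SET }

    // Each map goes from a user-agent token to "may this agent fetch the page?",
    // kept in the order the entries appear in the robots.txt file.
    static boolean mayFetch(Policy policy,
                            Map<String, Boolean> siteRules,
                            Map<String, Boolean> customRules,
                            String ourAgent,
                            List<String> agentSet) {
        switch (policy) {
            case IGNORE:
                return true; // directives are not consulted at all
            case CLASSIC:
                return firstMatch(siteRules, ourAgent);
            case CUSTOM:
                return firstMatch(customRules, ourAgent); // operator-entered rules
            case MOST_FAVORED:
                // allowed if any crawler at all is allowed
                return siteRules.isEmpty() || siteRules.containsValue(Boolean.TRUE);
            case MOST_FAVORED_SET:
                // allowed if any agent in the configured set is allowed;
                // an empty set grants nothing in this sketch
                for (String agent : agentSet) {
                    if (firstMatch(siteRules, agent)) {
                        return true;
                    }
                }
                return false;
            default:
                throw new AssertionError(policy);
        }
    }

    // Returns the flag of the first entry whose token is "*" or occurs in the
    // agent string (case-insensitive); allows the fetch when nothing matches.
    static boolean firstMatch(Map<String, Boolean> rules, String agent) {
        String lower = agent.toLowerCase();
        for (Map.Entry<String, Boolean> e : rules.entrySet()) {
            String token = e.getKey().toLowerCase();
            if (token.equals("*") || lower.contains(token)) {
                return e.getValue();
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Boolean> site = new LinkedHashMap<>();
        site.put("heritrix", false);
        site.put("googlebot", true);
        site.put("*", false);
        Map<String, Boolean> custom = new LinkedHashMap<>();
        custom.put("*", true);
        List<String> set = Arrays.asList("googlebot");
        for (Policy p : Policy.values()) {
            System.out.println(p + ": " + mayFetch(p, site, custom, "heritrix/1.14", set));
        }
    }
}
```

In this sketch only the classic policy refuses the fetch for the sample rules, because it stops at the first entry matching the crawler's own user-agent, while the most-favored variants look for any entry that grants access.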

Author:
John Erik Halse
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_CUSTOM_ROBOTS
           
static java.lang.String ATTR_MASQUERADE
           
static java.lang.String ATTR_NAME
           
static java.lang.String ATTR_TYPE
           
static java.lang.String ATTR_USER_AGENTS
           
static int CLASSIC
           
static int CUSTOM
           
static int IGNORE
           
static int MOST_FAVORED
           
static int MOST_FAVORED_SET
           
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Constructor Summary
RobotsHonoringPolicy()
           
RobotsHonoringPolicy(java.lang.String name)
          Creates a new instance of RobotsHonoringPolicy.
 
Method Summary
 java.lang.String getCustomRobots(CrawlerSettings settings)
          Get the supplied custom robots.txt
 int getType(java.lang.Object context)
          Get the policy-type.
 StringList getUserAgents(CrawlerSettings settings)
          If the policy type is most-favored-set, gets the list of all user agents in that set.
 boolean isType(java.lang.Object o, int type)
          Check if policy is of a certain type.
 boolean shouldMasquerade(CrawlURI curi)
          Returns true if the crawler should masquerade as the user agent whose restrictions it opted to use.
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

CLASSIC

public static final int CLASSIC
See Also:
Constant Field Values

IGNORE

public static final int IGNORE
See Also:
Constant Field Values

CUSTOM

public static final int CUSTOM
See Also:
Constant Field Values

MOST_FAVORED

public static final int MOST_FAVORED
See Also:
Constant Field Values

MOST_FAVORED_SET

public static final int MOST_FAVORED_SET
See Also:
Constant Field Values

ATTR_NAME

public static final java.lang.String ATTR_NAME
See Also:
Constant Field Values

ATTR_TYPE

public static final java.lang.String ATTR_TYPE
See Also:
Constant Field Values

ATTR_MASQUERADE

public static final java.lang.String ATTR_MASQUERADE
See Also:
Constant Field Values

ATTR_CUSTOM_ROBOTS

public static final java.lang.String ATTR_CUSTOM_ROBOTS
See Also:
Constant Field Values

ATTR_USER_AGENTS

public static final java.lang.String ATTR_USER_AGENTS
See Also:
Constant Field Values

Constructor Detail

RobotsHonoringPolicy

public RobotsHonoringPolicy(java.lang.String name)
Creates a new instance of RobotsHonoringPolicy.

Parameters:
name - the name of the RobotsHonoringPolicy attribute.

RobotsHonoringPolicy

public RobotsHonoringPolicy()

Method Detail

getUserAgents

public StringList getUserAgents(CrawlerSettings settings)
If the policy type is most-favored-set, gets the list of all user agents in that set.

Returns:
List of Strings with user agents

shouldMasquerade

public boolean shouldMasquerade(CrawlURI curi)
Returns true if the crawler should masquerade as the user agent whose restrictions it opted to use. (Only relevant for the policy types most-favored and most-favored-set.)

Returns:
true if we should masquerade
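
How a caller might act on the masquerade decision can be sketched as follows. The class MasqueradeSketch, its Policy enum, and the userAgentToSend helper are hypothetical and not part of Heritrix; the sketch assumes only what is stated above, namely that masquerading matters solely for the most-favored and most-favored-set policies.

```java
// Hypothetical stand-in for illustration only; not part of Heritrix.
final class MasqueradeSketch {
    enum Policy { CLASSIC, IGNORE, CUSTOM, MOST_FAVORED, MOST_FAVORED_SET }

    // Picks the User-Agent value to send: the favored agent when the policy
    // is one of the two most-favored variants, masquerading is switched on,
    // and a favored agent is known; otherwise the crawler's own agent.
    static String userAgentToSend(Policy policy,
                                  boolean masquerade,
                                  String ownAgent,
                                  String favoredAgent) {
        boolean relevant = policy == Policy.MOST_FAVORED
                || policy == Policy.MOST_FAVORED_SET;
        if (relevant && masquerade && favoredAgent != null) {
            return favoredAgent;
        }
        return ownAgent;
    }

    public static void main(String[] args) {
        System.out.println(userAgentToSend(Policy.MOST_FAVORED, true, "heritrix", "googlebot"));
        System.out.println(userAgentToSend(Policy.CLASSIC, true, "heritrix", "googlebot"));
    }
}
```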

getCustomRobots

public java.lang.String getCustomRobots(CrawlerSettings settings)
Get the supplied custom robots.txt

Returns:
String with content of alternate robots.txt

getType

public int getType(java.lang.Object context)
Get the policy-type.

Returns:
policy type
See Also:
CLASSIC, IGNORE, CUSTOM, MOST_FAVORED, MOST_FAVORED_SET

isType

public boolean isType(java.lang.Object o,
                      int type)
Check if policy is of a certain type.

Parameters:
o - An object that can be resolved into a settings object.
type - the type to check against.
Returns:
true if the policy is of the submitted type


Copyright © 2003-2011 Internet Archive. All Rights Reserved.