org.archive.crawler.datamodel
Class RobotsHonoringPolicy
java.lang.Object
javax.management.Attribute
org.archive.crawler.settings.Type
org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ModuleType
org.archive.crawler.datamodel.RobotsHonoringPolicy
- All Implemented Interfaces:
- java.io.Serializable, javax.management.DynamicMBean
public class RobotsHonoringPolicy
- extends ModuleType
RobotsHonoringPolicy represents the strategy used by the crawler
for determining how robots.txt files will be honored.
Five kinds of policies exist:
- classic:
- obey the first set of robots.txt directives that apply to your
current user-agent
- ignore:
- ignore robots.txt directives entirely
- custom:
- obey a specific operator-entered set of robots.txt directives
for a given host
- most-favored:
- obey the most liberal restrictions offered (if *any* crawler is
allowed to get a page, get it)
- most-favored-set:
- given some set of user-agent patterns, obey the most liberal
restriction offered to any
The last two policies can optionally adopt a different user-agent
to reflect the restrictions we've opted to use.
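The five policies can be illustrated with a small, self-contained sketch. This is not the Heritrix implementation: the class `PolicySketch`, its `Policy` enum, and the methods `permits`, `allowed`, and `sampleRules` are hypothetical names introduced here, robots.txt rules are reduced to lists of disallowed path prefixes keyed by user-agent name, and agent matching is simplified to a plain substring or exact-key check. The handling of the case where no rule block applies (treated as "allowed") is likewise an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PolicySketch {
    // Hypothetical stand-ins for the five policy kinds (not the real int constants).
    enum Policy { CLASSIC, IGNORE, CUSTOM, MOST_FAVORED, MOST_FAVORED_SET }

    // True if none of the disallowed prefixes matches the path.
    static boolean permits(List<String> disallowed, String path) {
        for (String prefix : disallowed) {
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    // siteRules: user-agent name -> disallowed prefixes, in robots.txt order.
    static boolean allowed(Policy policy, String path, String ourAgent,
                           Map<String, List<String>> siteRules,
                           List<String> customRules, List<String> agentSet) {
        switch (policy) {
            case IGNORE:
                // Robots directives are not consulted at all.
                return true;
            case CUSTOM:
                // Operator-entered rules replace whatever the site publishes.
                return permits(customRules, path);
            case CLASSIC:
                // Obey the first block that applies to our own user-agent.
                for (Map.Entry<String, List<String>> e : siteRules.entrySet()) {
                    if (e.getKey().equals("*") || ourAgent.contains(e.getKey())) {
                        return permits(e.getValue(), path);
                    }
                }
                return true; // assumption: no applicable block means allowed
            case MOST_FAVORED:
                // If any crawler may fetch the path, fetch it.
                for (List<String> rules : siteRules.values()) {
                    if (permits(rules, path)) {
                        return true;
                    }
                }
                return siteRules.isEmpty();
            case MOST_FAVORED_SET:
                // Like MOST_FAVORED, but only blocks for agents in agentSet count.
                boolean sawBlock = false;
                for (String agent : agentSet) {
                    List<String> rules = siteRules.get(agent);
                    if (rules != null) {
                        sawBlock = true;
                        if (permits(rules, path)) {
                            return true;
                        }
                    }
                }
                return !sawBlock; // assumption: no matching block means allowed
            default:
                throw new IllegalArgumentException("unknown policy: " + policy);
        }
    }

    // Example rules: "googlebot" may fetch everything, everyone else is kept out of /private.
    static Map<String, List<String>> sampleRules() {
        Map<String, List<String>> rules = new LinkedHashMap<>();
        rules.put("googlebot", Collections.<String>emptyList());
        rules.put("*", Arrays.asList("/private"));
        return rules;
    }

    public static void main(String[] args) {
        List<String> custom = Arrays.asList("/tmp");
        List<String> set = Arrays.asList("googlebot");
        for (Policy p : Policy.values()) {
            System.out.println(p + " -> "
                    + allowed(p, "/private/x", "heritrix", sampleRules(), custom, set));
        }
    }
}
```

With these sample rules, only the classic policy refuses `/private/x` for a crawler named `heritrix`; the other four allow it, each for a different reason.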
- Author:
- John Erik Halse
- See Also:
- Serialized Form
Method Summary |
java.lang.String |
getCustomRobots(CrawlerSettings settings)
Get the supplied custom robots.txt. |
int |
getType(java.lang.Object context)
Get the policy-type. |
StringList |
getUserAgents(CrawlerSettings settings)
If the policy type is most-favored-set, this method
gets a list of all user agents in that set. |
boolean |
isType(java.lang.Object o,
int type)
Check if policy is of a certain type. |
boolean |
shouldMasquerade(CrawlURI curi)
Returns true if the crawler should masquerade as the user agent
whose restrictions it opted to use. |
Methods inherited from class org.archive.crawler.settings.ComplexType |
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute |
Methods inherited from class org.archive.crawler.settings.Type |
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
Methods inherited from class javax.management.Attribute |
getName, hashCode |
Methods inherited from class java.lang.Object |
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
CLASSIC
public static final int CLASSIC
- See Also:
- Constant Field Values
IGNORE
public static final int IGNORE
- See Also:
- Constant Field Values
CUSTOM
public static final int CUSTOM
- See Also:
- Constant Field Values
MOST_FAVORED
public static final int MOST_FAVORED
- See Also:
- Constant Field Values
MOST_FAVORED_SET
public static final int MOST_FAVORED_SET
- See Also:
- Constant Field Values
ATTR_NAME
public static final java.lang.String ATTR_NAME
- See Also:
- Constant Field Values
ATTR_TYPE
public static final java.lang.String ATTR_TYPE
- See Also:
- Constant Field Values
ATTR_MASQUERADE
public static final java.lang.String ATTR_MASQUERADE
- See Also:
- Constant Field Values
ATTR_CUSTOM_ROBOTS
public static final java.lang.String ATTR_CUSTOM_ROBOTS
- See Also:
- Constant Field Values
ATTR_USER_AGENTS
public static final java.lang.String ATTR_USER_AGENTS
- See Also:
- Constant Field Values
RobotsHonoringPolicy
public RobotsHonoringPolicy(java.lang.String name)
- Creates a new instance of RobotsHonoringPolicy.
- Parameters:
name
- the name of the RobotsHonoringPolicy attribute.
RobotsHonoringPolicy
public RobotsHonoringPolicy()
getUserAgents
public StringList getUserAgents(CrawlerSettings settings)
- If the policy type is most-favored-set, this method
gets a list of all user agents in that set.
- Returns:
- List of Strings with user agents
shouldMasquerade
public boolean shouldMasquerade(CrawlURI curi)
- Returns true if the crawler should masquerade as the user agent
whose restrictions it opted to use.
(Only relevant for the policy types most-favored and most-favored-set.)
- Returns:
- true if we should masquerade
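One plausible reading of this behavior is sketched below. It is a guess at the logic, not the actual method body: the class `MasqueradeSketch`, the helper `userAgentToSend`, the boolean `masqueradeEnabled` flag, and the integer values assigned to the policy constants are all assumptions made for illustration (the real values are listed under Constant Field Values).

```java
public class MasqueradeSketch {
    // Hypothetical values; the real constants are defined by RobotsHonoringPolicy.
    static final int CLASSIC = 0, IGNORE = 1, CUSTOM = 2,
            MOST_FAVORED = 3, MOST_FAVORED_SET = 4;

    // Masquerading only makes sense when the crawler has adopted
    // another agent's restrictions, i.e. for the two most-favored policies.
    static boolean shouldMasquerade(int policyType, boolean masqueradeEnabled) {
        boolean adoptsForeignRules =
                policyType == MOST_FAVORED || policyType == MOST_FAVORED_SET;
        return masqueradeEnabled && adoptsForeignRules;
    }

    // Picks the user-agent name to present: the favored agent's when
    // masquerading, otherwise the crawler's own.
    static String userAgentToSend(int policyType, boolean masqueradeEnabled,
                                  String ownAgent, String favoredAgent) {
        return shouldMasquerade(policyType, masqueradeEnabled)
                ? favoredAgent
                : ownAgent;
    }

    public static void main(String[] args) {
        // prints "googlebot"
        System.out.println(userAgentToSend(MOST_FAVORED, true, "heritrix", "googlebot"));
        // prints "heritrix"
        System.out.println(userAgentToSend(CLASSIC, true, "heritrix", "googlebot"));
    }
}
```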
getCustomRobots
public java.lang.String getCustomRobots(CrawlerSettings settings)
- Get the supplied custom robots.txt.
- Returns:
- String with content of alternate robots.txt
getType
public int getType(java.lang.Object context)
- Get the policy-type.
- Returns:
- policy type
- See Also:
- CLASSIC, IGNORE, CUSTOM, MOST_FAVORED, MOST_FAVORED_SET
isType
public boolean isType(java.lang.Object o,
int type)
- Check if policy is of a certain type.
- Parameters:
o
- an object that can be resolved into a settings object.
type
- the type to check against.
- Returns:
- true if the policy is of the submitted type
Copyright © 2003-2011 Internet Archive. All Rights Reserved.