org.archive.crawler.datamodel
Class RobotsExclusionPolicy
java.lang.Object
org.archive.crawler.datamodel.RobotsExclusionPolicy
- All Implemented Interfaces:
- java.io.Serializable
public class RobotsExclusionPolicy
- extends java.lang.Object
- implements java.io.Serializable
RobotsExclusionPolicy represents the actual policy adopted with
respect to a specific remote server, usually constructed by
consulting the robots.txt, if any, that the server provided.
(The similarly named RobotsHonoringPolicy, on the other hand,
describes the strategy used by the crawler to determine to what
extent it respects exclusion rules.)
The expiration of policies after a suitable amount of time has
elapsed since the last fetch is handled outside this class, in
CrawlServer itself.
TODO: refactor RobotsHonoringPolicy to be a class-per-policy, and
then see if a CrawlServer with a HonoringPolicy and a RobotsTxt
makes this mediating class unnecessary.
- Author:
- gojomo
- See Also:
- Serialized Form
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
ALLOWALL
public static RobotsExclusionPolicy ALLOWALL
DENYALL
public static RobotsExclusionPolicy DENYALL
honoringPolicy
transient RobotsHonoringPolicy honoringPolicy
RobotsExclusionPolicy
public RobotsExclusionPolicy(CrawlerSettings settings,
Robotstxt robotstxt,
RobotsHonoringPolicy honoringPolicy)
- Parameters:
settings
- robotstxt
- honoringPolicy
-
RobotsExclusionPolicy
public RobotsExclusionPolicy(int type)
policyFor
public static RobotsExclusionPolicy policyFor(CrawlerSettings settings,
java.io.BufferedReader reader,
RobotsHonoringPolicy honoringPolicy)
throws java.io.IOException
- Parameters:
settings
- reader
- honoringPolicy
-
- Returns:
- Robot exclusion policy.
- Throws:
java.io.IOException
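As a rough illustration of what building a policy from a reader involves, the sketch below reads robots.txt text from a java.io.BufferedReader and collects the Disallow prefixes of the "User-agent: *" group. It is not the Heritrix implementation: the class RobotsReaderSketch and its methods disallowedPrefixes and disallows are hypothetical names, CrawlerSettings and RobotsHonoringPolicy are left out entirely, and matching is reduced to a plain path-prefix test.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in; not part of org.archive.crawler.datamodel.
class RobotsReaderSketch {

    /** Collects Disallow path prefixes that apply to the "User-agent: *" group. */
    static List<String> disallowedPrefixes(BufferedReader reader) throws IOException {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        boolean lastWasAgent = false;
        String line;
        while ((line = reader.readLine()) != null) {
            // Drop trailing comments, then surrounding whitespace.
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank line or unrecognized content
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines share one group of rules.
                boolean isWildcard = value.equals("*");
                inWildcardGroup = lastWasAgent ? (inWildcardGroup || isWildcard) : isWildcard;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    /** Returns true if the path starts with any collected Disallow prefix. */
    static boolean disallows(List<String> prefixes, String path) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        String robots = "User-agent: *\nDisallow: /private/\n# a comment\nDisallow:\n";
        List<String> rules = disallowedPrefixes(new BufferedReader(new StringReader(robots)));
        System.out.println(disallows(rules, "/private/a.html"));
        System.out.println(disallows(rules, "/public/a.html"));
    }
}
```

Because the reader can fail mid-stream, the sketch propagates java.io.IOException to the caller, as the policyFor signature above also does.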
disallows
public boolean disallows(CrawlURI curi,
java.lang.String userAgent)
getCrawlDelay
public float getCrawlDelay(java.lang.String userAgent)
- Get the crawl-delay that applies to the given user-agent, or
-1 (indicating no crawl-delay known) if there is no internal
Robotstxt instance.
- Parameters:
userAgent
-
- Returns:
- float Crawl-Delay value, or -1 if none available
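Since -1 is a sentinel rather than a usable delay, callers need to check for it before scheduling. The following is a minimal sketch of one way to do that. The class CrawlDelaySketch, the method effectiveDelayMillis, and the choice to treat the crawler's own default as a lower bound are all assumptions made for illustration, not behavior documented for Heritrix.

```java
// Hypothetical helper; not part of org.archive.crawler.datamodel.
class CrawlDelaySketch {

    /**
     * Turns a value shaped like getCrawlDelay's result (seconds, or a
     * negative sentinel when unknown) into a wait time in milliseconds.
     * Assumption: the crawler's default delay acts as a lower bound.
     */
    static long effectiveDelayMillis(float crawlDelaySeconds, long defaultMillis) {
        if (crawlDelaySeconds < 0) {
            // No crawl-delay known: fall back to the default.
            return defaultMillis;
        }
        long requestedMillis = Math.round(crawlDelaySeconds * 1000.0);
        return Math.max(defaultMillis, requestedMillis);
    }

    public static void main(String[] args) {
        System.out.println(effectiveDelayMillis(-1f, 3000L));
        System.out.println(effectiveDelayMillis(10f, 3000L));
    }
}
```

Comparing with `< 0` rather than `== -1` avoids an exact floating-point equality test on the sentinel.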
Copyright © 2003-2011 Internet Archive. All Rights Reserved.