org.archive.crawler.datamodel
Class RobotsExclusionPolicy

java.lang.Object
  extended by org.archive.crawler.datamodel.RobotsExclusionPolicy
All Implemented Interfaces:
java.io.Serializable

public class RobotsExclusionPolicy
extends java.lang.Object
implements java.io.Serializable

RobotsExclusionPolicy represents the actual policy adopted with respect to a specific remote server, usually constructed by consulting the robots.txt file, if any, that the server provided. (The similarly named RobotsHonoringPolicy, on the other hand, describes the strategy the crawler uses to decide to what extent it respects exclusion rules.) Expiration of policies after a suitable amount of time has elapsed since the last fetch is handled outside this class, in CrawlServer itself.

TODO: refactor RobotsHonoringPolicy to be a class-per-policy, and then see if a CrawlServer with a HonoringPolicy and a RobotsTxt makes this mediating class unnecessary.
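
The following sketch (not part of the class) illustrates how the methods documented below fit together: a policy is built with policyFor and then consulted with disallows before a fetch. The mayFetch helper and every value it receives are placeholders assumed to be supplied by the surrounding crawler code; only the two RobotsExclusionPolicy calls are taken from this API.

    // Illustrative only: mayFetch and its parameters are assumed to be
    // supplied by the surrounding crawler code.
    boolean mayFetch(CrawlerSettings settings,
                     RobotsHonoringPolicy honoringPolicy,
                     java.io.BufferedReader robotsTxtReader,
                     CrawlURI curi,
                     java.lang.String userAgent)
            throws java.io.IOException {
        RobotsExclusionPolicy policy =
            RobotsExclusionPolicy.policyFor(settings, robotsTxtReader, honoringPolicy);
        return !policy.disallows(curi, userAgent);
    }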

Author:
gojomo
See Also:
Serialized Form

Field Summary
static RobotsExclusionPolicy ALLOWALL
          Canned policy that allows all URIs.
static RobotsExclusionPolicy DENYALL
          Canned policy that denies all URIs.
(package private)  RobotsHonoringPolicy honoringPolicy
          The RobotsHonoringPolicy in effect for this policy.
 
Constructor Summary
RobotsExclusionPolicy(CrawlerSettings settings, Robotstxt robotstxt, RobotsHonoringPolicy honoringPolicy)
           
RobotsExclusionPolicy(int type)
           
 
Method Summary
 boolean disallows(CrawlURI curi, java.lang.String userAgent)
          Whether fetching the given CrawlURI with the given user-agent is disallowed by this policy.
 float getCrawlDelay(java.lang.String userAgent)
          Get the crawl-delay that applies to the given user-agent, or -1 (indicating no crawl-delay is known) if there is no internal Robotstxt instance.
static RobotsExclusionPolicy policyFor(CrawlerSettings settings, java.io.BufferedReader reader, RobotsHonoringPolicy honoringPolicy)
          Construct a policy from robots.txt content read from the given reader.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ALLOWALL

public static RobotsExclusionPolicy ALLOWALL
Canned policy that allows all URIs.

DENYALL

public static RobotsExclusionPolicy DENYALL
Canned policy that denies all URIs.

honoringPolicy

transient RobotsHonoringPolicy honoringPolicy
The RobotsHonoringPolicy in effect for this policy.
Constructor Detail

RobotsExclusionPolicy

public RobotsExclusionPolicy(CrawlerSettings settings,
                             Robotstxt robotstxt,
                             RobotsHonoringPolicy honoringPolicy)
Parameters:
settings - crawler settings in effect for this policy
robotstxt - parsed robots.txt rules for the server
honoringPolicy - the RobotsHonoringPolicy in effect for the crawl

RobotsExclusionPolicy

public RobotsExclusionPolicy(int type)
Method Detail

policyFor

public static RobotsExclusionPolicy policyFor(CrawlerSettings settings,
                                              java.io.BufferedReader reader,
                                              RobotsHonoringPolicy honoringPolicy)
                                       throws java.io.IOException
Construct a policy from robots.txt content read from the given reader.

Parameters:
settings - crawler settings in effect for this policy
reader - reader over the robots.txt content to parse
honoringPolicy - the RobotsHonoringPolicy in effect for the crawl
Returns:
Robots exclusion policy constructed from the parsed rules.
Throws:
java.io.IOException
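
As an informal illustration, robots.txt text already held in memory can be wrapped in a BufferedReader and passed to this method. The inline rules, the surrounding settings and honoringPolicy values, and the DENYALL fallback on IOException are all assumptions made for this example, not behaviour prescribed by policyFor.

    // Illustrative only: settings and honoringPolicy are assumed to exist
    // in the calling context.
    RobotsExclusionPolicy policy;
    try {
        java.io.BufferedReader reader = new java.io.BufferedReader(
                new java.io.StringReader("User-agent: *\nDisallow: /cgi-bin/\n"));
        policy = RobotsExclusionPolicy.policyFor(settings, reader, honoringPolicy);
    } catch (java.io.IOException e) {
        policy = RobotsExclusionPolicy.DENYALL;   // fallback chosen for this example
    }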

disallows

public boolean disallows(CrawlURI curi,
                         java.lang.String userAgent)
Returns whether fetching the given CrawlURI with the given user-agent is disallowed by this policy.

getCrawlDelay

public float getCrawlDelay(java.lang.String userAgent)
Get the crawl-delay that applies to the given user-agent, or -1 (indicating no crawl-delay is known) if there is no internal Robotstxt instance.

Parameters:
userAgent - the user-agent string to look up
Returns:
float Crawl-Delay value, or -1 if none is available
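
A minimal sketch of acting on the returned value, assuming the conventional reading of Crawl-Delay as seconds; the politenessDelayMs helper and defaultDelayMs are placeholders invented for the example.

    // Illustrative only: defaultDelayMs is a hypothetical crawler-wide default.
    long politenessDelayMs(RobotsExclusionPolicy policy,
                           java.lang.String userAgent,
                           long defaultDelayMs) {
        float crawlDelay = policy.getCrawlDelay(userAgent);   // seconds, or -1
        if (crawlDelay >= 0) {
            return (long) (crawlDelay * 1000);   // convert seconds to milliseconds
        }
        return defaultDelayMs;                   // no Crawl-Delay known for this agent
    }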


Copyright © 2003-2011 Internet Archive. All Rights Reserved.