Package org.archive.crawler.datamodel

Interface Summary
CoreAttributeConstants CrawlURI attribute keys used by the core crawler classes.
CrawlSubstats.HasCrawlSubstats  
FetchStatusCodes Constant flag codes to be used, in lieu of per-protocol codes (like HTTP's 200, 404, etc.), when network/internal/out-of-band conditions occur.
InstancePerThread Indicates that a processor should have an instance per ToeThread.
UriUniqFilter A UriUniqFilter passes URI objects to a destination (receiver) if the passed URI object has not been previously seen.
UriUniqFilter.HasUriReceiver URIs that have not been seen before 'visit' this 'Visitor'.
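The UriUniqFilter contract described above — forward a URI to its receiver only if it has not been previously seen — can be sketched as follows. The names SetBasedUniqFilter and Receiver are illustrative stand-ins, not Heritrix's actual API; real Heritrix implementations use UriUniqFilter.HasUriReceiver and are typically disk- or Bloom-filter-backed rather than an in-memory HashSet.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal in-memory sketch of the UriUniqFilter idea: a URI "visits" the
// receiver only on first sight. HashSet-backed for brevity; Heritrix's
// real filters persist their seen-set to survive large crawls.
public class SetBasedUniqFilter {

    /** Stand-in for UriUniqFilter.HasUriReceiver: receives never-before-seen URIs. */
    interface Receiver {
        void receive(String uri);
    }

    private final Set<String> seen = new HashSet<>();
    private final Receiver receiver;

    SetBasedUniqFilter(Receiver receiver) {
        this.receiver = receiver;
    }

    /** Offer a URI; it is forwarded to the receiver only the first time. */
    void add(String uri) {
        if (seen.add(uri)) {          // Set.add returns false for duplicates
            receiver.receive(uri);
        }
    }

    public static void main(String[] args) {
        List<String> delivered = new ArrayList<>();
        SetBasedUniqFilter filter = new SetBasedUniqFilter(delivered::add);
        filter.add("http://example.com/a");
        filter.add("http://example.com/b");
        filter.add("http://example.com/a");   // duplicate: not delivered again
        System.out.println(delivered);        // [http://example.com/a, http://example.com/b]
    }
}
```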

Class Summary
CandidateURI A URI, discovered or passed-in, that may be scheduled.
Checkpoint Record of a specific checkpoint on disk.
CrawlHost Represents a single remote "host".
CrawlOrder Represents the 'root' of the settings hierarchy.
CrawlServer Represents a single remote "server".
CrawlSubstats Collector of statistics for a 'subset' of a crawl, such as a server (host:port), host, or frontier group (eg queue).
CrawlURI Represents a candidate URI and the associated state it collects as it is crawled.
CredentialStore Front door to the credential store.
LocalizedError  
RobotsDirectives Represents the directives that apply to a user-agent (or set of user-agents).
RobotsExclusionPolicy RobotsExclusionPolicy represents the actual policy adopted with respect to a specific remote server, usually constructed by consulting the robots.txt, if any, that the server provided.
RobotsHonoringPolicy RobotsHonoringPolicy represents the strategy used by the crawler for determining how robots.txt files will be honored.
Robotstxt Utility class for parsing and representing 'robots.txt'-format directives into a list of named user-agents and a map from user-agents to RobotsDirectives.
ServerCache Server and Host cache.
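The Robotstxt entry above describes parsing 'robots.txt' text into a map from user-agent to RobotsDirectives. A simplified sketch of that idea follows; RobotstxtSketch and SimpleDirectives are hypothetical stand-ins for the real Robotstxt and RobotsDirectives classes, and only the User-agent and Disallow fields are handled here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the Robotstxt idea: parse robots.txt content into
// a map from user-agent name to its directives. Not Heritrix's actual
// parser; Allow, Crawl-delay, wildcards, etc. are omitted for brevity.
public class RobotstxtSketch {

    /** Stand-in for RobotsDirectives: the Disallow prefixes for one agent. */
    static class SimpleDirectives {
        final List<String> disallows = new ArrayList<>();

        /** A path is allowed unless it starts with a disallowed prefix. */
        boolean allows(String path) {
            for (String prefix : disallows) {
                if (!prefix.isEmpty() && path.startsWith(prefix)) {
                    return false;
                }
            }
            return true;
        }
    }

    static Map<String, SimpleDirectives> parse(String content) {
        Map<String, SimpleDirectives> byAgent = new LinkedHashMap<>();
        List<SimpleDirectives> current = new ArrayList<>();
        boolean lastWasAgent = false;
        for (String raw : content.split("\n")) {
            String line = raw.replaceAll("#.*", "").trim();  // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!lastWasAgent) {
                    current = new ArrayList<>();  // start a new record group
                }
                SimpleDirectives d = byAgent.computeIfAbsent(
                        value.toLowerCase(), k -> new SimpleDirectives());
                current.add(d);
                lastWasAgent = true;
            } else if (field.equals("disallow")) {
                for (SimpleDirectives d : current) {
                    d.disallows.add(value);
                }
                lastWasAgent = false;
            }
        }
        return byAgent;
    }

    public static void main(String[] args) {
        String txt = "User-agent: *\nDisallow: /private/\n";
        Map<String, SimpleDirectives> rules = parse(txt);
        System.out.println(rules.get("*").allows("/public/page"));   // true
        System.out.println(rules.get("*").allows("/private/page"));  // false
    }
}
```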

Enum Summary
CrawlSubstats.Stage  
Copyright © 2003-2011 Internet Archive. All Rights Reserved.