Package org.archive.crawler.frontier

Interface Summary
AdaptiveRevisitAttributeConstants Defines static constants for the Adaptive Revisiting module defining data keys in the CrawlURI AList.
FrontierJournal Record of key Frontier happenings.

Class Summary
AbstractFrontier Shared facilities for Frontier implementations.
AdaptiveRevisitFrontier A Frontier that will repeatedly visit all encountered URIs.
AdaptiveRevisitHostQueue A priority-based queue of CrawlURIs.
AdaptiveRevisitQueueList Maintains an ordered list of AdaptiveRevisitHostQueues used by a Frontier.
AntiCalendarCostAssignmentPolicy CostAssignmentPolicy that further penalizes URIs containing calendar-suggestive strings by adding an extra unit of cost.
BdbFrontier A Frontier using several BerkeleyDB JE Databases to hold its record of known hosts (queues) and pending URIs.
BdbMultipleWorkQueues A BerkeleyDB-database-backed structure for holding ordered groupings of CrawlURIs.
BdbWorkQueue One independent queue of items with the same 'classKey' (e.g., host).
BucketQueueAssignmentPolicy Uses the target IPs as basis for queue-assignment, distributing them over a fixed number of sub-queues.
CostAssignmentPolicy Calculate an integer 'cost' value for the given CrawlURI.
DomainSensitiveFrontier Deprecated. As of release 1.10.0.
HostnameQueueAssignmentPolicy QueueAssignmentPolicy based on the hostname:port evident in the given CrawlURI.
IPQueueAssignmentPolicy Uses target IP as basis for queue-assignment, unless it is unavailable, in which case it behaves as HostnameQueueAssignmentPolicy.
QueueAssignmentPolicy Establishes a mapping from CrawlURIs to String keys (queue names).
RecoveryJournal Helper class for managing a simple Frontier change-events journal which is useful for recovering from crawl problems.
RecyclingSerialBinding A SerialBinding that recycles a single FastOutputStream per thread, avoiding reallocation of the internal buffer for either repeated serializations or because of mid-serialization expansions.
SurtAuthorityQueueAssignmentPolicy QueueAssignmentPolicy based on the SURT form of the hostname.
TopmostAssignedSurtQueueAssignmentPolicy Create a queueKey based on the SURT authority, reduced to the public-suffix-plus-one domain (topmost assignable domain).
UnitCostAssignmentPolicy A CostAssignmentPolicy that uses a constant value of 1 for all CrawlURIs.
WagCostAssignmentPolicy A CostAssignmentPolicy based on some wild guesses of kinds of URIs that should be deferred into the (potentially never-crawled) future.
WorkQueue A single queue of related URIs to visit, grouped by a classKey (typically "hostname:port" or similar).
WorkQueueFrontier A common Frontier base using several queues to hold pending URIs.
ZeroCostAssignmentPolicy CostAssignmentPolicy considering all URIs costless -- essentially disabling budgeting features.
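
The queue-assignment and cost-assignment ideas summarized above can be illustrated with a small standalone sketch. It does not use the classes of this package: FrontierPolicySketch and its methods hostPortKey, bucketFor and cost are hypothetical names, and the default-port fallback, the "default" key, the hash-based bucket choice and the "calendar" substring test are assumptions made for illustration only, not the behavior of the actual policies.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical, standalone illustration; not part of org.archive.crawler.frontier. */
public class FrontierPolicySketch {

    /** Builds a "hostname:port" queue key, in the spirit of a hostname-based policy. */
    static String hostPortKey(URI uri) {
        String host = uri.getHost();
        if (host == null) {
            return "default"; // assumed fallback key when no host is present
        }
        int port = uri.getPort();
        if (port == -1) {
            // Assumption: fall back to the scheme's conventional port.
            port = "https".equalsIgnoreCase(uri.getScheme()) ? 443 : 80;
        }
        return host.toLowerCase(Locale.ROOT) + ":" + port;
    }

    /** Distributes keys (for example IP strings) over a fixed number of sub-queues. */
    static int bucketFor(String key, int bucketCount) {
        // floorMod keeps the result in [0, bucketCount) even for negative hash codes.
        return Math.floorMod(key.hashCode(), bucketCount);
    }

    /** Unit cost, plus one extra unit for a calendar-suggestive URI. */
    static int cost(URI uri) {
        int cost = 1;
        if (uri.toString().toLowerCase(Locale.ROOT).contains("calendar")) {
            cost++; // assumed heuristic: a plain substring match
        }
        return cost;
    }

    public static void main(String[] args) {
        URI a = URI.create("http://Example.org/calendar/2011/05");
        URI b = URI.create("https://example.org:8443/index.html");
        System.out.println(hostPortKey(a) + " cost=" + cost(a));
        System.out.println(hostPortKey(b) + " cost=" + cost(b));
        System.out.println("bucket=" + bucketFor("192.0.2.7", 4));
    }
}
```

In this sketch, URIs that share a key land in the same queue, and a higher cost marks a URI that a budget-aware frontier might defer.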

Copyright © 2003-2011 Internet Archive. All Rights Reserved.