Package org.archive.crawler.postprocessor

Class Summary
AcceptRevisitProcessor Set a URI to be revisited by the ARFrontier.
ContentBasedWaitEvaluator A WaitEvaluator that compares the CrawlURI's content type to a configurable regular expression.
CrawlStateUpdater A step, late in the processing of a CrawlURI, for updating the per-host information that may have been affected by the fetch.
FrontierScheduler 'Schedule' with the Frontier the CandidateURIs carried by the passed CrawlURI.
ImageWaitEvaluator A specialized ContentBasedWaitEvaluator.
LinksScoper Determine which extracted links are within scope.
LowDiskPauseProcessor Processor module which uses 'df -k', where available and with the expected output format (on Linux), to monitor available disk space and pause the crawl if free space on monitored filesystems falls below certain thresholds.
RejectRevisitProcessor Set a URI to not be revisited by the ARFrontier.
SupplementaryLinksScoper Run CandidateURI links carried in the passed CrawlURI through a filter and 'handle' rejections.
TextWaitEvaluator A specialized ContentBasedWaitEvaluator.
WaitEvaluator A processor that determines when a URI should be revisited next.
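
Example (illustrative only, not taken from the Heritrix source): the LowDiskPauseProcessor summary above describes reading 'df -k' output and reacting when free space drops below a threshold. A minimal sketch of that idea follows, assuming the usual six-column Linux layout (filesystem, 1K-blocks, used, available, use%, mount point). The class name DfSketch, the method lowMounts, and the threshold parameter are hypothetical names introduced here; the actual processor's parsing and configuration may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of org.archive.crawler.postprocessor.
final class DfSketch {
    // Assumed Linux `df -k` data row:
    // filesystem, 1K-blocks, used, available, use%, mount point.
    private static final Pattern ROW = Pattern.compile(
            "^(\\S+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)%\\s+(\\S.*)$");

    // Returns the mount points whose available space (in KB) is below thresholdKb.
    static List<String> lowMounts(String dfOutput, long thresholdKb) {
        List<String> low = new ArrayList<>();
        for (String line : dfOutput.split("\\r?\\n")) {
            Matcher m = ROW.matcher(line.trim());
            if (!m.matches()) {
                continue; // header line, wrapped row, or unexpected format
            }
            long availableKb = Long.parseLong(m.group(4));
            if (availableKb < thresholdKb) {
                low.add(m.group(6));
            }
        }
        return low;
    }
}
```

A caller could treat a non-empty result as the signal to pause. Rows that do not match the assumed layout (for example, a long device name wrapped onto its own line) are skipped rather than guessed at, which mirrors the summary's caveat about the expected output format.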

Copyright © 2003-2011 Internet Archive. All Rights Reserved.