Uses of Class
org.archive.crawler.framework.Processor

Packages that use Processor
org.archive.crawler.datamodel   
org.archive.crawler.extractor   
org.archive.crawler.fetcher   
org.archive.crawler.framework   
org.archive.crawler.postprocessor   
org.archive.crawler.prefetch   
org.archive.crawler.processor   
org.archive.crawler.processor.recrawl   
org.archive.crawler.writer   
 

Uses of Processor in org.archive.crawler.datamodel
 

Methods in org.archive.crawler.datamodel that return Processor
 Processor CrawlURI.nextProcessor()
          Get the next processor to process this URI.
 

Methods in org.archive.crawler.datamodel with parameters of type Processor
 void CrawlURI.setNextProcessor(Processor processor)
          Set the next processor to process this URI.
 void CrawlURI.skipToProcessor(ProcessorChain processorChain, Processor processor)
          Set which processor should process this URI next, overriding the default next processor.
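
The three methods above suggest that each URI carries its own "next processor" pointer, which a processor may override while it runs. The following is a minimal, self-contained sketch of that idea. It assumes the pointer is preset to the default successor before each step; `SketchProcessor`, `SketchUri`, and `ChainRunner` are hypothetical stand-ins, not the Heritrix classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a processor; NOT org.archive.crawler.framework.Processor.
class SketchProcessor {
    final String name;
    SketchProcessor defaultNext; // default successor in the chain, or null at the end

    SketchProcessor(String name) {
        this.name = name;
    }

    // Hook for real work; an override may redirect the URI via uri.setNext(...).
    void process(SketchUri uri) {
    }
}

// Hypothetical stand-in for a URI that remembers which processor handles it next.
class SketchUri {
    private SketchProcessor next;
    final List<String> visited = new ArrayList<>();

    SketchProcessor next() {
        return next;
    }

    void setNext(SketchProcessor p) {
        next = p;
    }
}

class ChainRunner {
    // Walk the chain. Before each step the default successor is preset,
    // so process() can replace it and thereby skip ahead.
    static void run(SketchUri uri, SketchProcessor first) {
        uri.setNext(first);
        while (uri.next() != null) {
            SketchProcessor current = uri.next();
            uri.setNext(current.defaultNext);
            uri.visited.add(current.name);
            current.process(uri);
        }
    }

    // Builds fetch -> extract -> write, where "fetch" skips straight to "write".
    static List<String> demo() {
        final SketchProcessor write = new SketchProcessor("write");
        SketchProcessor extract = new SketchProcessor("extract");
        SketchProcessor fetch = new SketchProcessor("fetch") {
            @Override
            void process(SketchUri uri) {
                uri.setNext(write); // override the default successor ("extract")
            }
        };
        fetch.defaultNext = extract;
        extract.defaultNext = write;

        SketchUri uri = new SketchUri();
        run(uri, fetch);
        return uri.visited; // [fetch, write]
    }
}
```

Keeping the pointer on the URI rather than on the chain lets different URIs take different paths through the same set of processors.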
 

Uses of Processor in org.archive.crawler.extractor
 

Subclasses of Processor in org.archive.crawler.extractor
 class AggressiveExtractorHTML
          Extended version of ExtractorHTML with more aggressive JavaScript link extraction, where JavaScript code is parsed first with the general HTML tags regexp and then with the speculative JavaScript link regexp.
 class ChangeEvaluator
          This processor compares the CrawlURI's current content digest with the one from a previous crawl.
 class Extractor
          Convenience shared superclass for Extractor Processors.
 class ExtractorCSS
          This extractor parses URIs from CSS files.
 class ExtractorDOC
          This class allows the caller to extract href-style links from Word 97-format Word documents.
 class ExtractorHTML
          Basic link-extraction, from an HTML content-body, using regular expressions.
 class ExtractorHTTP
          Extracts URIs from HTTP response headers.
 class ExtractorImpliedURI
          An extractor for finding 'implied' URIs inside other URIs.
 class ExtractorJS
          Processes JavaScript files for strings that are likely to be crawlable URIs.
 class ExtractorPDF
          Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs.
 class ExtractorSWF
          Processes SWF (Flash/Shockwave) files for strings that are likely to be crawlable URIs.
 class ExtractorUniversal
          A last-ditch extractor that looks at the raw bytes and tries to extract anything that looks like a link.
 class ExtractorURI
          An extractor for finding URIs inside other URIs.
 class ExtractorXML
          A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents)
 class HTTPContentDigest
          A processor for calculating custom HTTP content digests in place of the default (if any) computed by the HTTP fetcher processors.
 class JerichoExtractorHTML
          Improved link extraction from an HTML content-body using the jericho-html parser.
 class TrapSuppressExtractor
          Pseudo-extractor that suppresses link-extraction of likely trap pages, by noticing when content's digest is identical to that of its 'via'.
 

Uses of Processor in org.archive.crawler.fetcher
 

Subclasses of Processor in org.archive.crawler.fetcher
 class FetchDNS
          Processor to resolve 'dns:' URIs.
 class FetchFTP
          Fetches documents and directory listings using FTP.
 class FetchHTTP
          HTTP fetcher that uses Apache Jakarta Commons HttpClient library.
 

Uses of Processor in org.archive.crawler.framework
 

Subclasses of Processor in org.archive.crawler.framework
 class Scoper
          Base class for Scopers.
 class WriterPoolProcessor
          Abstract implementation of a file pool processor.
 

Methods in org.archive.crawler.framework that return Processor
 Processor Processor.getDefaultNextProcessor(CrawlURI curi)
          Returns the next processor for the given CrawlURI in the processor chain.
 Processor ProcessorChain.getFirstProcessor()
          Get the first processor in the chain.
 Processor ProcessorChain.getProcessor(java.lang.Class classType)
          Get the first processor that is of class classType or a subclass of it.
 Processor Processor.spawn(int serialNum)
           
 

Methods in org.archive.crawler.framework with parameters of type Processor
 void Processor.setDefaultNextProcessor(Processor nextProcessor)
          Set the default next processor in the chain.
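
The type-based lookup described for `ProcessorChain.getProcessor(java.lang.Class classType)` ("the first processor that is of class classType or a subclass of it") can be illustrated with a short sketch. This is not the Heritrix implementation; `TypeLookup` and `firstOfType` are hypothetical names, and the sketch assumes that a miss is reported as `null`.

```java
import java.util.List;

class TypeLookup {
    // Returns the first element that is an instance of the given type
    // (the type itself or any subclass), or null if there is none.
    static <T> T firstOfType(List<?> chain, Class<T> type) {
        for (Object candidate : chain) {
            if (type.isInstance(candidate)) {
                return type.cast(candidate);
            }
        }
        return null;
    }
}
```

For example, searching a list containing a `String`, an `Integer`, and a `Double` for `Number.class` returns the `Integer`, because `Integer` is a subclass of `Number` and appears first.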
 

Uses of Processor in org.archive.crawler.postprocessor
 

Subclasses of Processor in org.archive.crawler.postprocessor
 class AcceptRevisitProcessor
          Set a URI to be revisited by the ARFrontier.
 class ContentBasedWaitEvaluator
          A WaitEvaluator that compares the CrawlURI's content type to a configurable regular expression.
 class CrawlStateUpdater
          A step, late in the processing of a CrawlURI, for updating the per-host information that may have been affected by the fetch.
 class FrontierScheduler
          'Schedule' with the Frontier CandidateURIs being carried by the passed CrawlURI.
 class ImageWaitEvaluator
          A specialized ContentBasedWaitEvaluator.
 class LinksScoper
          Determine which extracted links are within scope.
 class LowDiskPauseProcessor
          Processor module which uses 'df -k', where available and with the expected output format (on Linux), to monitor available disk space and pause the crawl if free space on monitored filesystems falls below certain thresholds.
 class RejectRevisitProcessor
          Set a URI to not be revisited by the ARFrontier.
 class SupplementaryLinksScoper
          Run CandidateURI links carried in the passed CrawlURI through a filter and 'handle' rejections.
 class TextWaitEvaluator
          A specialized ContentBasedWaitEvaluator.
 class WaitEvaluator
          A processor that determines when a URI should be revisited next.
 

Uses of Processor in org.archive.crawler.prefetch
 

Subclasses of Processor in org.archive.crawler.prefetch
 class PreconditionEnforcer
          Ensures the preconditions for a fetch -- such as DNS lookup or acquiring and respecting a robots.txt policy -- are satisfied before a URI is passed to subsequent stages.
 class Preselector
          If set to recheck the crawl's scope, gives a yes/no on whether a CrawlURI should be processed at all.
 class QuotaEnforcer
          A simple quota enforcer.
 class RuntimeLimitEnforcer
          A processor to enforce runtime limits on crawls.
 

Uses of Processor in org.archive.crawler.processor
 

Subclasses of Processor in org.archive.crawler.processor
 class BeanShellProcessor
          A processor which runs a BeanShell script on the CrawlURI.
 class CrawlMapper
          A simple crawl splitter/mapper, dividing up CandidateURIs/CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers).
 class HashCrawlMapper
          Maps URIs to one of N crawler names by applying a hash to the URI's (possibly-transformed) classKey.
 class LexicalCrawlMapper
          A simple crawl splitter/mapper, dividing up CandidateURIs/CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers).
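
The hash-based assignment described for HashCrawlMapper (mapping a URI's classKey to one of N crawler names) can be sketched as follows. This is an illustration only: it substitutes `String.hashCode()` for whatever hash function the real class uses, applies no key transformation, and the names `HashMapperSketch` and `crawlerFor` are hypothetical.

```java
class HashMapperSketch {
    // Picks one of the crawler names by hashing the key and reducing it
    // modulo the number of crawlers. Math.floorMod keeps the bucket
    // non-negative even when hashCode() is negative.
    static String crawlerFor(String classKey, String[] crawlerNames) {
        int bucket = Math.floorMod(classKey.hashCode(), crawlerNames.length);
        return crawlerNames[bucket];
    }
}
```

Because the result depends only on the key and the number of crawlers, every crawler in the group computes the same owner for a given key without coordination. Changing N reassigns most keys, which is a known trade-off of plain modulo hashing.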
 

Uses of Processor in org.archive.crawler.processor.recrawl
 

Subclasses of Processor in org.archive.crawler.processor.recrawl
 class FetchHistoryProcessor
          Maintain a history of fetch information inside the CrawlURI's attributes.
 class PersistLoadProcessor
          Store CrawlURI attributes from latest fetch to persistent storage for consultation by a later recrawl.
 class PersistLogProcessor
          Log CrawlURI attributes from latest fetch for consultation by a later recrawl.
 class PersistOnlineProcessor
          Common superclass for persisting Processors which directly store/load to persistence (as opposed to logging for batch load later).
 class PersistProcessor
          Superclass for Processors which utilize BDB-JE for URI state (including most notably history) persistence.
 class PersistStoreProcessor
          Store CrawlURI attributes from latest fetch to persistent storage for consultation by a later recrawl.
 

Uses of Processor in org.archive.crawler.writer
 

Subclasses of Processor in org.archive.crawler.writer
 class ARCWriterProcessor
          Processor module for writing the results of successful fetches (and perhaps someday, certain kinds of network failures) to the Internet Archive ARC file format.
 class Kw3WriterProcessor
          Processor module that writes the results of successful fetches to files on disk.
 class MirrorWriterProcessor
          Processor module that writes the results of successful fetches to files on disk.
 class WARCWriterProcessor
          WARCWriterProcessor.
 



Copyright © 2003-2011 Internet Archive. All Rights Reserved.