Uses of Class org.archive.crawler.framework.Processor (Heritrix 1.15.5-201106092337)

Uses of Class
org.archive.crawler.framework.Processor

Packages that use Processor
org.archive.crawler.datamodel
org.archive.crawler.extractor
org.archive.crawler.fetcher
org.archive.crawler.framework
org.archive.crawler.postprocessor
org.archive.crawler.prefetch
org.archive.crawler.processor
org.archive.crawler.processor.recrawl
org.archive.crawler.writer

Uses of Processor in org.archive.crawler.datamodel

Methods in org.archive.crawler.datamodel that return Processor
`Processor`	`CrawlURI.nextProcessor()` Get the next processor to process this URI.

Methods in org.archive.crawler.datamodel with parameters of type Processor
`void`	`CrawlURI.setNextProcessor(Processor processor)` Set the next processor to process this URI.
`void`	`CrawlURI.skipToProcessor(ProcessorChain processorChain, Processor processor)` Set which processor should process this URI next, instead of the default next processor.
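
The method summaries above suggest a traversal pattern in which each URI carries a pointer to the processor that should handle it next. The following is a minimal, self-contained sketch of that pattern. `SketchUri`, `SketchProcessor`, and `ChainSketch` are hypothetical stand-ins written for this illustration; they are not the Heritrix classes, and the real `CrawlURI` and `Processor` may behave differently in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a URI that remembers which processor runs next.
class SketchUri {
    private SketchProcessor next;
    final List<String> visited = new ArrayList<>();

    SketchProcessor nextProcessor() { return next; }
    void setNextProcessor(SketchProcessor p) { next = p; }
}

// Hypothetical stand-in for a processor with a default successor.
class SketchProcessor {
    private final String name;
    private SketchProcessor defaultNext;
    private SketchProcessor skipTarget; // optional override, null when unused

    SketchProcessor(String name) { this.name = name; }

    void setDefaultNextProcessor(SketchProcessor p) { defaultNext = p; }
    void setSkipTarget(SketchProcessor p) { skipTarget = p; }

    void process(SketchUri uri) {
        // Assumption: the default successor is installed first,
        // and the processor may then override it.
        uri.setNextProcessor(defaultNext);
        uri.visited.add(name);
        if (skipTarget != null) {
            uri.setNextProcessor(skipTarget);
        }
    }
}

class ChainSketch {
    // Runs a three-step chain and returns the names of the processors visited.
    static List<String> run(boolean skipExtract) {
        SketchProcessor fetch = new SketchProcessor("fetch");
        SketchProcessor extract = new SketchProcessor("extract");
        SketchProcessor write = new SketchProcessor("write");
        fetch.setDefaultNextProcessor(extract);
        extract.setDefaultNextProcessor(write);
        if (skipExtract) {
            fetch.setSkipTarget(write);
        }

        SketchUri uri = new SketchUri();
        uri.setNextProcessor(fetch);
        // Keep handing the URI to whichever processor it points at.
        while (uri.nextProcessor() != null) {
            uri.nextProcessor().process(uri);
        }
        return uri.visited;
    }
}
```

With these stand-ins, `ChainSketch.run(false)` visits all three processors in order, while `ChainSketch.run(true)` jumps from the first directly to the last.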

Uses of Processor in org.archive.crawler.extractor

Subclasses of Processor in org.archive.crawler.extractor
`class`	`AggressiveExtractorHTML` Extended version of ExtractorHTML with more aggressive JavaScript link extraction, where JavaScript code is parsed first with the general HTML tags regexp, and then by the JavaScript speculative link regexp.
`class`	`ChangeEvaluator` This processor compares the CrawlURI's current `content digest` with the one from a previous crawl.
`class`	`Extractor` Convenience shared superclass for Extractor Processors.
`class`	`ExtractorCSS` This extractor parses URIs from CSS files.
`class`	`ExtractorDOC` This class allows the caller to extract href-style links from Word 97-format Word documents.
`class`	`ExtractorHTML` Basic link-extraction, from an HTML content-body, using regular expressions.
`class`	`ExtractorHTTP` Extracts URIs from HTTP response headers.
`class`	`ExtractorImpliedURI` An extractor for finding 'implied' URIs inside other URIs.
`class`	`ExtractorJS` Processes JavaScript files for strings that are likely to be crawlable URIs.
`class`	`ExtractorPDF` Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs.
`class`	`ExtractorSWF` Processes SWF (Flash/Shockwave) files for strings that are likely to be crawlable URIs.
`class`	`ExtractorUniversal` A last ditch extractor that will look at the raw byte code and try to extract anything that looks like a link.
`class`	`ExtractorURI` An extractor for finding URIs inside other URIs.
`class`	`ExtractorXML` A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents).
`class`	`HTTPContentDigest` A processor for calculating custom HTTP content digests in place of the default (if any) computed by the HTTP fetcher processors.
`class`	`JerichoExtractorHTML` Improved link-extraction from an HTML content-body using jericho-html parser.
`class`	`TrapSuppressExtractor` Pseudo-extractor that suppresses link-extraction of likely trap pages, by noticing when content's digest is identical to that of its 'via'.
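
Several of the extractors listed above are described as finding links with regular expressions. As a rough illustration of the idea only, the sketch below pulls quoted `href` and `src` attribute values out of an HTML string. The class name `RegexLinkSketch` and its pattern are assumptions made for this example; the pattern is far simpler than anything a production extractor such as ExtractorHTML would use, and it is not taken from Heritrix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLinkSketch {
    // Illustrative pattern: a case-insensitive href or src attribute
    // whose value is wrapped in matching single or double quotes.
    private static final Pattern LINK = Pattern.compile(
            "(?i)\\b(?:href|src)\\s*=\\s*([\"'])(.*?)\\1");

    // Returns the attribute values in the order they appear.
    static List<String> extractLinks(CharSequence html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(2)); // group 2 is the value between the quotes
        }
        return links;
    }
}
```

A regex approach like this is tolerant of malformed markup, which is one reason it suits crawling, but it will miss unquoted attribute values and links assembled by scripts.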

Uses of Processor in org.archive.crawler.fetcher

Subclasses of Processor in org.archive.crawler.fetcher
`class`	`FetchDNS` Processor to resolve 'dns:' URIs.
`class`	`FetchFTP` Fetches documents and directory listings using FTP.
`class`	`FetchHTTP` HTTP fetcher that uses Apache Jakarta Commons HttpClient library.

Uses of Processor in org.archive.crawler.framework

Subclasses of Processor in org.archive.crawler.framework
`class`	`Scoper` Base class for Scopers.
`class`	`WriterPoolProcessor` Abstract implementation of a file pool processor.

Methods in org.archive.crawler.framework that return Processor
`Processor`	`Processor.getDefaultNextProcessor(CrawlURI curi)` Returns the next processor for the given CrawlURI in the processor chain.
`Processor`	`ProcessorChain.getFirstProcessor()` Get the first processor in the chain.
`Processor`	`ProcessorChain.getProcessor(java.lang.Class classType)` Get the first processor that is of class `classType` or a subclass of it.
`Processor`	`Processor.spawn(int serialNum)`
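
The description of `ProcessorChain.getProcessor(java.lang.Class classType)` implies a lookup that accepts both the given class and its subclasses. A minimal sketch of that kind of lookup follows. `LookupSketch` and its nested types are hypothetical, and returning `null` when nothing matches is an assumption of this example rather than documented behavior.

```java
import java.util.List;

class LookupSketch {
    // Hypothetical type hierarchy standing in for processors.
    static class Base {}
    static class Fetcher extends Base {}
    static class HttpFetcher extends Fetcher {}
    static class Writer extends Base {}

    // Returns the first element that is an instance of the given type
    // (the type itself or any subclass), or null if none matches.
    static Base firstOfType(List<Base> chain, Class<?> type) {
        for (Base candidate : chain) {
            if (type.isInstance(candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```

Because `Class.isInstance` is used rather than an exact class comparison, asking for `Fetcher` in a chain that holds an `HttpFetcher` before a plain `Fetcher` returns the `HttpFetcher`.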

Methods in org.archive.crawler.framework with parameters of type Processor
`void`	`Processor.setDefaultNextProcessor(Processor nextProcessor)` Set the default next processor in the chain.

Uses of Processor in org.archive.crawler.postprocessor

Subclasses of Processor in org.archive.crawler.postprocessor
`class`	`AcceptRevisitProcessor` Set a URI to be revisited by the ARFrontier.
`class`	`ContentBasedWaitEvaluator` A WaitEvaluator that compares the CrawlURI's content type to a configurable regular expression.
`class`	`CrawlStateUpdater` A step, late in the processing of a CrawlURI, for updating the per-host information that may have been affected by the fetch.
`class`	`FrontierScheduler` 'Schedule' with the Frontier CandidateURIs being carried by the passed CrawlURI.
`class`	`ImageWaitEvaluator` A specialized ContentBasedWaitEvaluator.
`class`	`LinksScoper` Determine which extracted links are within scope.
`class`	`LowDiskPauseProcessor` Processor module which uses 'df -k', where available and with the expected output format (on Linux), to monitor available disk space and pause the crawl if free space on monitored filesystems falls below certain thresholds.
`class`	`RejectRevisitProcessor` Set a URI to not be revisited by the ARFrontier.
`class`	`SupplementaryLinksScoper` Run CandidateURI links carried in the passed CrawlURI through a filter and 'handle' rejections.
`class`	`TextWaitEvaluator` A specialized ContentBasedWaitEvaluator.
`class`	`WaitEvaluator` A processor that determines when a URI should be revisited next.

Uses of Processor in org.archive.crawler.prefetch

Subclasses of Processor in org.archive.crawler.prefetch
`class`	`PreconditionEnforcer` Ensures the preconditions for a fetch -- such as DNS lookup or acquiring and respecting a robots.txt policy -- are satisfied before a URI is passed to subsequent stages.
`class`	`Preselector` If set to recheck the crawl's scope, gives a yes/no on whether a CrawlURI should be processed at all.
`class`	`QuotaEnforcer` A simple quota enforcer.
`class`	`RuntimeLimitEnforcer` A processor to enforce runtime limits on crawls.

Uses of Processor in org.archive.crawler.processor

Subclasses of Processor in org.archive.crawler.processor
`class`	`BeanShellProcessor` A processor which runs a BeanShell script on the CrawlURI.
`class`	`CrawlMapper` A simple crawl splitter/mapper, dividing up CandidateURIs/CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers).
`class`	`HashCrawlMapper` Maps URIs to one of N crawler names by applying a hash to the URI's (possibly-transformed) classKey.
`class`	`LexicalCrawlMapper` A simple crawl splitter/mapper, dividing up CandidateURIs/CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers).
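
The HashCrawlMapper description above amounts to hashing a key and reducing the result to one of N buckets. The sketch below shows that idea in isolation. It uses `String.hashCode()` purely for convenience, which is a substitution for whatever hash function HashCrawlMapper actually applies, and the choice of decimal bucket numbers as crawler names is likewise an assumption of this example.

```java
class HashMapperSketch {
    // Maps a key to one of crawlerCount buckets, named "0" .. "N-1".
    static String mapKey(String classKey, int crawlerCount) {
        if (crawlerCount <= 0) {
            throw new IllegalArgumentException("crawlerCount must be positive");
        }
        // floorMod keeps the bucket non-negative even when hashCode() is negative.
        int bucket = Math.floorMod(classKey.hashCode(), crawlerCount);
        return Integer.toString(bucket);
    }
}
```

The property that matters is determinism: the same key always lands on the same crawler, so every URI that shares a key is handled in one place.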

Uses of Processor in org.archive.crawler.processor.recrawl

Subclasses of Processor in org.archive.crawler.processor.recrawl
`class`	`FetchHistoryProcessor` Maintain a history of fetch information inside the CrawlURI's attributes.
`class`	`PersistLoadProcessor` Load CrawlURI attributes recorded by an earlier fetch from persistent storage, for consultation during a recrawl.
`class`	`PersistLogProcessor` Log CrawlURI attributes from latest fetch for consultation by a later recrawl.
`class`	`PersistOnlineProcessor` Common superclass for persisting Processors which directly store/load to persistence (as opposed to logging for batch load later).
`class`	`PersistProcessor` Superclass for Processors which utilize BDB-JE for URI state (including most notably history) persistence.
`class`	`PersistStoreProcessor` Store CrawlURI attributes from latest fetch to persistent storage for consultation by a later recrawl.
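
To make the idea of a per-URI fetch history concrete, here is a small sketch that keeps only the most recent content digests and reports whether the newest one matches the previous fetch. `FetchHistorySketch`, its record format (a plain digest string), and the bounded-length policy are all assumptions made for illustration; they do not describe how FetchHistoryProcessor stores its data.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class FetchHistorySketch {
    private final int maxEntries;
    private final Deque<String> digests = new ArrayDeque<>(); // newest first

    FetchHistorySketch(int maxEntries) {
        if (maxEntries < 1) {
            throw new IllegalArgumentException("maxEntries must be at least 1");
        }
        this.maxEntries = maxEntries;
    }

    // Records a digest and returns true when it equals the previous fetch's digest.
    boolean record(String digest) {
        boolean unchanged = digest.equals(digests.peekFirst());
        digests.addFirst(digest);
        // Drop the oldest entries once the history exceeds its bound.
        while (digests.size() > maxEntries) {
            digests.removeLast();
        }
        return unchanged;
    }

    int size() { return digests.size(); }
}
```

A recrawl could use a signal like the return value of `record` to decide that a page has not changed since the last visit.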

Uses of Processor in org.archive.crawler.writer

Subclasses of Processor in org.archive.crawler.writer
`class`	`ARCWriterProcessor` Processor module for writing the results of successful fetches (and perhaps someday, certain kinds of network failures) to the Internet Archive ARC file format.
`class`	`Kw3WriterProcessor` Processor module that writes the results of successful fetches to files on disk.
`class`	`MirrorWriterProcessor` Processor module that writes the results of successful fetches to files on disk.
`class`	`WARCWriterProcessor` WARCWriterProcessor.

Copyright © 2003-2011 Internet Archive. All Rights Reserved.