Uses of Processor in org.archive.crawler.datamodel |
---|
Methods in org.archive.crawler.datamodel that return Processor | |
---|---|
Processor |
CrawlURI.nextProcessor()
Get the next processor to process this URI. |
Methods in org.archive.crawler.datamodel with parameters of type Processor | |
---|---|
void |
CrawlURI.setNextProcessor(Processor processor)
Set the next processor to process this URI. |
void |
CrawlURI.skipToProcessor(ProcessorChain processorChain, Processor processor)
Set which processor should process this URI next, overriding the default next processor. |
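
The three `CrawlURI` methods above suggest that each URI carries its own pointer to the processor that should handle it next. The following is a minimal sketch of that idea, not Heritrix code: `Step` and `Uri` are hypothetical stand-ins for `Processor` and `CrawlURI`, the stand-in `skipToProcessor` omits the `ProcessorChain` argument, and the driving loop in `run` is an assumption about how a caller might use the pointer.

```java
import java.util.ArrayList;
import java.util.List;

public class ChainSketch {
    /** Hypothetical stand-in for a processor: a name plus a default successor. */
    static class Step {
        final String name;
        Step defaultNext; // plays the role of the default next processor
        Step(String name) { this.name = name; }
    }

    /** Hypothetical stand-in for a URI that carries its own "next step" pointer. */
    static class Uri {
        private Step next;
        Step nextProcessor() { return next; }
        void setNextProcessor(Step step) { next = step; }
        // Simplified: the method listed above also takes a ProcessorChain.
        void skipToProcessor(Step step) { next = step; }
    }

    /**
     * Walks the chain starting at first. When the step skipFrom has run,
     * the URI is redirected to skipTo instead of the default successor.
     * Returns the names of the steps that actually ran, in order.
     */
    static List<String> run(Uri uri, Step first, Step skipFrom, Step skipTo) {
        List<String> visited = new ArrayList<>();
        uri.setNextProcessor(first);
        while (uri.nextProcessor() != null) {
            Step current = uri.nextProcessor();
            // Advance to the default successor before "processing" the step.
            uri.setNextProcessor(current.defaultNext);
            visited.add(current.name);
            if (current == skipFrom) {
                uri.skipToProcessor(skipTo); // override the default successor
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Step preselect = new Step("preselect");
        Step fetch = new Step("fetch");
        Step extract = new Step("extract");
        Step archive = new Step("archive");
        preselect.defaultNext = fetch;
        fetch.defaultNext = extract;
        extract.defaultNext = archive;

        // Skip straight from "preselect" to "archive".
        System.out.println(run(new Uri(), preselect, preselect, archive));
    }
}
```

Keeping the pointer on the URI rather than on the chain lets one step redirect a single URI without affecting how other URIs move through the same processors.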
Uses of Processor in org.archive.crawler.extractor |
---|
Subclasses of Processor in org.archive.crawler.extractor | |
---|---|
class |
AggressiveExtractorHTML
Extended version of ExtractorHTML with more aggressive JavaScript link extraction, where JavaScript code is parsed first with the general HTML tags regexp, and then by the JavaScript speculative link regexp. |
class |
ChangeEvaluator
This processor compares the CrawlURI's current content digest with the one from a previous crawl. |
class |
Extractor
Convenience shared superclass for Extractor Processors. |
class |
ExtractorCSS
This extractor parses URIs from CSS files. |
class |
ExtractorDOC
This class allows the caller to extract href-style links from Word 97-format Word documents. |
class |
ExtractorHTML
Basic link-extraction, from an HTML content-body, using regular expressions. |
class |
ExtractorHTTP
Extracts URIs from HTTP response headers. |
class |
ExtractorImpliedURI
An extractor for finding 'implied' URIs inside other URIs. |
class |
ExtractorJS
Processes JavaScript files for strings that are likely to be crawlable URIs. |
class |
ExtractorPDF
Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs. |
class |
ExtractorSWF
Processes SWF (Flash/Shockwave) files for strings that are likely to be crawlable URIs. |
class |
ExtractorUniversal
A last-ditch extractor that will look at the raw byte code and try to extract anything that looks like a link. |
class |
ExtractorURI
An extractor for finding URIs inside other URIs. |
class |
ExtractorXML
A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents). |
class |
HTTPContentDigest
A processor for calculating custom HTTP content digests in place of the default (if any) computed by the HTTP fetcher processors. |
class |
JerichoExtractorHTML
Improved link-extraction from an HTML content-body using the jericho-html parser. |
class |
TrapSuppressExtractor
Pseudo-extractor that suppresses link-extraction of likely trap pages, by noticing when content's digest is identical to that of its 'via'. |
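
Both ChangeEvaluator and TrapSuppressExtractor are described above as comparing content digests. A minimal sketch of that comparison follows, using only the JDK. It is not the Heritrix implementation: the class and method names are hypothetical, SHA-1 is an assumed digest algorithm, and the real processors read digests already stored on the CrawlURI rather than hashing strings.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestCompareSketch {
    /** Computes a SHA-1 digest of the given content (algorithm is an assumption). */
    static byte[] digest(String content) {
        try {
            return MessageDigest.getInstance("SHA-1")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    /** True when two digests are byte-for-byte identical. */
    static boolean sameContent(byte[] a, byte[] b) {
        return MessageDigest.isEqual(a, b);
    }

    /**
     * Hypothetical trap check: skip link extraction when a page's content
     * is identical to the content of the page that led to it (its 'via').
     */
    static boolean shouldExtractLinks(String pageContent, String viaContent) {
        return !sameContent(digest(pageContent), digest(viaContent));
    }

    public static void main(String[] args) {
        String via = "<html>calendar page</html>";
        System.out.println(shouldExtractLinks("<html>calendar page</html>", via));
        System.out.println(shouldExtractLinks("<html>article</html>", via));
    }
}
```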
Uses of Processor in org.archive.crawler.fetcher |
---|
Subclasses of Processor in org.archive.crawler.fetcher | |
---|---|
class |
FetchDNS
Processor to resolve 'dns:' URIs. |
class |
FetchFTP
Fetches documents and directory listings using FTP. |
class |
FetchHTTP
HTTP fetcher that uses Apache Jakarta Commons HttpClient library. |
Uses of Processor in org.archive.crawler.framework |
---|
Subclasses of Processor in org.archive.crawler.framework | |
---|---|
class |
Scoper
Base class for Scopers. |
class |
WriterPoolProcessor
Abstract implementation of a file pool processor. |
Methods in org.archive.crawler.framework that return Processor | |
---|---|
Processor |
Processor.getDefaultNextProcessor(CrawlURI curi)
Returns the next processor for the given CrawlURI in the processor chain. |
Processor |
ProcessorChain.getFirstProcessor()
Get the first processor in the chain. |
Processor |
ProcessorChain.getProcessor(java.lang.Class classType)
Get the first processor that is of class classType or a subclass of it. |
Processor |
Processor.spawn(int serialNum)
|
Methods in org.archive.crawler.framework with parameters of type Processor | |
---|---|
void |
Processor.setDefaultNextProcessor(Processor nextProcessor)
Set the default next processor in the chain. |
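
The lookup described for `ProcessorChain.getProcessor(java.lang.Class classType)` can be illustrated by walking a linked chain and testing each element with `Class.isInstance`. This is a sketch under stated assumptions, not the library's code: `Step` and its subclasses are hypothetical stand-ins for `Processor` subclasses, `link` and `findFirst` are illustrative helpers, and returning `null` when nothing matches is an assumption.

```java
public class LookupSketch {
    /** Hypothetical stand-in for a processor with a default successor. */
    static class Step {
        Step defaultNext;
    }
    static class Fetcher extends Step {}
    static class HttpFetcher extends Fetcher {}
    static class Writer extends Step {}

    /** Links the given steps in order and returns the first one (or null). */
    static Step link(Step... steps) {
        for (int i = 0; i + 1 < steps.length; i++) {
            steps[i].defaultNext = steps[i + 1];
        }
        return steps.length == 0 ? null : steps[0];
    }

    /**
     * Returns the first step that is an instance of classType or of one of
     * its subclasses, or null if no step matches (an assumption here).
     */
    static Step findFirst(Step first, Class<?> classType) {
        for (Step s = first; s != null; s = s.defaultNext) {
            if (classType.isInstance(s)) {
                return s;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Step http = new HttpFetcher();
        Step first = link(new Step(), http, new Writer());
        // Asking for the superclass Fetcher finds the HttpFetcher instance.
        System.out.println(findFirst(first, Fetcher.class) == http);
    }
}
```

Matching on `isInstance` rather than exact class equality is what lets a caller ask for a base type and receive whichever subclass is configured in the chain.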
Uses of Processor in org.archive.crawler.postprocessor |
---|
Subclasses of Processor in org.archive.crawler.postprocessor | |
---|---|
class |
AcceptRevisitProcessor
Set a URI to be revisited by the ARFrontier. |
class |
ContentBasedWaitEvaluator
A WaitEvaluator that compares the CrawlURI's content type to a configurable regular expression. |
class |
CrawlStateUpdater
A step, late in the processing of a CrawlURI, for updating the per-host information that may have been affected by the fetch. |
class |
FrontierScheduler
'Schedule' with the Frontier CandidateURIs being carried by the passed CrawlURI. |
class |
ImageWaitEvaluator
A specialized ContentBasedWaitEvaluator. |
class |
LinksScoper
Determine which extracted links are within scope. |
class |
LowDiskPauseProcessor
Processor module which uses 'df -k', where available and with the expected output format (on Linux), to monitor available disk space and pause the crawl if free space on monitored filesystems falls below certain thresholds. |
class |
RejectRevisitProcessor
Set a URI to not be revisited by the ARFrontier. |
class |
SupplementaryLinksScoper
Run CandidateURI links carried in the passed CrawlURI through a filter and 'handle' rejections. |
class |
TextWaitEvaluator
A specialized ContentBasedWaitEvaluator. |
class |
WaitEvaluator
A processor that determines when a URI should be revisited next. |
Uses of Processor in org.archive.crawler.prefetch |
---|
Subclasses of Processor in org.archive.crawler.prefetch | |
---|---|
class |
PreconditionEnforcer
Ensures the preconditions for a fetch -- such as DNS lookup or acquiring and respecting a robots.txt policy -- are satisfied before a URI is passed to subsequent stages. |
class |
Preselector
If set to recheck the crawl's scope, gives a yes/no on whether a CrawlURI should be processed at all. |
class |
QuotaEnforcer
A simple quota enforcer. |
class |
RuntimeLimitEnforcer
A processor to enforce runtime limits on crawls. |
Uses of Processor in org.archive.crawler.processor |
---|
Subclasses of Processor in org.archive.crawler.processor | |
---|---|
class |
BeanShellProcessor
A processor which runs a BeanShell script on the CrawlURI. |
class |
CrawlMapper
A simple crawl splitter/mapper, dividing up CandidateURIs/CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers). |
class |
HashCrawlMapper
Maps URIs to one of N crawler names by applying a hash to the URI's (possibly-transformed) classKey. |
class |
LexicalCrawlMapper
A simple crawl splitter/mapper, dividing up CandidateURIs/CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers). |
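
The HashCrawlMapper entry above describes mapping a URI's classKey to one of N crawler names by hashing. A minimal sketch of that technique follows. It is an illustration only: `String.hashCode` is swapped in for whatever hash the real class applies, the crawler names are made up, and `mapToCrawler` is a hypothetical helper.

```java
public class HashMapperSketch {
    /**
     * Picks one crawler name for the given classKey. The same key always
     * yields the same name, so every crawler can agree on who owns a URI.
     * String.hashCode is an assumed stand-in for the real hash function.
     */
    static String mapToCrawler(String classKey, String[] crawlerNames) {
        // floorMod keeps the bucket non-negative even for negative hash codes.
        int bucket = Math.floorMod(classKey.hashCode(), crawlerNames.length);
        return crawlerNames[bucket];
    }

    public static void main(String[] args) {
        String[] crawlers = {"crawler-0", "crawler-1", "crawler-2"};
        for (String key : new String[] {"example.com", "archive.org", "example.org"}) {
            System.out.println(key + " -> " + mapToCrawler(key, crawlers));
        }
    }
}
```

A URI whose mapped name differs from the local crawler's name would be the one diverted to a log file for import elsewhere, as the CrawlMapper description states.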
Uses of Processor in org.archive.crawler.processor.recrawl |
---|
Subclasses of Processor in org.archive.crawler.processor.recrawl | |
---|---|
class |
FetchHistoryProcessor
Maintain a history of fetch information inside the CrawlURI's attributes. |
class |
PersistLoadProcessor
Store CrawlURI attributes from latest fetch to persistent storage for consultation by a later recrawl. |
class |
PersistLogProcessor
Log CrawlURI attributes from latest fetch for consultation by a later recrawl. |
class |
PersistOnlineProcessor
Common superclass for persisting Processors which directly store/load to persistence (as opposed to logging for batch load later). |
class |
PersistProcessor
Superclass for Processors which utilize BDB-JE for URI state (including most notably history) persistence. |
class |
PersistStoreProcessor
Store CrawlURI attributes from latest fetch to persistent storage for consultation by a later recrawl. |
Uses of Processor in org.archive.crawler.writer |
---|
Subclasses of Processor in org.archive.crawler.writer | |
---|---|
class |
ARCWriterProcessor
Processor module for writing the results of successful fetches (and perhaps someday, certain kinds of network failures) to the Internet Archive ARC file format. |
class |
Kw3WriterProcessor
Processor module that writes the results of successful fetches to files on disk. |
class |
MirrorWriterProcessor
Processor module that writes the results of successful fetches to files on disk. |
class |
WARCWriterProcessor
WARCWriterProcessor. |