Package org.archive.crawler.extractor

Class Summary
AggressiveExtractorHTML Extended version of ExtractorHTML with more aggressive JavaScript link extraction: JavaScript code is parsed first with the general HTML tags regexp, and then with the speculative JavaScript link regexp.
ChangeEvaluator This processor compares the CrawlURI's current content digest with the one from a previous crawl.
CrawlUriSWFAction SWF action that handles discovered URIs.
CustomSWFTags Overrides the action tags that may hold URIs so that they use the CrawlUriSWFAction action.
Extractor Convenience shared superclass for Extractor Processors.
ExtractorCSS This extractor parses URIs from CSS files.
ExtractorDOC This class allows the caller to extract href-style links from Word 97-format Word documents.
ExtractorHTML Basic link-extraction, from an HTML content-body, using regular expressions.
ExtractorHTTP Extracts URIs from HTTP response headers.
ExtractorImpliedURI An extractor for finding 'implied' URIs inside other URIs.
ExtractorJS Processes JavaScript files for strings that are likely to be crawlable URIs.
ExtractorPDF Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs.
ExtractorSWF Processes SWF (Flash/Shockwave) files for strings that are likely to be crawlable URIs.
ExtractorTool Run named extractors against passed ARC file.
ExtractorUniversal A last-ditch extractor that will look at the raw byte code and try to extract anything that looks like a link.
ExtractorURI An extractor for finding URIs inside other URIs.
ExtractorXML A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents).
HTTPContentDigest A processor for calculating custom HTTP content digests in place of the default (if any) computed by the HTTP fetcher processors.
JerichoExtractorHTML Improved link-extraction from an HTML content-body using the jericho-html parser.
Link Link represents one discovered "edge" of the web graph: the source URI, the destination URI, and the type of reference (represented by the context in which it was found).
PDFParser Supports PDF parsing operations.
TrapSuppressExtractor Pseudo-extractor that suppresses link-extraction of likely trap pages, by noticing when content's digest is identical to that of its 'via'.
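
Illustrative sketch (not part of this package): the summaries above describe regular-expression link extraction from an HTML content-body (ExtractorHTML) and a Link as one edge made of a source URI, a destination URI, and the context in which it was found. The example below shows that idea in miniature. The names SketchLink and RegexLinkSketch, the pattern, and the "tag/@attribute" context strings are assumptions made for illustration; they are not the package's actual classes, patterns, or context format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for one discovered edge; NOT the package's Link class. */
final class SketchLink {
    final String source;       // URI of the document the reference was found in
    final String destination;  // URI the reference points to
    final String context;      // where it was found, e.g. "a/@href" (assumed format)

    SketchLink(String source, String destination, String context) {
        this.source = source;
        this.destination = destination;
        this.context = context;
    }
}

/** Hypothetical regex-based extractor; a simplification, not ExtractorHTML itself. */
final class RegexLinkSketch {
    // Captures: (1) tag name, (2) attribute name, (3) double-quoted or (4) single-quoted value.
    // Limitation: only the first href/src attribute of each tag is picked up.
    private static final Pattern TAG_ATTR = Pattern.compile(
            "(?is)<(\\w+)\\b[^>]*?\\b(href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    static List<SketchLink> extract(String sourceUri, CharSequence html) {
        List<SketchLink> links = new ArrayList<>();
        Matcher m = TAG_ATTR.matcher(html);
        while (m.find()) {
            String value = m.group(3) != null ? m.group(3) : m.group(4);
            if (value.isEmpty()) {
                continue; // skip empty attribute values
            }
            String context = m.group(1).toLowerCase() + "/@" + m.group(2).toLowerCase();
            links.add(new SketchLink(sourceUri, value, context));
        }
        return links;
    }
}
```

Calling RegexLinkSketch.extract with a source URI and an HTML string returns one SketchLink per matched tag. Destination values are returned as written in the markup; resolving relative URIs against the source is left out of this sketch.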
 



Copyright © 2003-2011 Internet Archive. All Rights Reserved.