org.archive.crawler.extractor (Heritrix 1.15.5-201106092337)

Package org.archive.crawler.extractor

Class Summary
AggressiveExtractorHTML	Extended version of ExtractorHTML with more aggressive JavaScript link extraction: JavaScript code is parsed first with the general HTML tags regexp, and then with the speculative JavaScript link regexp.
ChangeEvaluator	This processor compares the CrawlURI's current content digest with the one from a previous crawl.
CrawlUriSWFAction	SWF action that handles discovered URIs.
CustomSWFTags	Overrides the action tags that may hold URIs so that they use CrawlUriSWFAction.
Extractor	Convenience shared superclass for Extractor Processors.
ExtractorCSS	Extracts URIs from CSS files.
ExtractorDOC	This class allows the caller to extract href-style links from Word 97-format Word documents.
ExtractorHTML	Basic link-extraction, from an HTML content-body, using regular expressions.
ExtractorHTTP	Extracts URIs from HTTP response headers.
ExtractorImpliedURI	An extractor for finding 'implied' URIs inside other URIs.
ExtractorJS	Processes JavaScript files for strings that are likely to be crawlable URIs.
ExtractorPDF	Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs.
ExtractorSWF	Processes SWF (Flash/Shockwave) files for strings that are likely to be crawlable URIs.
ExtractorTool	Run named extractors against passed ARC file.
ExtractorUniversal	A last-ditch extractor that looks at the raw bytes and tries to extract anything that looks like a link.
ExtractorURI	An extractor for finding URIs inside other URIs.
ExtractorXML	A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents).
HTTPContentDigest	A processor for calculating custom HTTP content digests in place of the default (if any) computed by the HTTP fetcher processors.
JerichoExtractorHTML	Improved link-extraction from an HTML content-body using jericho-html parser.
Link	Link represents one discovered "edge" of the web graph: the source URI, the destination URI, and the type of reference (represented by the context in which it was found).
PDFParser	Supports PDF parsing operations.
TrapSuppressExtractor	Pseudo-extractor that suppresses link-extraction of likely trap pages, by noticing when content's digest is identical to that of its 'via'.
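
Example (illustrative only). The summaries above describe regular-expression link extraction from an HTML content-body (ExtractorHTML) and a Link as an edge made of source URI, destination URI, and context. The following is a minimal, standalone sketch of that idea and does not use or reproduce the Heritrix API. The names LinkSketch, Edge, and extract, the regular expression, and the "element/@attribute" context format are all assumptions made for illustration. It only looks at the first href or src attribute of each tag.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of regex-based link extraction; not part of Heritrix. */
final class LinkSketch {

    /** One discovered edge: source URI, destination URI, and context (hypothetical type). */
    static final class Edge {
        final String source;
        final String destination;
        final String context;

        Edge(String source, String destination, String context) {
            this.source = source;
            this.destination = destination;
            this.context = context;
        }

        @Override
        public String toString() {
            return source + " -> " + destination + " (" + context + ")";
        }
    }

    // Group 1: element name, group 2: attribute name (href or src),
    // group 3: opening quote character, group 4: attribute value.
    private static final Pattern LINK_ATTR = Pattern.compile(
            "<\\s*([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*?\\s(href|src)\\s*=\\s*([\"'])(.*?)\\3",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns one Edge per quoted href/src value found in the given HTML text. */
    static List<Edge> extract(String sourceUri, String html) {
        List<Edge> edges = new ArrayList<>();
        URI base = URI.create(sourceUri);
        Matcher m = LINK_ATTR.matcher(html);
        while (m.find()) {
            String element = m.group(1).toLowerCase(Locale.ROOT);
            String attribute = m.group(2).toLowerCase(Locale.ROOT);
            try {
                // Resolve relative references against the source URI.
                String destination = base.resolve(m.group(4).trim()).toString();
                edges.add(new Edge(sourceUri, destination, element + "/@" + attribute));
            } catch (IllegalArgumentException e) {
                // Value is not a syntactically valid URI reference; skip it in this sketch.
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/a.html\">A</a> <IMG SRC='b.png'>";
        for (Edge edge : extract("http://example.org/dir/page.html", html)) {
            System.out.println(edge);
        }
    }
}
```

A real extractor has to cope with unquoted attribute values, several link-bearing attributes per tag, and embedded scripts and styles, none of which this sketch attempts.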
