Class Summary |
AggressiveExtractorHTML |
Extended version of ExtractorHTML with more aggressive JavaScript link
extraction, where JavaScript code is parsed first with the general HTML tags
regexp, and then by the JavaScript speculative link regexp. |
ChangeEvaluator |
This processor compares the CrawlURI's current
content digest
with the one from a previous crawl. |
CrawlUriSWFAction |
SWF action that handles discovered URIs. |
CustomSWFTags |
Overrides the action tags that may hold URIs so that they use the
CrawlUriSWFAction action. |
Extractor |
Convenience shared superclass for Extractor Processors. |
ExtractorCSS |
This extractor parses URIs from CSS files. |
ExtractorDOC |
This class allows the caller to extract href-style links from Word 97-format Word documents. |
ExtractorHTML |
Basic link-extraction, from an HTML content-body,
using regular expressions. |
ExtractorHTTP |
Extracts URIs from HTTP response headers. |
ExtractorImpliedURI |
An extractor for finding 'implied' URIs inside other URIs. |
ExtractorJS |
Processes JavaScript files for strings that are likely to be
crawlable URIs. |
ExtractorPDF |
Allows the caller to process a CrawlURI representing a PDF
for the purpose of extracting URIs. |
ExtractorSWF |
Processes SWF (Flash/Shockwave) files for strings that are likely to be
crawlable URIs. |
ExtractorTool |
Runs the named extractors against the passed ARC file. |
ExtractorUniversal |
A last-ditch extractor that will look at the raw byte code and try to extract
anything that looks like a link. |
ExtractorURI |
An extractor for finding URIs inside other URIs. |
ExtractorXML |
A simple extractor that finds HTTP URIs inside XML/RSS files,
inside attribute values and simple elements (those with only
whitespace + HTTP URI + whitespace as contents). |
HTTPContentDigest |
A processor for calculating custom HTTP content digests in place of the
default (if any) computed by the HTTP fetcher processors. |
JerichoExtractorHTML |
Improved link-extraction from an HTML content-body using the jericho-html parser. |
Link |
Link represents one discovered "edge" of the web graph: the source
URI, the destination URI, and the type of reference (represented by the
context in which it was found). |
PDFParser |
Supports PDF parsing operations. |
TrapSuppressExtractor |
Pseudo-extractor that suppresses link-extraction of likely trap pages,
by noticing when content's digest is identical to that of its 'via'. |
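
As a rough illustration of the ideas summarized above for ExtractorHTML (regular-expression link extraction) and Link (source URI, destination URI, and context), the following is a minimal, self-contained sketch. It is not the package's actual code: the class `LinkSketch`, the nested `Edge` type, the `extract` method, and the simplified pattern are all hypothetical names and choices made for this example, and the pattern only recognizes quoted `href` and `src` attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; not part of the documented package. */
public class LinkSketch {

    /** Illustrative stand-in for one discovered edge of the web graph. */
    public static final class Edge {
        public final String source;       // URI of the document that was scanned
        public final String destination;  // URI string found in that document
        public final String context;      // attribute name the URI was found in

        public Edge(String source, String destination, String context) {
            this.source = source;
            this.destination = destination;
            this.context = context;
        }
    }

    // Assumption: a deliberately simplified pattern that matches only
    // quoted href/src attribute values, case-insensitively.
    private static final Pattern ATTR = Pattern.compile(
            "(?i)\\b(href|src)\\s*=\\s*[\"']([^\"']*)[\"']");

    /** Scans the HTML text and returns one Edge per matched attribute. */
    public static List<Edge> extract(String sourceUri, CharSequence html) {
        List<Edge> edges = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            String context = m.group(1).toLowerCase(Locale.ROOT);
            edges.add(new Edge(sourceUri, m.group(2), context));
        }
        return edges;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/a\">x</a><img SRC='b.png'>";
        for (Edge e : extract("http://example.com/", html)) {
            System.out.println(e.source + " -> " + e.destination
                    + " [" + e.context + "]");
        }
    }
}
```

The sketch leaves destinations unresolved (relative URIs are returned as found). A regular expression keeps the example short, at the cost of missing unquoted attributes and links embedded in scripts, which is the kind of gap the more aggressive and parser-based extractors listed above are described as addressing.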