Packages that use Extractor | |
---|---|
org.archive.crawler.extractor |
Uses of Extractor in org.archive.crawler.extractor |
---|
Subclasses of Extractor in org.archive.crawler.extractor | |
---|---|
class | AggressiveExtractorHTML: Extended version of ExtractorHTML with more aggressive JavaScript link extraction, where JavaScript code is parsed first with the general HTML tags regexp, and then with the speculative JavaScript link regexp. |
class | ExtractorCSS: Parses URIs from CSS files. |
class | ExtractorDOC: Allows the caller to extract href-style links from Word 97-format Word documents. |
class | ExtractorHTML: Basic link extraction from an HTML content body, using regular expressions. |
class | ExtractorImpliedURI: An extractor for finding 'implied' URIs inside other URIs. |
class | ExtractorJS: Processes JavaScript files for strings that are likely to be crawlable URIs. |
class | ExtractorPDF: Allows the caller to process a CrawlURI representing a PDF in order to extract URIs. |
class | ExtractorSWF: Processes SWF (Flash/Shockwave) files for strings that are likely to be crawlable URIs. |
class | ExtractorUniversal: A last-ditch extractor that looks at the raw bytes and tries to extract anything that looks like a link. |
class | ExtractorURI: An extractor for finding URIs inside other URIs. |
class | ExtractorXML: A simple extractor that finds HTTP URIs inside XML/RSS files, in attribute values and in simple elements (those whose only contents are whitespace + HTTP URI + whitespace). |
class | JerichoExtractorHTML: Improved link extraction from an HTML content body, using the Jericho HTML parser. |
class | TrapSuppressExtractor: Pseudo-extractor that suppresses link extraction on likely trap pages by noticing when the content's digest is identical to that of its 'via'. |
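
Two of the techniques named in the descriptions above can be illustrated briefly: regular-expression link extraction (as described for ExtractorHTML) and digest comparison against the 'via' page (as described for TrapSuppressExtractor). The following is a minimal, standalone sketch and does not use the Heritrix API. The class `ExtractorSketch`, its method names, the regular expression, and the choice of SHA-1 are all assumptions made for illustration; the real extractors operate on CrawlURI objects and are considerably more thorough.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of org.archive.crawler.extractor.
final class ExtractorSketch {

    // Assumed pattern: matches quoted href/src attribute values, case-insensitively.
    // Unquoted values and other link-bearing attributes are deliberately ignored.
    private static final Pattern LINK_ATTR = Pattern.compile(
            "(?i)\\b(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    // Returns every non-empty href/src value found in the content, in document order.
    static List<String> extractLinks(CharSequence html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK_ATTR.matcher(html);
        while (m.find()) {
            // Group 1 holds a double-quoted value, group 2 a single-quoted one.
            String value = m.group(1) != null ? m.group(1) : m.group(2);
            if (!value.isEmpty()) {
                links.add(value);
            }
        }
        return links;
    }

    // Computes a content digest; SHA-1 is an assumption chosen for the example.
    static byte[] sha1(String content) {
        try {
            return MessageDigest.getInstance("SHA-1")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 unavailable", e);
        }
    }

    // A page is treated as a likely trap when its digest equals the digest
    // of the page it was discovered from (its 'via').
    static boolean looksLikeTrap(byte[] contentDigest, byte[] viaDigest) {
        return contentDigest != null
                && viaDigest != null
                && Arrays.equals(contentDigest, viaDigest);
    }

    public static void main(String[] args) {
        String page = "<a href=\"http://example.com/a\">a</a><img src='/b.png'>";
        String via = page; // identical content, as a calendar-style trap might produce

        if (looksLikeTrap(sha1(page), sha1(via))) {
            System.out.println("Likely trap; skipping link extraction");
        } else {
            System.out.println(extractLinks(page));
        }
    }
}
```

In this sketch the digest check runs before extraction, so a page whose content matches its 'via' yields no links at all. Whether a real crawl should order its processors this way depends on the crawl configuration.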