RegexpHTMLLinkExtractor (Heritrix 1.15.5-201106092337)

org.archive.extractor
Class RegexpHTMLLinkExtractor

java.lang.Object
  org.archive.extractor.CharSequenceLinkExtractor
      org.archive.extractor.RegexpHTMLLinkExtractor

All Implemented Interfaces: java.util.Iterator, LinkExtractor

public class RegexpHTMLLinkExtractor
extends CharSequenceLinkExtractor

Basic link extraction from an HTML content body, using regular expressions. ROUGH DRAFT IN PROGRESS: incomplete and untested.

Author: gojomo

Field Summary
`(package private) static java.lang.String`	`AMP`
`(package private) static java.lang.String`	`APPLET`
`(package private) static java.lang.String`	`BASE`
`(package private) static java.lang.String`	`CLASSEXT`
`(package private) static java.lang.String`	`EACH_ATTRIBUTE_EXTRACTOR`
`(package private) static java.lang.String`	`ESCAPED_AMP`
`(package private) boolean`	`extractInlineCss`
`(package private) boolean`	`extractInlineJs`
`(package private) boolean`	`honorRobots`
`(package private) static java.lang.String`	`JAVASCRIPT`
`(package private) static java.lang.String`	`LIKELY_URI_PATH`
`(package private) static java.lang.String`	`LINK`
`protected java.util.LinkedList<Link>`	`next`
`(package private) static java.lang.String`	`NON_HTML_PATH_EXTENSION`
`(package private) static java.lang.String`	`RELEVANT_TAG_EXTRACTOR` Compiled relevant tag extractor.
`protected java.util.regex.Matcher`	`tags`
`(package private) static java.lang.String`	`WHITESPACE`

Fields inherited from class org.archive.extractor.CharSequenceLinkExtractor
`base, extractErrorListener, source, sourceContent`

Constructor Summary
`RegexpHTMLLinkExtractor()`

Method Summary
`protected boolean`	`findNextLink()` Scan to the next link(s), if any, loading it into the next buffer.
`protected static CharSequenceLinkExtractor`	`newDefaultInstance()`
`protected long`	`processEmbed(java.lang.CharSequence value, java.lang.CharSequence context)`
`protected boolean`	`processGeneralTag(java.lang.CharSequence element, java.lang.CharSequence cs)`
`protected void`	`processLink(java.lang.CharSequence value, java.lang.CharSequence context)`
`protected void`	`processMeta(java.lang.CharSequence cs)`
`protected void`	`processScript(java.lang.CharSequence sequence, int endOfOpenTag)`
`protected void`	`processScriptCode(java.lang.CharSequence cs)`
`protected void`	`processStyle(java.lang.CharSequence sequence, int endOfOpenTag)`
`void`	`reset()` Discard all state.
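
The summary of `findNextLink()` describes a scan that loads results into the `next` buffer. A minimal sketch of that buffered-iterator idea follows, using only `java.util.regex`. The class name `BufferedLinkIteratorSketch`, the `HREF` pattern, and the use of plain `String` values instead of `Link` objects are all assumptions made for illustration. This is not the Heritrix implementation, and how the real base class wires `hasNext()` to `findNextLink()` is not stated on this page.

```java
import java.util.Iterator;
import java.util.LinkedList;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a regex-driven link iterator; not Heritrix code.
class BufferedLinkIteratorSketch implements Iterator<String> {
    // Deliberately naive, assumed pattern: captures quoted href values only.
    private static final Pattern HREF =
        Pattern.compile("(?i)href\\s*=\\s*['\"]([^'\"]*)['\"]");

    private final Matcher matcher;
    // Buffer of links found but not yet handed out (mirrors the 'next' field).
    private final LinkedList<String> next = new LinkedList<>();

    BufferedLinkIteratorSketch(CharSequence html) {
        this.matcher = HREF.matcher(html);
    }

    // Scan forward until at least one link is buffered.
    // Returns false once the input is exhausted and the buffer is empty.
    private boolean findNextLink() {
        while (next.isEmpty() && matcher.find()) {
            next.addLast(matcher.group(1));
        }
        return !next.isEmpty();
    }

    @Override
    public boolean hasNext() {
        return !next.isEmpty() || findNextLink();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return next.removeFirst();
    }

    public static void main(String[] args) {
        Iterator<String> it = new BufferedLinkIteratorSketch(
            "<a href='a.html'>x</a> <A HREF=\"b.html\">y</A>");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

Buffering in a list rather than a single slot lets one scan step queue several links at once, which matches the "next link(s)" wording in the method summary.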

Methods inherited from class org.archive.extractor.CharSequenceLinkExtractor
`charSequenceFrom, createCharSequenceFrom, extract, hasNext, next, nextLink, remove, setup, setup, setup, setup`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

honorRobots

boolean honorRobots

extractInlineCss

boolean extractInlineCss

extractInlineJs

boolean extractInlineJs

next

protected java.util.LinkedList<Link> next

RELEVANT_TAG_EXTRACTOR

static final java.lang.String RELEVANT_TAG_EXTRACTOR

Compiled relevant tag extractor.

This pattern extracts either:

(1) whole <script>...</script> or

(2) <style>...</style> or

(3) <meta ...> or

(4) any other open tag with at least one attribute (e.g. matches "<a href='boo'>" but not "</a>" or "<br>")

Capture groups:

1: SCRIPT SRC=foo>boo</SCRIPT

2: just script open tag

3: STYLE TYPE=moo>zoo</STYLE

4: just style open tag

5: entire other tag, without '<' '>'

6: element

7: META

8: !-- comment --
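
To make the group numbering concrete, here is a small sketch of a pattern laid out in the same eight groups, with a helper that reports which alternative matched. The pattern string `SKETCH_TAG_EXTRACTOR`, the class `RelevantTagSketch`, and the `classify` method are written for this illustration under the description above. They are assumptions, not the actual value of `RELEVANT_TAG_EXTRACTOR`, which is listed under Constant Field Values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the group layout; not copied from Heritrix.
class RelevantTagSketch {
    static final String SKETCH_TAG_EXTRACTOR =
        "(?is)<(?:((script[^>]*+)>.*?</script)"  // 1: whole script, 2: open tag
        + "|((style[^>]*+)>.*?</style)"          // 3: whole style, 4: open tag
        + "|(((meta)|(?:\\w+))\\s+[^>]*+)"       // 5: other tag, 6: element, 7: META
        + "|(!--.*?--))>";                       // 8: comment

    // Label each match by the alternative that produced it.
    static List<String> classify(CharSequence html) {
        Matcher m = Pattern.compile(SKETCH_TAG_EXTRACTOR).matcher(html);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            if (m.start(1) >= 0) {
                out.add("script:" + m.group(2));
            } else if (m.start(3) >= 0) {
                out.add("style:" + m.group(4));
            } else if (m.start(7) >= 0) {
                // Group 7 is nested inside group 5, so test it first.
                out.add("meta");
            } else if (m.start(5) >= 0) {
                out.add("tag:" + m.group(6));
            } else if (m.start(8) >= 0) {
                out.add("comment");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<a href='boo'>x</a><br>"
            + "<meta name='robots' content='nofollow'>"
            + "<script src=foo>boo</script><!-- hi -->";
        System.out.println(classify(html));
    }
}
```

In this sketch, "</a>" and "<br>" produce no match because the general-tag alternative requires whitespace after the element name, which is consistent with case (4) above.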

See Also: Constant Field Values

EACH_ATTRIBUTE_EXTRACTOR

static final java.lang.String EACH_ATTRIBUTE_EXTRACTOR

See Also: Constant Field Values

LIKELY_URI_PATH

static final java.lang.String LIKELY_URI_PATH

See Also: Constant Field Values

ESCAPED_AMP

static final java.lang.String ESCAPED_AMP

See Also: Constant Field Values

AMP

static final java.lang.String AMP

See Also: Constant Field Values

WHITESPACE

static final java.lang.String WHITESPACE

See Also: Constant Field Values

CLASSEXT

static final java.lang.String CLASSEXT

See Also: Constant Field Values

APPLET

static final java.lang.String APPLET

See Also: Constant Field Values

BASE

static final java.lang.String BASE

See Also: Constant Field Values

LINK

static final java.lang.String LINK

See Also: Constant Field Values

JAVASCRIPT

static final java.lang.String JAVASCRIPT

See Also: Constant Field Values

NON_HTML_PATH_EXTENSION

static final java.lang.String NON_HTML_PATH_EXTENSION

See Also: Constant Field Values

Constructor Detail