org.archive.extractor
Class RegexpHTMLLinkExtractor
java.lang.Object
org.archive.extractor.CharSequenceLinkExtractor
org.archive.extractor.RegexpHTMLLinkExtractor
- All Implemented Interfaces:
- java.util.Iterator, LinkExtractor
public class RegexpHTMLLinkExtractor
- extends CharSequenceLinkExtractor
Basic link extraction from an HTML content body,
using regular expressions.
Status: rough draft in progress; incomplete and untested.
- Author:
- gojomo
Methods inherited from class org.archive.extractor.CharSequenceLinkExtractor
charSequenceFrom, createCharSequenceFrom, extract, hasNext, next, nextLink, remove, setup, setup, setup, setup
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
honorRobots
boolean honorRobots
extractInlineCss
boolean extractInlineCss
extractInlineJs
boolean extractInlineJs
next
protected java.util.LinkedList<Link> next
tags
protected java.util.regex.Matcher tags
RELEVANT_TAG_EXTRACTOR
static final java.lang.String RELEVANT_TAG_EXTRACTOR
- Regular expression (held as a pattern string) for the relevant tag extractor.
This pattern extracts one of:
- (1) a whole <script>...</script> element, or
- (2) a whole <style>...</style> element, or
- (3) a <meta ...> tag, or
- (4) any other open tag with at least one attribute
(e.g. matches "<a href='boo'>" but not "</a>" or "<br>"), or
- (5) a comment (see group 8)
Capture groups:
- 1: SCRIPT SRC=foo>boo</SCRIPT
- 2: just script open tag
- 3: STYLE TYPE=moo>zoo</STYLE
- 4: just style open tag
- 5: entire other tag, without '<' '>'
- 6: element
- 7: META
- 8: !-- comment --
- See Also:
- Constant Field Values
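The alternation-plus-groups approach described for RELEVANT_TAG_EXTRACTOR can be illustrated with a much smaller pattern. The following is only a sketch: the class name TagScanSketch, the pattern, and its group numbering are assumptions made for illustration and are not the value of the constant documented here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagScanSketch {
    // Simplified, illustrative pattern; NOT the value of RELEVANT_TAG_EXTRACTOR.
    // Groups: 1 = script open tag plus body, 2 = style open tag plus body,
    //         3 = any other open tag with attributes, 4 = its element name,
    //         5 = comment (without the surrounding '<' and '>').
    static final Pattern TAGS = Pattern.compile(
        "(?is)<(?:(script[^>]*>.*?</script)"
        + "|(style[^>]*>.*?</style)"
        + "|((\\w+)\\s+[^>]*)"
        + "|(!--.*?--))>");

    // Report which alternative matched for each hit, in document order.
    static List<String> classify(CharSequence html) {
        List<String> out = new ArrayList<>();
        Matcher m = TAGS.matcher(html);
        while (m.find()) {
            if (m.group(1) != null) {
                out.add("script");
            } else if (m.group(2) != null) {
                out.add("style");
            } else if (m.group(3) != null) {
                out.add("tag:" + m.group(4).toLowerCase());
            } else {
                out.add("comment");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(
            classify("<a href='boo'>x</a><br><!-- c --><script src=foo>boo</script>"));
    }
}
```

In this sketch, "</a>" and "<br>" produce no match because the general-tag alternative requires an element name followed by whitespace, which mirrors the "at least one attribute" rule above.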
EACH_ATTRIBUTE_EXTRACTOR
static final java.lang.String EACH_ATTRIBUTE_EXTRACTOR
- See Also:
- Constant Field Values
LIKELY_URI_PATH
static final java.lang.String LIKELY_URI_PATH
- See Also:
- Constant Field Values
ESCAPED_AMP
static final java.lang.String ESCAPED_AMP
- See Also:
- Constant Field Values
AMP
static final java.lang.String AMP
- See Also:
- Constant Field Values
WHITESPACE
static final java.lang.String WHITESPACE
- See Also:
- Constant Field Values
CLASSEXT
static final java.lang.String CLASSEXT
- See Also:
- Constant Field Values
APPLET
static final java.lang.String APPLET
- See Also:
- Constant Field Values
BASE
static final java.lang.String BASE
- See Also:
- Constant Field Values
LINK
static final java.lang.String LINK
- See Also:
- Constant Field Values
JAVASCRIPT
static final java.lang.String JAVASCRIPT
- See Also:
- Constant Field Values
NON_HTML_PATH_EXTENSION
static final java.lang.String NON_HTML_PATH_EXTENSION
- See Also:
- Constant Field Values
RegexpHTMLLinkExtractor
public RegexpHTMLLinkExtractor()
findNextLink
protected boolean findNextLink()
- Description copied from class:
CharSequenceLinkExtractor
- Scan to the next link(s), if any, loading it into the next buffer.
- Specified by:
findNextLink
in class CharSequenceLinkExtractor
- Returns:
- true if any links are found/available, false otherwise
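The "scan ahead and load the next buffer" contract can be sketched as a lazy iterator. This is a minimal illustration under stated assumptions: the class BufferedScanSketch, its two patterns, and the use of String in place of Link are hypothetical; only the field names next and tags and the method name findNextLink come from this page.

```java
import java.util.Iterator;
import java.util.LinkedList;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BufferedScanSketch implements Iterator<String> {
    // Illustrative attribute pattern (an assumption): value of href= or src=.
    private static final Pattern HREF_OR_SRC =
        Pattern.compile("(?i)\\b(?:href|src)\\s*=\\s*['\"]?([^'\"\\s>]+)");

    // Stand-ins for the documented 'next' buffer and 'tags' matcher.
    private final LinkedList<String> next = new LinkedList<>();
    private final Matcher tags;

    public BufferedScanSketch(CharSequence html) {
        // Illustrative tag pattern: any open tag with attributes; group 1 = tag body.
        this.tags = Pattern.compile("(?is)<(\\w+\\s+[^>]*)>").matcher(html);
    }

    // Scan forward until at least one link is buffered or the input is exhausted.
    protected boolean findNextLink() {
        while (next.isEmpty() && tags.find()) {
            Matcher attr = HREF_OR_SRC.matcher(tags.group(1));
            while (attr.find()) {
                next.addLast(attr.group(1));
            }
        }
        return !next.isEmpty();
    }

    @Override
    public boolean hasNext() {
        return !next.isEmpty() || findNextLink();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return next.removeFirst();
    }
}
```

Buffering in a list rather than a single slot lets one tag contribute several links at once while the caller still consumes them one at a time.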
processGeneralTag
protected boolean processGeneralTag(java.lang.CharSequence element,
java.lang.CharSequence cs)
processScriptCode
protected void processScriptCode(java.lang.CharSequence cs)
- Parameters:
cs -
processLink
protected void processLink(java.lang.CharSequence value,
java.lang.CharSequence context)
- Parameters:
value -
context -
processEmbed
protected long processEmbed(java.lang.CharSequence value,
java.lang.CharSequence context)
processScript
protected void processScript(java.lang.CharSequence sequence,
int endOfOpenTag)
processMeta
protected void processMeta(java.lang.CharSequence cs)
processStyle
protected void processStyle(java.lang.CharSequence sequence,
int endOfOpenTag)
- Parameters:
sequence -
endOfOpenTag -
reset
public void reset()
- Discard all state. Another setup() is required to use again.
- Specified by:
reset
in interface LinkExtractor
- Overrides:
reset
in class CharSequenceLinkExtractor
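The setup/reset lifecycle stated above (reset discards all state, and another setup() is needed before reuse) might look roughly like the following. Everything here is hypothetical: the class ResetLifecycleSketch, the single-argument setup(CharSequence), and the isReady() and length() helpers are illustrative only and do not reflect the actual setup overloads of CharSequenceLinkExtractor.

```java
import java.util.LinkedList;

public class ResetLifecycleSketch {
    // null means "not set up"; a hypothetical way to track readiness.
    private CharSequence source;
    private final LinkedList<String> next = new LinkedList<>();

    // Hypothetical single-argument setup; starts from a clean state.
    public void setup(CharSequence content) {
        reset();
        this.source = content;
    }

    // Discard all state; setup() must be called again before further use.
    public void reset() {
        source = null;
        next.clear();
    }

    public boolean isReady() {
        return source != null;
    }

    // Any operation that needs the content fails fast when not set up.
    public int length() {
        if (source == null) {
            throw new IllegalStateException("call setup() first");
        }
        return source.length();
    }
}
```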
newDefaultInstance
protected static CharSequenceLinkExtractor newDefaultInstance()
Copyright © 2003-2011 Internet Archive. All Rights Reserved.