org.archive.extractor
Class RegexpHTMLLinkExtractor

java.lang.Object
  extended by org.archive.extractor.CharSequenceLinkExtractor
      extended by org.archive.extractor.RegexpHTMLLinkExtractor
All Implemented Interfaces:
java.util.Iterator, LinkExtractor

public class RegexpHTMLLinkExtractor
extends CharSequenceLinkExtractor

Basic link extraction from an HTML content body, using regular expressions. Rough draft in progress: incomplete and untested.

Author:
gojomo

Field Summary
(package private) static java.lang.String AMP
           
(package private) static java.lang.String APPLET
           
(package private) static java.lang.String BASE
           
(package private) static java.lang.String CLASSEXT
           
(package private) static java.lang.String EACH_ATTRIBUTE_EXTRACTOR
           
(package private) static java.lang.String ESCAPED_AMP
           
(package private)  boolean extractInlineCss
           
(package private)  boolean extractInlineJs
           
(package private)  boolean honorRobots
           
(package private) static java.lang.String JAVASCRIPT
           
(package private) static java.lang.String LIKELY_URI_PATH
           
(package private) static java.lang.String LINK
           
protected  java.util.LinkedList<Link> next
           
(package private) static java.lang.String NON_HTML_PATH_EXTENSION
           
(package private) static java.lang.String RELEVANT_TAG_EXTRACTOR
          Compiled relevant tag extractor.
protected  java.util.regex.Matcher tags
           
(package private) static java.lang.String WHITESPACE
           
 
Fields inherited from class org.archive.extractor.CharSequenceLinkExtractor
base, extractErrorListener, source, sourceContent
 
Constructor Summary
RegexpHTMLLinkExtractor()
           
 
Method Summary
protected  boolean findNextLink()
          Scan to the next link(s), if any, loading them into the next buffer.
protected static CharSequenceLinkExtractor newDefaultInstance()
           
protected  long processEmbed(java.lang.CharSequence value, java.lang.CharSequence context)
           
protected  boolean processGeneralTag(java.lang.CharSequence element, java.lang.CharSequence cs)
           
protected  void processLink(java.lang.CharSequence value, java.lang.CharSequence context)
           
protected  void processMeta(java.lang.CharSequence cs)
           
protected  void processScript(java.lang.CharSequence sequence, int endOfOpenTag)
           
protected  void processScriptCode(java.lang.CharSequence cs)
           
protected  void processStyle(java.lang.CharSequence sequence, int endOfOpenTag)
           
 void reset()
          Discard all state.
 
Methods inherited from class org.archive.extractor.CharSequenceLinkExtractor
charSequenceFrom, createCharSequenceFrom, extract, hasNext, next, nextLink, remove, setup, setup, setup, setup
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

honorRobots

boolean honorRobots

extractInlineCss

boolean extractInlineCss

extractInlineJs

boolean extractInlineJs

next

protected java.util.LinkedList<Link> next

tags

protected java.util.regex.Matcher tags

RELEVANT_TAG_EXTRACTOR

static final java.lang.String RELEVANT_TAG_EXTRACTOR
Compiled relevant tag extractor.

This pattern extracts either:

  • (1) whole <script>...</script> or
  • (2) <style>...</style> or
  • (3) <meta ...> or
  • (4) any other open tag with at least one attribute (e.g., matches "<a href='boo'>" but not "</a>" or "<br>")

    Capture groups:

  • 1: SCRIPT SRC=foo>boo</SCRIPT
  • 2: just script open tag
  • 3: STYLE TYPE=moo>zoo</STYLE
  • 4: just style open tag
  • 5: entire other tag, without '<' '>'
  • 6: element
  • 7: META
  • 8: !-- comment --

    See Also:
    Constant Field Values
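
The group layout described above can be illustrated with a small, self-contained sketch. The pattern below is a simplified stand-in written for this example; it is an assumption, not the actual value of RELEVANT_TAG_EXTRACTOR, and the names RelevantTagSketch, SKETCH_TAG_EXTRACTOR and describe are hypothetical. It only shows how eight groups numbered as in the list could be arranged and inspected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: NOT the real RELEVANT_TAG_EXTRACTOR constant. */
final class RelevantTagSketch {
    // Assumed, simplified pattern whose group numbers follow the list above.
    static final String SKETCH_TAG_EXTRACTOR =
        "(?is)<(?:"
        + "((script[^>]*+)>.*?</script)"    // 1: whole script, 2: script open tag
        + "|((style[^>]*+)>.*?</style)"     // 3: whole style, 4: style open tag
        + "|(((meta)|(?:\\w+))\\s+[^>]*+)"  // 5: other tag, 6: element, 7: meta
        + "|(!--.*?--)"                     // 8: comment
        + ")>";

    private static final Pattern TAGS = Pattern.compile(SKETCH_TAG_EXTRACTOR);

    /** Returns one label per match, in document order. */
    static List<String> describe(CharSequence html) {
        List<String> kinds = new ArrayList<>();
        Matcher m = TAGS.matcher(html);
        while (m.find()) {
            if (m.group(1) != null) {
                kinds.add("script");
            } else if (m.group(3) != null) {
                kinds.add("style");
            } else if (m.group(7) != null) {
                // Check group 7 before group 5: a meta tag also fills group 5.
                kinds.add("meta");
            } else if (m.group(5) != null) {
                kinds.add("tag:" + m.group(6).toLowerCase(Locale.ROOT));
            } else {
                kinds.add("comment");
            }
        }
        return kinds;
    }

    public static void main(String[] args) {
        String html = "<script src=foo>boo</script><a href='boo'>x</a><br>";
        System.out.println(describe(html));
    }
}
```

In this sketch, "</a>" and "<br>" produce no match because the general-tag alternative requires whitespace followed by attribute text after the element name, which mirrors case (4) in the description.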

    EACH_ATTRIBUTE_EXTRACTOR

    static final java.lang.String EACH_ATTRIBUTE_EXTRACTOR
    See Also:
    Constant Field Values

    LIKELY_URI_PATH

    static final java.lang.String LIKELY_URI_PATH
    See Also:
    Constant Field Values

    ESCAPED_AMP

    static final java.lang.String ESCAPED_AMP
    See Also:
    Constant Field Values

    AMP

    static final java.lang.String AMP
    See Also:
    Constant Field Values

    WHITESPACE

    static final java.lang.String WHITESPACE
    See Also:
    Constant Field Values

    CLASSEXT

    static final java.lang.String CLASSEXT
    See Also:
    Constant Field Values

    APPLET

    static final java.lang.String APPLET
    See Also:
    Constant Field Values

    BASE

    static final java.lang.String BASE
    See Also:
    Constant Field Values

    LINK

    static final java.lang.String LINK
    See Also:
    Constant Field Values

    JAVASCRIPT

    static final java.lang.String JAVASCRIPT
    See Also:
    Constant Field Values

    NON_HTML_PATH_EXTENSION

    static final java.lang.String NON_HTML_PATH_EXTENSION
    See Also:
    Constant Field Values

    Constructor Detail

    RegexpHTMLLinkExtractor

    public RegexpHTMLLinkExtractor()

    Method Detail

    findNextLink

    protected boolean findNextLink()
    Description copied from class: CharSequenceLinkExtractor
    Scan to the next link(s), if any, loading them into the next buffer.

    Specified by:
    findNextLink in class CharSequenceLinkExtractor
    Returns:
    true if any links are found/available, false otherwise
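
The look-ahead contract described here (scan forward, buffer whatever is found, report whether anything is available) can be sketched as a plain java.util.Iterator. This is a minimal illustration under stated assumptions, not the class's implementation: BufferedHrefIterator, its simplified HREF pattern, and the choice that an extractor without setup() yields nothing are all hypothetical. Only the field names next and tags and the method names findNextLink, setup and reset are taken from this page.

```java
import java.util.Iterator;
import java.util.LinkedList;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in; not the real CharSequenceLinkExtractor. */
final class BufferedHrefIterator implements Iterator<String> {
    // Assumed, simplified pattern: quoted href/src attribute values only.
    private static final Pattern HREF =
        Pattern.compile("(?i)\\b(?:href|src)\\s*=\\s*([\"'])(.*?)\\1");

    private final LinkedList<String> next = new LinkedList<>();
    private Matcher tags; // null until setup() is called

    void setup(CharSequence content) {
        next.clear();
        tags = HREF.matcher(content);
    }

    /** Scans forward until one link is buffered or the input is exhausted. */
    private boolean findNextLink() {
        if (tags == null) {
            return false; // assumption: without setup() there is nothing to scan
        }
        while (tags.find()) {
            String value = tags.group(2).trim();
            if (!value.isEmpty()) {
                next.addLast(value);
                return true;
            }
        }
        return false;
    }

    @Override
    public boolean hasNext() {
        // Serve from the buffer first; scan only when the buffer is empty.
        return !next.isEmpty() || findNextLink();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return next.removeFirst();
    }

    /** Discards all state; setup() must be called again before reuse. */
    void reset() {
        next.clear();
        tags = null;
    }
}
```

A caller would invoke setup(...) with the content, loop on hasNext()/next(), and call reset() when done. Buffering in a list rather than a single slot leaves room for a scan step that yields several links at once, which is what "link(s)" in the description suggests.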

    processGeneralTag

    protected boolean processGeneralTag(java.lang.CharSequence element,
                                        java.lang.CharSequence cs)

    processScriptCode

    protected void processScriptCode(java.lang.CharSequence cs)
    Parameters:
    cs -

    processLink

    protected void processLink(java.lang.CharSequence value,
                               java.lang.CharSequence context)
    Parameters:
    value -
    context -

    processEmbed

    protected long processEmbed(java.lang.CharSequence value,
                                java.lang.CharSequence context)

    processScript

    protected void processScript(java.lang.CharSequence sequence,
                                 int endOfOpenTag)

    processMeta

    protected void processMeta(java.lang.CharSequence cs)

    processStyle

    protected void processStyle(java.lang.CharSequence sequence,
                                int endOfOpenTag)
    Parameters:
    sequence -
    endOfOpenTag -

    reset

    public void reset()
    Discard all state. Another call to setup() is required before the extractor can be used again.

    Specified by:
    reset in interface LinkExtractor
    Overrides:
    reset in class CharSequenceLinkExtractor

    newDefaultInstance

    protected static CharSequenceLinkExtractor newDefaultInstance()


    Copyright © 2003-2011 Internet Archive. All Rights Reserved.