org.archive.extractor
Class RegexpJSLinkExtractor

java.lang.Object
  extended by org.archive.extractor.CharSequenceLinkExtractor
      extended by org.archive.extractor.RegexpJSLinkExtractor
All Implemented Interfaces:
java.util.Iterator, LinkExtractor

public class RegexpJSLinkExtractor
extends CharSequenceLinkExtractor

Uses regular expressions to find likely URIs inside JavaScript. Note: this class is a rough draft in progress; it is incomplete and untested.
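
The general technique can be illustrated with a minimal sketch: pull quoted string literals out of the script text, then keep only those that look like URIs. The class name JsUriSketch, the method likelyUris, and both patterns below are assumptions made for illustration; they are not the values of JAVASCRIPT_STRING_EXTRACTOR or STRING_URI_DETECTOR, which this page does not show.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not the actual RegexpJSLinkExtractor code.
class JsUriSketch {
    // Assumed pattern: a single- or double-quoted string literal,
    // allowing backslash-escaped characters between the quotes.
    static final Pattern STRING_LITERAL =
            Pattern.compile("([\"'])((?:\\\\.|(?!\\1)[^\\\\])*)\\1");

    // Assumed heuristic: no whitespace, and at least one dot or slash.
    static final Pattern LIKELY_URI = Pattern.compile("\\S*[./]\\S*");

    // Returns the contents of every string literal that passes the heuristic.
    static List<String> likelyUris(CharSequence script) {
        List<String> found = new ArrayList<>();
        Matcher strings = STRING_LITERAL.matcher(script);
        while (strings.find()) {
            String candidate = strings.group(2); // text between the quotes
            if (LIKELY_URI.matcher(candidate).matches()) {
                found.add(candidate);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String js = "var a = \"http://example.com/a.html\"; "
                + "var b = 'hello world'; var c = 'img/logo.png';";
        // prints [http://example.com/a.html, img/logo.png]
        System.out.println(likelyUris(js));
    }
}
```

A heuristic like this trades precision for recall: it will accept some strings that are not URIs (for example, a version number such as "1.2"), which is why the results are best treated as likely links rather than confirmed ones.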

Author:
gojomo

Field Summary
(package private) static java.lang.String AMP
           
(package private) static java.lang.String ESCAPED_AMP
           
(package private) static java.util.regex.Pattern JAVASCRIPT_STRING_EXTRACTOR
           
(package private)  java.util.LinkedList<java.util.regex.Matcher> matcherStack
           
(package private) static java.util.regex.Pattern STRING_URI_DETECTOR
           
(package private)  java.util.regex.Matcher strings
           
(package private) static java.lang.String WHITESPACE
           
 
Fields inherited from class org.archive.extractor.CharSequenceLinkExtractor
base, extractErrorListener, next, source, sourceContent
 
Constructor Summary
RegexpJSLinkExtractor()
           
 
Method Summary
protected  boolean findNextLink()
          Scans to the next link(s), if any, loading them into the next buffer.
protected static CharSequenceLinkExtractor newDefaultInstance()
           
 void reset()
          Discard all state.
 
Methods inherited from class org.archive.extractor.CharSequenceLinkExtractor
charSequenceFrom, createCharSequenceFrom, extract, hasNext, next, nextLink, remove, setup, setup, setup, setup
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

AMP

static final java.lang.String AMP
See Also:
Constant Field Values

ESCAPED_AMP

static final java.lang.String ESCAPED_AMP
See Also:
Constant Field Values

WHITESPACE

static final java.lang.String WHITESPACE
See Also:
Constant Field Values

JAVASCRIPT_STRING_EXTRACTOR

static final java.util.regex.Pattern JAVASCRIPT_STRING_EXTRACTOR

STRING_URI_DETECTOR

static final java.util.regex.Pattern STRING_URI_DETECTOR

strings

java.util.regex.Matcher strings

matcherStack

java.util.LinkedList<java.util.regex.Matcher> matcherStack
Constructor Detail

RegexpJSLinkExtractor

public RegexpJSLinkExtractor()
Method Detail

findNextLink

protected boolean findNextLink()
Description copied from class: CharSequenceLinkExtractor
Scans to the next link(s), if any, loading them into the next buffer.

Specified by:
findNextLink in class CharSequenceLinkExtractor
Returns:
true if any links are found/available, false otherwise
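
The contract described here (scan ahead, buffer what is found, report whether anything is available) is a common look-ahead iterator pattern. The sketch below shows one way such a method can back hasNext() and next(). It is an assumption-laden illustration: LookAheadLinkIterator, its LINK pattern, and the use of a Deque<String> as the buffer are hypothetical and do not reproduce CharSequenceLinkExtractor.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in; not the real CharSequenceLinkExtractor.
class LookAheadLinkIterator implements Iterator<String> {
    // Illustrative pattern only: http(s) URLs up to the next whitespace.
    private static final Pattern LINK = Pattern.compile("https?://\\S+");

    private final Matcher matcher;
    // Links found by scanning but not yet returned to the caller.
    private final Deque<String> next = new ArrayDeque<>();

    LookAheadLinkIterator(CharSequence source) {
        this.matcher = LINK.matcher(source);
    }

    // Scans forward once; returns true if a link was added to the buffer.
    protected boolean findNextLink() {
        if (matcher.find()) {
            next.addLast(matcher.group());
            return true;
        }
        return false;
    }

    @Override
    public boolean hasNext() {
        // Only scan when the buffer is empty.
        return !next.isEmpty() || findNextLink();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return next.removeFirst();
    }
}
```

Keeping the scan inside findNextLink() means a subclass only has to supply the matching logic, while the iteration methods stay in the base class.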

reset

public void reset()
Description copied from class: CharSequenceLinkExtractor
Discard all state. Another setup() is required to use again.

Specified by:
reset in interface LinkExtractor
Overrides:
reset in class CharSequenceLinkExtractor
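
The setup/reset lifecycle can be pictured with a small sketch. Everything in it is hypothetical: ResettableScanner, its single setup(CharSequence) method, and the choice to throw IllegalStateException when used after reset() are assumptions for illustration, since this page does not document the setup overloads or what happens on misuse.

```java
import java.util.NoSuchElementException;

// Hypothetical illustration of a setup()/reset() lifecycle.
class ResettableScanner {
    private CharSequence source; // null until setup() is called
    private int position;

    // Attaches new content and starts from its beginning.
    void setup(CharSequence content) {
        this.source = content;
        this.position = 0;
    }

    // Discards all state; setup() must be called again before reuse.
    void reset() {
        this.source = null;
        this.position = 0;
    }

    boolean hasNext() {
        if (source == null) {
            // Assumed behavior: fail fast when used without setup().
            throw new IllegalStateException("setup() must be called before use");
        }
        return position < source.length();
    }

    char next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return source.charAt(position++);
    }
}
```

Reusing one instance across many inputs in this way avoids reallocating the scanner for every document, at the cost of requiring callers to pair each reset() with a fresh setup().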

newDefaultInstance

protected static CharSequenceLinkExtractor newDefaultInstance()


Copyright © 2003-2011 Internet Archive. All Rights Reserved.