org.archive.extractor
Class CharSequenceLinkExtractor

java.lang.Object
  extended by org.archive.extractor.CharSequenceLinkExtractor
All Implemented Interfaces:
java.util.Iterator, LinkExtractor
Direct Known Subclasses:
RegexpCSSLinkExtractor, RegexpHTMLLinkExtractor, RegexpJSLinkExtractor

public abstract class CharSequenceLinkExtractor
extends java.lang.Object
implements LinkExtractor

Abstract superclass providing utility methods for LinkExtractors that prefer to work on a CharSequence rather than a stream. Status: rough draft in progress; incomplete and untested.

Author:
gojomo

Field Summary
protected  UURI base
           
protected  ExtractErrorListener extractErrorListener
           
protected  java.util.LinkedList<Link> next
           
protected  UURI source
           
protected  java.lang.CharSequence sourceContent
           
 
Constructor Summary
CharSequenceLinkExtractor()
           
 
Method Summary
protected  java.lang.CharSequence charSequenceFrom(java.io.InputStream content, java.nio.charset.Charset charset)
           
protected  java.lang.CharSequence createCharSequenceFrom(java.io.InputStream content, java.nio.charset.Charset charset)
           
static void extract(java.lang.CharSequence content, UURI source, UURI base, java.util.List<Link> collector, ExtractErrorListener extractErrorListener)
          Convenience method to do default extraction.
protected abstract  boolean findNextLink()
          Scan to the next link(s), if any, loading them into the next buffer.
 boolean hasNext()
           
protected static CharSequenceLinkExtractor newDefaultInstance()
           
 java.lang.Object next()
           
 Link nextLink()
          Alternative to Iterator.next() which returns type Link.
 void remove()
           
 void reset()
          Discard all state.
 void setup(UURI sourceandbase, java.lang.CharSequence content, ExtractErrorListener listener)
          Convenience method for when source and base are the same.
 void setup(UURI sourceandbase, java.io.InputStream content, java.nio.charset.Charset charset, ExtractErrorListener listener)
          Convenience version of the five-argument stream setup for the common case where source and base are the same.
 void setup(UURI source, UURI base, java.lang.CharSequence content, ExtractErrorListener listener)
           
 void setup(UURI source, UURI base, java.io.InputStream content, java.nio.charset.Charset charset, ExtractErrorListener listener)
          Set up the LinkExtractor to operate on the given stream and charset, using the given base as the initial 'base' URI for resolving relative URIs.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

source

protected UURI source

base

protected UURI base

extractErrorListener

protected ExtractErrorListener extractErrorListener

sourceContent

protected java.lang.CharSequence sourceContent

next

protected java.util.LinkedList<Link> next
Constructor Detail

CharSequenceLinkExtractor

public CharSequenceLinkExtractor()
Method Detail

setup

public void setup(UURI source,
                  UURI base,
                  java.io.InputStream content,
                  java.nio.charset.Charset charset,
                  ExtractErrorListener listener)
Description copied from interface: LinkExtractor
Set up the LinkExtractor to operate on the given stream and charset, using the given base as the initial 'base' URI for resolving relative URIs. May be called to 'reset' a LinkExtractor to start with new input.

Specified by:
setup in interface LinkExtractor
Parameters:
source - source URI
base - base URI (usually the source URI) for URI derelativizing
content - input stream of content to scan for links
charset - Charset used to decode the stream into characters
listener - ExtractErrorListener to notify, rather than raising an exception through the extraction loop
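
To make "base URI for URI derelativizing" concrete, here is a minimal sketch of resolving relative links against a base. It uses java.net.URI as a stand-in for UURI, which is an assumption made only for illustration; UURI's own resolution API is not shown on this page, and BaseResolutionSketch is a hypothetical name.

```java
import java.net.URI;

// Illustration only: java.net.URI stands in for UURI (an assumption).
class BaseResolutionSketch {
    /** Resolve a possibly relative link against the base URI. */
    static URI derelativize(URI base, String link) {
        return base.resolve(link);
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.com/a/b/index.html");
        // Relative link: resolved against the base's directory.
        System.out.println(derelativize(base, "../img/logo.png"));
        // prints http://example.com/a/img/logo.png
        // Absolute link: returned unchanged.
        System.out.println(derelativize(base, "http://other.example/x"));
        // prints http://other.example/x
    }
}
```

When source and base are the same, the page's own URI plays both roles; a distinct base would apply, for example, when the content itself declares a different base.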

setup

public void setup(UURI source,
                  UURI base,
                  java.lang.CharSequence content,
                  ExtractErrorListener listener)
Parameters:
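source - source URI
base - base URI (usually the source URI) for URI derelativizing
content - character sequence of content to scan for links
listener - ExtractErrorListener to notify, rather than raising an exception through the extraction loop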

setup

public void setup(UURI sourceandbase,
                  java.lang.CharSequence content,
                  ExtractErrorListener listener)
Convenience method for when source and base are the same.

Parameters:
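sourceandbase - URI to use as source and base for derelativizing
content - character sequence of content to scan for links
listener - ExtractErrorListener to notify, rather than raising an exception through the extraction loop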

setup

public void setup(UURI sourceandbase,
                  java.io.InputStream content,
                  java.nio.charset.Charset charset,
                  ExtractErrorListener listener)
Description copied from interface: LinkExtractor
Convenience version of the five-argument stream setup for the common case where source and base are the same.

Specified by:
setup in interface LinkExtractor
Parameters:
sourceandbase - URI to use as source and base for derelativizing
content - input stream of content to scan for links
charset - Charset used to decode the stream into characters
listener - ExtractErrorListener to notify, rather than raising an exception through the extraction loop

nextLink

public Link nextLink()
Description copied from interface: LinkExtractor
Alternative to Iterator.next() which returns type Link.

Specified by:
nextLink in interface LinkExtractor
Returns:
a discovered Link

reset

public void reset()
Discard all state. Another setup() call is required before the extractor can be used again.

Specified by:
reset in interface LinkExtractor

hasNext

public boolean hasNext()
Specified by:
hasNext in interface java.util.Iterator

findNextLink

protected abstract boolean findNextLink()
Scan to the next link(s), if any, loading them into the next buffer.

Returns:
true if any links are found/available, false otherwise
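
The following sketch shows one plausible way findNextLink() could cooperate with hasNext() and the next buffer. It is not the Heritrix implementation: BufferedExtractorSketch and HrefExtractorSketch are hypothetical stand-ins, links are plain Strings rather than Link objects, and the rule that hasNext() scans only when the buffer is empty is an assumption this page does not confirm.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in, NOT the Heritrix class: links are plain Strings.
abstract class BufferedExtractorSketch implements Iterator<String> {
    protected final LinkedList<String> next = new LinkedList<>();
    protected CharSequence sourceContent;

    /** Scan forward, adding any links found to the 'next' buffer. */
    protected abstract boolean findNextLink();

    @Override
    public boolean hasNext() {
        // Assumed contract: scan only when the buffer is empty.
        return !next.isEmpty() || findNextLink();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return next.removeFirst();
    }
}

// Toy subclass: finds href="..." values with a regular expression.
class HrefExtractorSketch extends BufferedExtractorSketch {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");
    private final Matcher matcher;

    HrefExtractorSketch(CharSequence content) {
        this.sourceContent = content;
        this.matcher = HREF.matcher(content);
    }

    @Override
    protected boolean findNextLink() {
        if (matcher.find()) {
            next.addLast(matcher.group(1));
            return true;
        }
        return false;
    }

    /** Drain the iterator into a list, in document order. */
    static List<String> collect(CharSequence content) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new HrefExtractorSketch(content);
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

Keeping a buffer rather than a single pending item lets one scan step queue several links at once, which matches the "link(s)" wording above.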

next

public java.lang.Object next()
Specified by:
next in interface java.util.Iterator

remove

public void remove()
Specified by:
remove in interface java.util.Iterator

charSequenceFrom

protected java.lang.CharSequence charSequenceFrom(java.io.InputStream content,
                                                  java.nio.charset.Charset charset)
Parameters:
content - input stream of content to decode
charset - Charset used to decode the stream into characters
Returns:
CharSequence obtained from stream in given charset
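
As a rough illustration of obtaining a CharSequence from a stream in a given charset, the sketch below reads the whole stream into memory with standard java.io classes. StreamDecodingSketch and decode are hypothetical names, and reading everything into a StringBuilder is an assumption; the actual method may use a different strategy.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper, not the Heritrix implementation.
class StreamDecodingSketch {
    /** Decode the entire stream into an in-memory character sequence. */
    static CharSequence decode(InputStream content, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(content, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            // Rethrow unchecked so the signature needs no throws clause.
            throw new UncheckedIOException(e);
        }
        return sb;
    }
}
```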

createCharSequenceFrom

protected java.lang.CharSequence createCharSequenceFrom(java.io.InputStream content,
                                                        java.nio.charset.Charset charset)
Parameters:
content - input stream of content to decode
charset - Charset used to decode the stream into characters
Returns:
CharSequence built over given stream in given charset

extract

public static void extract(java.lang.CharSequence content,
                           UURI source,
                           UURI base,
                           java.util.List<Link> collector,
                           ExtractErrorListener extractErrorListener)
Convenience method to do default extraction.

Parameters:
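content - character sequence of content to scan for links
source - source URI
base - base URI (usually the source URI) for URI derelativizing
collector - list to which discovered Links are added
extractErrorListener - ExtractErrorListener to notify, rather than raising an exception through the extraction loop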
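
The collector and extractErrorListener parameters suggest a collect-and-report pattern: good links go into the list, and problems are reported to the listener instead of aborting the loop. The sketch below illustrates that pattern only. ErrorListenerSketch, its noteError method, and ExtractSketch are hypothetical stand-ins, java.net.URI replaces UURI, and the input is a list of already-found candidate strings rather than raw content.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;

// Hypothetical stand-in for ExtractErrorListener; the method name is assumed.
interface ErrorListenerSketch {
    void noteError(Exception e, CharSequence badValue);
}

// Hypothetical stand-in, not the Heritrix extract() method.
class ExtractSketch {
    /** Resolve each candidate against base, appending successes to collector. */
    static void extract(List<String> candidates, URI base,
                        List<URI> collector, ErrorListenerSketch listener) {
        for (String candidate : candidates) {
            try {
                collector.add(base.resolve(new URI(candidate)));
            } catch (URISyntaxException e) {
                // Report and continue rather than aborting the whole loop.
                listener.noteError(e, candidate);
            }
        }
    }
}
```

With this shape, one malformed link does not prevent the remaining links from being collected.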

newDefaultInstance

protected static CharSequenceLinkExtractor newDefaultInstance()


Copyright © 2003-2011 Internet Archive. All Rights Reserved.