org.archive.extractor
Interface LinkExtractor

All Superinterfaces:
java.util.Iterator
All Known Implementing Classes:
CharSequenceLinkExtractor, RegexpCSSLinkExtractor, RegexpHTMLLinkExtractor, RegexpJSLinkExtractor

public interface LinkExtractor
extends java.util.Iterator

LinkExtractor is a general interface for classes that, given an InputStream and a Charset, scan the content for Links and return them through the Iterator interface. Depending on whether the link-extraction technique is amenable to incremental scanning, implementors may scan incrementally, or may complete all extraction on the first call to hasNext() and then hand out Links from an internal collection.

Status: rough draft in progress; incomplete and untested.
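The "extract everything on the first hasNext()" strategy described above can be sketched as follows. This is only an illustration, not one of the implementing classes: EagerHrefIterator, its href-only regular expression, and the extractAll helper are hypothetical names, and the sketch yields plain Strings rather than Link objects.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Queue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in (not an org.archive class): eager href extraction behind an Iterator. */
class EagerHrefIterator implements Iterator<String> {
    // Deliberately simplistic pattern; real extractors handle far more cases.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    private final InputStream content;
    private final Charset charset;
    private Queue<String> pending; // stays null until the first hasNext() call

    EagerHrefIterator(InputStream content, Charset charset) {
        this.content = content;
        this.charset = charset;
    }

    @Override
    public boolean hasNext() {
        if (pending == null) {
            pending = scanAll(); // all extraction happens here, exactly once
        }
        return !pending.isEmpty();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return pending.remove(); // later calls only drain the internal collection
    }

    private Queue<String> scanAll() {
        Queue<String> found = new ArrayDeque<>();
        try {
            String text = new String(content.readAllBytes(), charset);
            Matcher m = HREF.matcher(text);
            while (m.find()) {
                found.add(m.group(1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    /** Convenience helper for the sketch: drain the iterator into a list. */
    static List<String> extractAll(String html) {
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        Iterator<String> it = new EagerHrefIterator(in, StandardCharsets.UTF_8);
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

An incremental implementation would instead advance its scanner just far enough inside hasNext() to find one more link; callers see the same Iterator behaviour either way.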

Author:
gojomo

Method Summary
 Link nextLink()
          Alternative to Iterator.next() which returns type Link.
 void reset()
          Discard all state and release any used resources.
 void setup(UURI sourceandbase, java.io.InputStream content, java.nio.charset.Charset charset, ExtractErrorListener listener)
          Convenience version of the five-argument setup for the common case where the source and base URIs are the same.
 void setup(UURI source, UURI base, java.io.InputStream content, java.nio.charset.Charset charset, ExtractErrorListener listener)
          Set up the LinkExtractor to operate on the given stream and charset, using the given base URI as the initial 'base' for resolving relative URIs.
 
Methods inherited from interface java.util.Iterator
hasNext, next, remove
 

Method Detail

setup

void setup(UURI source,
           UURI base,
           java.io.InputStream content,
           java.nio.charset.Charset charset,
           ExtractErrorListener listener)
Set up the LinkExtractor to operate on the given stream and charset, using the given base URI as the initial 'base' for resolving relative URIs. May also be called on a LinkExtractor that is already in use, to 'reset' it and start with new input.

Parameters:
source - source URI
base - base URI (usually the same as the source URI) against which relative URIs are resolved
content - input stream of content to scan for links
charset - Charset used to decode the stream into characters
listener - ExtractErrorListener to notify of errors, rather than raising an exception through the extraction loop
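To illustrate the roles of the base and listener parameters, here is a minimal sketch of resolving raw links against a base URI while routing failures to a callback instead of throwing. It uses java.net.URI in place of UURI, and ErrorSink is a hypothetical stand-in for ExtractErrorListener, whose actual methods are not shown on this page.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

class BaseResolutionSketch {
    /** Hypothetical stand-in for ExtractErrorListener (real method names are an assumption). */
    interface ErrorSink {
        void noteError(String rawLink, Exception cause);
    }

    /** Resolve each raw link against base; report bad links to the listener and keep going. */
    static List<URI> resolveAll(URI base, List<String> rawLinks, ErrorSink listener) {
        List<URI> resolved = new ArrayList<>();
        for (String raw : rawLinks) {
            try {
                resolved.add(base.resolve(new URI(raw)));
            } catch (URISyntaxException e) {
                // The loop is not interrupted: the problem is handed to the listener.
                listener.noteError(raw, e);
            }
        }
        return resolved;
    }
}
```

With base http://example.com/dir/page.html, the raw link img.png resolves to http://example.com/dir/img.png and /top resolves to http://example.com/top. A separate base matters when a page declares its own base, since relative links then resolve against that value rather than the URI the content was fetched from.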

setup

void setup(UURI sourceandbase,
           java.io.InputStream content,
           java.nio.charset.Charset charset,
           ExtractErrorListener listener)
Convenience version of the setup above for the common case where the source and base URIs are the same.

Parameters:
sourceandbase - URI to use both as the source and as the base for resolving relative URIs
content - input stream of content to scan for links
charset - Charset used to decode the stream into characters
listener - ExtractErrorListener to notify of errors, rather than raising an exception through the extraction loop

nextLink

Link nextLink()
Alternative to Iterator.next() which returns type Link.

Returns:
a discovered Link
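Because the interface extends the raw java.util.Iterator, next() is declared to return Object, and nextLink() spares callers a cast. The sketch below mirrors that shape with hypothetical stand-ins (SketchLink, SketchExtractor, and the over factory are not part of org.archive.extractor).

```java
import java.util.Arrays;
import java.util.Iterator;

class TypedNextSketch {
    /** Hypothetical stand-in for Link, holding only a destination string. */
    static final class SketchLink {
        final String destination;

        SketchLink(String destination) {
            this.destination = destination;
        }
    }

    /** Hypothetical mirror of the interface shape: raw Iterator plus a typed accessor. */
    @SuppressWarnings("rawtypes")
    interface SketchExtractor extends Iterator {
        SketchLink nextLink();
    }

    /** Build a toy extractor that yields one SketchLink per given destination. */
    static SketchExtractor over(String... destinations) {
        final Iterator<String> source = Arrays.asList(destinations).iterator();
        return new SketchExtractor() {
            @Override
            public boolean hasNext() {
                return source.hasNext();
            }

            @Override
            public Object next() {
                return nextLink(); // same element, but statically typed as Object
            }

            @Override
            public SketchLink nextLink() {
                return new SketchLink(source.next());
            }
        };
    }
}
```

A caller using next() would write (SketchLink) ex.next(), whereas ex.nextLink() returns the typed value directly; both advance the same underlying iteration.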

reset

void reset()
Discard all state and release any used resources.



Copyright © 2003-2011 Internet Archive. All Rights Reserved.