org.archive.crawler.extractor
Class ExtractorHTML

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.extractor.Extractor
                          extended by org.archive.crawler.extractor.ExtractorHTML
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants
Direct Known Subclasses:
AggressiveExtractorHTML, JerichoExtractorHTML

public class ExtractorHTML
extends Extractor
implements CoreAttributeConstants

Basic link-extraction, from an HTML content-body, using regular expressions.
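
As a rough illustration of what regex-based link extraction looks like, here is a minimal sketch. It does not reproduce the patterns this class uses (the values of RELEVANT_TAG_EXTRACTOR and EACH_ATTRIBUTE_EXTRACTOR are not shown on this page); the class name RegexLinkSketch, the HREF_OR_SRC pattern, and the extractLinks method are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLinkSketch {
    // Hypothetical pattern (an assumption, not the Heritrix pattern):
    // matches href="..." or src="..." with double or single quotes.
    private static final Pattern HREF_OR_SRC = Pattern.compile(
            "(?i)\\b(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    /** Returns the raw attribute values found, in document order. */
    public static List<String> extractLinks(CharSequence html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF_OR_SRC.matcher(html);
        while (m.find()) {
            // Group 1 holds a double-quoted value, group 2 a single-quoted one.
            String value = m.group(1) != null ? m.group(1) : m.group(2);
            links.add(value);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.org/\">x</a><img src='/logo.png'>";
        System.out.println(extractLinks(html));
    }
}
```

A regex scan like this tolerates malformed markup that a strict parser might reject, at the cost of occasionally matching text that is not a real attribute.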

Author:
gojomo
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
(package private) static java.lang.String APPLET
           
static java.lang.String ATTR_EXTRACT_JAVASCRIPT
          Whether to try finding links in JavaScript; default true.
static java.lang.String ATTR_EXTRACT_ONLY_FORM_GETS
           
static java.lang.String ATTR_IGNORE_FORM_ACTION_URLS
           
static java.lang.String ATTR_IGNORE_UNEXPECTED_HTML
           
static java.lang.String ATTR_TREAT_FRAMES_AS_EMBED_LINKS
           
(package private) static java.lang.String BASE
           
(package private) static java.lang.String CLASSEXT
           
(package private) static java.lang.String EACH_ATTRIBUTE_EXTRACTOR
           
static java.lang.String EXTRACT_VALUE_ATTRIBUTES
           
(package private) static java.lang.String FRAME
           
(package private) static java.lang.String IFRAME
           
(package private) static java.lang.String JAVASCRIPT
           
(package private) static java.lang.String LINK
           
(package private) static int MAX_ATTR_VAL_LENGTH
           
(package private) static java.lang.String NON_HTML_PATH_EXTENSION
           
protected  long numberOfCURIsHandled
           
protected  long numberOfLinksExtracted
           
(package private) static java.lang.String RELEVANT_TAG_EXTRACTOR
           
(package private) static java.lang.String WHITESPACE
           
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Constructor Summary
ExtractorHTML(java.lang.String name)
           
ExtractorHTML(java.lang.String name, java.lang.String description)
           
 
Method Summary
protected  void addLinkFromString(CrawlURI curi, java.lang.CharSequence uri, java.lang.CharSequence context, char hopType)
           
protected  void considerIfLikelyUri(CrawlURI curi, java.lang.CharSequence candidate, java.lang.CharSequence valueContext, char hopType)
          Consider whether a given string is URI-like.
protected  void considerQueryStringValues(CrawlURI curi, java.lang.CharSequence queryString, java.lang.CharSequence valueContext, char hopType)
          Consider a query-string-like collection of key=value[&key=value] pairs for URI-like strings in the values.
 void extract(CrawlURI curi)
           
(package private)  void extract(CrawlURI curi, java.lang.CharSequence cs)
          Run extractor.
protected  boolean isHtmlExpectedHere(CrawlURI curi)
          Test whether this HTML is so unexpected (e.g., in place of a GIF URI) that it shouldn't be scanned for links.
protected  void processEmbed(CrawlURI curi, java.lang.CharSequence value, java.lang.CharSequence context)
           
protected  void processEmbed(CrawlURI curi, java.lang.CharSequence value, java.lang.CharSequence context, char hopType)
           
protected  void processGeneralTag(CrawlURI curi, java.lang.CharSequence element, java.lang.CharSequence cs)
           
protected  void processLink(CrawlURI curi, java.lang.CharSequence value, java.lang.CharSequence context)
          Handle generic HREF cases.
protected  boolean processMeta(CrawlURI curi, java.lang.CharSequence cs)
          Process metadata tags.
protected  void processScript(CrawlURI curi, java.lang.CharSequence sequence, int endOfOpenTag)
           
protected  void processScriptCode(CrawlURI curi, java.lang.CharSequence cs)
          Extract the (java)script source in the given CharSequence.
protected  void processStyle(CrawlURI curi, java.lang.CharSequence sequence, int endOfOpenTag)
          Process style text.
 java.lang.String report()
          Compiles and returns a report (in human readable form) about the status of the processor.
 
Methods inherited from class org.archive.crawler.extractor.Extractor
innerProcess, isHttpTransactionContentToProcess, isIndependentExtractors
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, kickUpdate, process, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

RELEVANT_TAG_EXTRACTOR

static final java.lang.String RELEVANT_TAG_EXTRACTOR

MAX_ATTR_VAL_LENGTH

static final int MAX_ATTR_VAL_LENGTH

EACH_ATTRIBUTE_EXTRACTOR

static final java.lang.String EACH_ATTRIBUTE_EXTRACTOR

WHITESPACE

static final java.lang.String WHITESPACE
See Also:
Constant Field Values

CLASSEXT

static final java.lang.String CLASSEXT
See Also:
Constant Field Values

APPLET

static final java.lang.String APPLET
See Also:
Constant Field Values

BASE

static final java.lang.String BASE
See Also:
Constant Field Values

LINK

static final java.lang.String LINK
See Also:
Constant Field Values

FRAME

static final java.lang.String FRAME
See Also:
Constant Field Values

IFRAME

static final java.lang.String IFRAME
See Also:
Constant Field Values

ATTR_TREAT_FRAMES_AS_EMBED_LINKS

public static final java.lang.String ATTR_TREAT_FRAMES_AS_EMBED_LINKS
See Also:
Constant Field Values

ATTR_IGNORE_FORM_ACTION_URLS

public static final java.lang.String ATTR_IGNORE_FORM_ACTION_URLS
See Also:
Constant Field Values

ATTR_EXTRACT_ONLY_FORM_GETS

public static final java.lang.String ATTR_EXTRACT_ONLY_FORM_GETS
See Also:
Constant Field Values

ATTR_EXTRACT_JAVASCRIPT

public static final java.lang.String ATTR_EXTRACT_JAVASCRIPT
Whether to try finding links in JavaScript; default true.

See Also:
Constant Field Values

EXTRACT_VALUE_ATTRIBUTES

public static final java.lang.String EXTRACT_VALUE_ATTRIBUTES
See Also:
Constant Field Values

ATTR_IGNORE_UNEXPECTED_HTML

public static final java.lang.String ATTR_IGNORE_UNEXPECTED_HTML
See Also:
Constant Field Values

numberOfCURIsHandled

protected long numberOfCURIsHandled

numberOfLinksExtracted

protected long numberOfLinksExtracted

JAVASCRIPT

static final java.lang.String JAVASCRIPT
See Also:
Constant Field Values

NON_HTML_PATH_EXTENSION

static final java.lang.String NON_HTML_PATH_EXTENSION
See Also:
Constant Field Values
Constructor Detail

ExtractorHTML

public ExtractorHTML(java.lang.String name)

ExtractorHTML

public ExtractorHTML(java.lang.String name,
                     java.lang.String description)
Method Detail

processGeneralTag

protected void processGeneralTag(CrawlURI curi,
                                 java.lang.CharSequence element,
                                 java.lang.CharSequence cs)

considerQueryStringValues

protected void considerQueryStringValues(CrawlURI curi,
                                         java.lang.CharSequence queryString,
                                         java.lang.CharSequence valueContext,
                                         char hopType)
Consider a query-string-like collection of key=value[&key=value] pairs for URI-like strings in the values. Where URI-like strings are found, add them as discovered outlinks.

Parameters:
curi - origin CrawlURI
queryString - query-string-like string
valueContext - page context where found
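
The splitting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method body from this class: QueryStringSketch and values are hypothetical names, and pairs with no '=' or an empty value are simply skipped.

```java
import java.util.ArrayList;
import java.util.List;

public class QueryStringSketch {
    /**
     * Splits a key=value[&key=value] string and returns the non-empty values.
     * Hypothetical helper for illustration only.
     */
    public static List<String> values(CharSequence queryString) {
        List<String> out = new ArrayList<>();
        for (String pair : queryString.toString().split("&")) {
            int eq = pair.indexOf('=');
            // Keep only pairs that have an '=' followed by at least one character.
            if (eq >= 0 && eq < pair.length() - 1) {
                out.add(pair.substring(eq + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(values("a=1&next=http://example.org/x&empty=&flag"));
    }
}
```

Each returned value could then be tested for URI-likeness before being added as an outlink.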

considerIfLikelyUri

protected void considerIfLikelyUri(CrawlURI curi,
                                   java.lang.CharSequence candidate,
                                   java.lang.CharSequence valueContext,
                                   char hopType)
Consider whether a given string is URI-like. If so, add it as a discovered outlink.

Parameters:
curi - origin CrawlURI
candidate - candidate string to test for URI-likeness
valueContext - page context where found
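
This page does not say which test decides that a candidate is URI-like. The sketch below shows one possible heuristic, purely as an assumption: accept absolute http(s) URIs, or whitespace-free strings that contain a slash and no scheme-like colon before it. LikelyUriSketch, ABSOLUTE, RELATIVE, and isLikelyUri are hypothetical names.

```java
import java.util.regex.Pattern;

public class LikelyUriSketch {
    // Assumed heuristic 1: an absolute http or https URI with no whitespace or quoting characters.
    private static final Pattern ABSOLUTE =
            Pattern.compile("(?i)https?://[^\\s\"'<>]+");
    // Assumed heuristic 2: a path-like string containing '/', with no ':' before the first slash.
    private static final Pattern RELATIVE =
            Pattern.compile("[^\\s\"'<>:]*/[^\\s\"'<>]*");

    /** Returns true if the whole candidate matches either heuristic. */
    public static boolean isLikelyUri(CharSequence candidate) {
        return ABSOLUTE.matcher(candidate).matches()
                || RELATIVE.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLikelyUri("http://example.org/a"));
        System.out.println(isLikelyUri("/images/logo.png"));
        System.out.println(isLikelyUri("hello world"));
    }
}
```

A looser heuristic finds more links but also queues more junk URIs, so the choice is a trade-off between recall and wasted fetches.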

processScriptCode

protected void processScriptCode(CrawlURI curi,
                                 java.lang.CharSequence cs)
Extract the (java)script source in the given CharSequence.

Parameters:
curi - source CrawlURI
cs - CharSequence of javascript code

processLink

protected void processLink(CrawlURI curi,
                           java.lang.CharSequence value,
                           java.lang.CharSequence context)
Handle generic HREF cases.

Parameters:
curi -
value -
context -

addLinkFromString

protected void addLinkFromString(CrawlURI curi,
                                 java.lang.CharSequence uri,
                                 java.lang.CharSequence context,
                                 char hopType)

processEmbed

protected final void processEmbed(CrawlURI curi,
                                  java.lang.CharSequence value,
                                  java.lang.CharSequence context)

processEmbed

protected void processEmbed(CrawlURI curi,
                            java.lang.CharSequence value,
                            java.lang.CharSequence context,
                            char hopType)

extract

public void extract(CrawlURI curi)
Specified by:
extract in class Extractor

extract

void extract(CrawlURI curi,
             java.lang.CharSequence cs)
Run extractor. This method is package visible to ease testing.

Parameters:
curi - CrawlURI we're processing.
cs - Sequence from underlying ReplayCharSequence. This is TRANSIENT data. Make a copy if you want the data to live outside of this extractor's lifetime.
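
The advice to copy transient data can be sketched as below. A StringBuilder stands in for the ReplayCharSequence here (an assumption made so the example is self-contained); TransientCopySketch and keep are hypothetical names.

```java
public class TransientCopySketch {
    /** Copies the requested range into an independent, immutable String. */
    public static String keep(CharSequence cs, int start, int end) {
        return cs.subSequence(start, end).toString();
    }

    public static void main(String[] args) {
        // Stand-in for a buffer whose contents may go away after extraction.
        StringBuilder buffer = new StringBuilder("<a href=x>");
        String kept = keep(buffer, 3, 9);
        buffer.setLength(0); // simulate the backing data being released
        System.out.println(kept); // the copy is unaffected
    }
}
```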

isHtmlExpectedHere

protected boolean isHtmlExpectedHere(CrawlURI curi)
                              throws org.apache.commons.httpclient.URIException
Test whether this HTML is so unexpected (e.g., in place of a GIF URI) that it shouldn't be scanned for links.

Parameters:
curi - CrawlURI to examine.
Returns:
True if HTML is acceptable/expected here
Throws:
org.apache.commons.httpclient.URIException
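
One way such a check could work is to look at the extension of the URI path. The sketch below is an assumption for illustration: the extension list is invented (the value of NON_HTML_PATH_EXTENSION is not shown on this page), and HtmlExpectedSketch and isHtmlExpected are hypothetical names.

```java
import java.util.regex.Pattern;

public class HtmlExpectedSketch {
    // Hypothetical extension list, not the real NON_HTML_PATH_EXTENSION value.
    private static final Pattern NON_HTML_EXT =
            Pattern.compile("(?i)\\.(gif|jpe?g|png|css|js|pdf|swf)$");

    /** Returns false when the path ends in an extension that implies non-HTML content. */
    public static boolean isHtmlExpected(String uriPath) {
        if (uriPath == null) {
            return true; // no path to judge by, so do not rule HTML out
        }
        return !NON_HTML_EXT.matcher(uriPath).find();
    }

    public static void main(String[] args) {
        System.out.println(isHtmlExpected("/img/banner.GIF"));
        System.out.println(isHtmlExpected("/index.html"));
    }
}
```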

processScript

protected void processScript(CrawlURI curi,
                             java.lang.CharSequence sequence,
                             int endOfOpenTag)

processMeta

protected boolean processMeta(CrawlURI curi,
                              java.lang.CharSequence cs)
Process metadata tags.

Parameters:
curi - CrawlURI we're processing.
cs - Sequence from underlying ReplayCharSequence. This is TRANSIENT data. Make a copy if you want the data to live outside of this extractor's lifetime.
Returns:
True if a robots exclusion metatag is found.
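
To make the return value concrete, here is a minimal sketch of detecting a robots exclusion metatag. It assumes that "nofollow" or "none" in the content attribute counts as an exclusion, and it ignores any crawler-side robots policy; MetaRobotsSketch and isRobotsExclusion are hypothetical names, not members of this class.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaRobotsSketch {
    // Simplified attribute patterns, assumed for this example only.
    private static final Pattern NAME =
            Pattern.compile("(?i)\\bname\\s*=\\s*[\"']?([^\"'\\s>]+)");
    private static final Pattern CONTENT =
            Pattern.compile("(?i)\\bcontent\\s*=\\s*[\"']([^\"']*)[\"']");

    /** Returns true if the tag is a robots meta tag whose content forbids following links. */
    public static boolean isRobotsExclusion(CharSequence metaTag) {
        Matcher name = NAME.matcher(metaTag);
        Matcher content = CONTENT.matcher(metaTag);
        if (!name.find() || !content.find()) {
            return false;
        }
        if (!"robots".equalsIgnoreCase(name.group(1))) {
            return false;
        }
        String directives = content.group(1).toLowerCase(Locale.ROOT);
        return directives.contains("nofollow") || directives.contains("none");
    }

    public static void main(String[] args) {
        System.out.println(isRobotsExclusion(
                "<meta name=\"robots\" content=\"noindex, nofollow\">"));
    }
}
```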

processStyle

protected void processStyle(CrawlURI curi,
                            java.lang.CharSequence sequence,
                            int endOfOpenTag)
Process style text.

Parameters:
curi - CrawlURI we're processing.
sequence - Sequence from underlying ReplayCharSequence. This is TRANSIENT data. Make a copy if you want the data to live outside of this extractor's lifetime.
endOfOpenTag -

report

public java.lang.String report()
Description copied from class: Processor
Compiles and returns a report (in human readable form) about the status of the processor. The processor's name (of implementing class) should always be included.

Examples of stats declared would include:
* Number of CrawlURIs handled.
* Number of links extracted (for link extractors)
etc.

Overrides:
report in class Processor
Returns:
A human readable report on the processor's state.
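
A report of the kind described could be assembled as in the sketch below. The layout is an assumption, not the format this class produces; ReportSketch and record are hypothetical, while the two counter names mirror the fields listed in the Field Summary.

```java
public class ReportSketch {
    protected long numberOfCURIsHandled;
    protected long numberOfLinksExtracted;

    /** Hypothetical helper: records one handled CrawlURI and the links found in it. */
    public void record(long linksFound) {
        numberOfCURIsHandled++;
        numberOfLinksExtracted += linksFound;
    }

    /** Builds a human-readable report that includes the implementing class name. */
    public String report() {
        StringBuilder ret = new StringBuilder();
        ret.append("Processor: ").append(getClass().getName()).append('\n');
        ret.append("  CrawlURIs handled: ").append(numberOfCURIsHandled).append('\n');
        ret.append("  Links extracted:   ").append(numberOfLinksExtracted).append('\n');
        return ret.toString();
    }

    public static void main(String[] args) {
        ReportSketch r = new ReportSketch();
        r.record(5);
        r.record(7);
        System.out.print(r.report());
    }
}
```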


Copyright © 2003-2011 Internet Archive. All Rights Reserved.