org.archive.crawler.extractor
Class JerichoExtractorHTML

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.extractor.Extractor
                          extended by org.archive.crawler.extractor.ExtractorHTML
                              extended by org.archive.crawler.extractor.JerichoExtractorHTML
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants

public class JerichoExtractorHTML
extends ExtractorHTML
implements CoreAttributeConstants

Improved link extraction from an HTML content body using the Jericho HTML Parser. This extractor extends ExtractorHTML and mimics its workflow, but its internal implementation differs substantially. Instead of relying heavily on Java regular expressions, it uses a real HTML parser library, namely Jericho HTML Parser (http://jerichohtml.sourceforge.net). With this parser it can better handle broken HTML (e.g. missing quotes) and offers improved extraction of HTML form URLs (it extracts not only the action of a form, but also its default values). Unfortunately, the parser has one major drawback: it has to read the whole document into memory for parsing, and thus carries an inherent risk of an OutOfMemoryError (OOME). This risk can be reduced or eliminated by limiting the size of the documents to be parsed (e.g. by using NotExceedsDocumentLengthTresholdDecideRule). Note also that this extractor seems to have lower overall memory consumption than ExtractorHTML, although this has yet to be confirmed on a larger-scale crawl.
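
The size limit described above can be illustrated with a minimal, self-contained sketch. It is not the decide rule named in the description and does not use any crawler API: the class ParseSizeGuard and its maxChars parameter are hypothetical names introduced only to show the idea of refusing to hand oversized documents to a parser that loads everything into memory.

```java
/**
 * Hypothetical guard (not part of the crawler): decides whether a document
 * is small enough to be given to a parser that reads it fully into memory.
 */
final class ParseSizeGuard {
    // Assumed limit, expressed in characters of the decoded content.
    private final long maxChars;

    ParseSizeGuard(long maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be >= 0");
        }
        this.maxChars = maxChars;
    }

    /** Returns true only when the content exists and fits within the limit. */
    boolean shouldParse(CharSequence content) {
        return content != null && content.length() <= maxChars;
    }
}
```

In a real crawl the limit would more likely be set through configuration, such as the decide rule mentioned in the description, than through hand-written code like this.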

Version:
$Date: 2010-04-21 23:39:57 +0000 (Wed, 21 Apr 2010) $ $Revision: 6830 $
Author:
Olaf Freyer
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
protected  long numberOfFormsProcessed
           
 
Fields inherited from class org.archive.crawler.extractor.ExtractorHTML
APPLET, ATTR_EXTRACT_JAVASCRIPT, ATTR_EXTRACT_ONLY_FORM_GETS, ATTR_IGNORE_FORM_ACTION_URLS, ATTR_IGNORE_UNEXPECTED_HTML, ATTR_TREAT_FRAMES_AS_EMBED_LINKS, BASE, CLASSEXT, EACH_ATTRIBUTE_EXTRACTOR, EXTRACT_VALUE_ATTRIBUTES, FRAME, IFRAME, JAVASCRIPT, LINK, MAX_ATTR_VAL_LENGTH, NON_HTML_PATH_EXTENSION, numberOfCURIsHandled, numberOfLinksExtracted, RELEVANT_TAG_EXTRACTOR, WHITESPACE
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Constructor Summary
JerichoExtractorHTML(java.lang.String name)
           
JerichoExtractorHTML(java.lang.String name, java.lang.String description)
           
 
Method Summary
(package private)  void extract(CrawlURI curi, java.lang.CharSequence cs)
          Run extractor.
protected  void processForm(CrawlURI curi, au.id.jericho.lib.html.Element element)
           
protected  void processGeneralTag(CrawlURI curi, au.id.jericho.lib.html.Element element, au.id.jericho.lib.html.Attributes attributes)
           
protected  boolean processMeta(CrawlURI curi, au.id.jericho.lib.html.Element element)
           
protected  void processScript(CrawlURI curi, au.id.jericho.lib.html.Element element)
           
protected  void processStyle(CrawlURI curi, au.id.jericho.lib.html.Element element)
           
 java.lang.String report()
          Compiles and returns a report (in human readable form) about the status of the processor.
 
Methods inherited from class org.archive.crawler.extractor.ExtractorHTML
addLinkFromString, considerIfLikelyUri, considerQueryStringValues, extract, isHtmlExpectedHere, processEmbed, processEmbed, processGeneralTag, processLink, processMeta, processScript, processScriptCode, processStyle
 
Methods inherited from class org.archive.crawler.extractor.Extractor
innerProcess, isHttpTransactionContentToProcess, isIndependentExtractors
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, kickUpdate, process, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

numberOfFormsProcessed

protected long numberOfFormsProcessed
Constructor Detail

JerichoExtractorHTML

public JerichoExtractorHTML(java.lang.String name)

JerichoExtractorHTML

public JerichoExtractorHTML(java.lang.String name,
                            java.lang.String description)
Method Detail

processGeneralTag

protected void processGeneralTag(CrawlURI curi,
                                 au.id.jericho.lib.html.Element element,
                                 au.id.jericho.lib.html.Attributes attributes)

processMeta

protected boolean processMeta(CrawlURI curi,
                              au.id.jericho.lib.html.Element element)

processScript

protected void processScript(CrawlURI curi,
                             au.id.jericho.lib.html.Element element)

processStyle

protected void processStyle(CrawlURI curi,
                            au.id.jericho.lib.html.Element element)

processForm

protected void processForm(CrawlURI curi,
                           au.id.jericho.lib.html.Element element)
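
The class description states that form extraction covers the default values of a form as well as its action. One plausible reading, shown here as a self-contained sketch that does not use the Jericho API, is that the action and the default field values are combined into a single GET-style URL. The class FormUrlSketch, the method buildGetUrl, and the choice of UTF-8 encoding are assumptions made for illustration, not the documented behaviour of processForm.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

/** Hypothetical helper: joins a form action with its default field values. */
final class FormUrlSketch {
    private FormUrlSketch() {}

    /**
     * Appends each name=value pair to the action as a query string.
     * Uses '?' for the first pair unless the action already has a query.
     */
    static String buildGetUrl(String action, Map<String, String> defaults) {
        StringBuilder url = new StringBuilder(action == null ? "" : action);
        char separator = url.indexOf("?") >= 0 ? '&' : '?';
        for (Map.Entry<String, String> field : defaults.entrySet()) {
            url.append(separator)
               .append(encode(field.getKey()))
               .append('=')
               .append(encode(field.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    // Assumption: values are form-encoded as UTF-8; a null value becomes empty.
    private static String encode(String s) {
        try {
            return URLEncoder.encode(s == null ? "" : s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

Passing an insertion-ordered map (for example a LinkedHashMap) keeps the fields in document order, so the resulting URL is deterministic.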

extract

void extract(CrawlURI curi,
             java.lang.CharSequence cs)
Run the extractor. This method is package-private to ease testing.

Overrides:
extract in class ExtractorHTML
Parameters:
curi - CrawlURI we're processing.
cs - Sequence from underlying ReplayCharSequence.
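
The method names on this page (processMeta, processScript, processStyle, processForm, processGeneralTag) suggest that extract routes each parsed element to a handler chosen by its tag name. The sketch below illustrates that kind of dispatch in isolation. It is inferred from the method names only, not confirmed by this page, and TagDispatchSketch, Handler, and handlerFor are hypothetical names.

```java
import java.util.Locale;

/** Hypothetical dispatcher: maps a tag name to the kind of handler it needs. */
final class TagDispatchSketch {
    /** One constant per process* method listed in the Method Summary. */
    enum Handler { META, SCRIPT, STYLE, FORM, GENERAL }

    private TagDispatchSketch() {}

    /** Tag names are compared case-insensitively; unknown tags are GENERAL. */
    static Handler handlerFor(String tagName) {
        switch (tagName.toLowerCase(Locale.ROOT)) {
            case "meta":   return Handler.META;
            case "script": return Handler.SCRIPT;
            case "style":  return Handler.STYLE;
            case "form":   return Handler.FORM;
            default:       return Handler.GENERAL;
        }
    }
}
```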

report

public java.lang.String report()
Description copied from class: Processor
Compiles and returns a human-readable report about the status of the processor. The name of the implementing class should always be included.

Examples of statistics to report include:
* Number of CrawlURIs handled.
* Number of links extracted (for link extractors).

Overrides:
report in class ExtractorHTML
Returns:
A human readable report on the processor's state.
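
A report of the kind described above can be sketched as follows. This is a minimal, self-contained illustration and not the actual output format of this class: ExtractorReportSketch, its record method, and the label wording are assumptions, with the three counters chosen to mirror numberOfCURIsHandled, numberOfFormsProcessed, and numberOfLinksExtracted from the field summaries.

```java
/** Hypothetical stand-in for a processor that keeps counters and reports them. */
final class ExtractorReportSketch {
    private long urisHandled;
    private long formsProcessed;
    private long linksExtracted;

    /** Records the outcome of handling one URI. */
    void record(long forms, long links) {
        urisHandled++;
        formsProcessed += forms;
        linksExtracted += links;
    }

    /** Builds a plain-text report that starts with the implementing class name. */
    String report() {
        StringBuilder sb = new StringBuilder(256);
        sb.append("Processor: ").append(getClass().getName()).append('\n');
        sb.append("  CrawlURIs handled: ").append(urisHandled).append('\n');
        sb.append("  Forms processed: ").append(formsProcessed).append('\n');
        sb.append("  Links extracted: ").append(linksExtracted).append('\n');
        return sb.toString();
    }
}
```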


Copyright © 2003-2011 Internet Archive. All Rights Reserved.