org.archive.crawler.extractor
Class JerichoExtractorHTML
java.lang.Object
javax.management.Attribute
org.archive.crawler.settings.Type
org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ModuleType
org.archive.crawler.framework.Processor
org.archive.crawler.extractor.Extractor
org.archive.crawler.extractor.ExtractorHTML
org.archive.crawler.extractor.JerichoExtractorHTML
- All Implemented Interfaces:
- java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants
public class JerichoExtractorHTML
- extends ExtractorHTML
- implements CoreAttributeConstants
Improved link extraction from an HTML content body using the Jericho HTML parser.
This extractor extends ExtractorHTML and mimics its workflow, but its internal
implementation differs substantially. Instead of relying heavily on Java
regular expressions, it uses a real HTML parser library, namely Jericho HTML
Parser (http://jerichohtml.sourceforge.net).
With this parser it can better handle broken HTML (e.g. missing quotes)
and offers improved extraction of HTML form URLs (it extracts not only
the action of a form, but also its default values).
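The idea of combining a form's action with its default field values can be illustrated with a minimal sketch. It uses only the JDK and calls neither Jericho nor Heritrix; the class name FormUrlSketch, the helper buildFormUrl, and the sample field names are hypothetical, and how JerichoExtractorHTML actually assembles the URL is not stated here.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Illustrative only: not part of Heritrix or Jericho.
class FormUrlSketch {

    // Hypothetical helper: appends the form's default field values to its
    // action as a URL-encoded query string (as a GET submission would).
    static String buildFormUrl(String action, Map<String, String> defaults) {
        if (defaults.isEmpty()) {
            return action; // nothing to append, the action alone is the link
        }
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : defaults.entrySet()) {
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "="
                    + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        // Reuse an existing query string in the action if there is one.
        char separator = action.indexOf('?') >= 0 ? '&' : '?';
        return action + separator + query;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("q", "web archive"); // assumed default of a text input
        defaults.put("lang", "en");       // assumed default of a select
        System.out.println(buildFormUrl("/search", defaults));
        // prints "/search?q=web+archive&lang=en"
    }
}
```

An extractor that only looked at the action attribute would yield /search here; including the defaults yields a URL that is closer to what an unmodified form submission would request.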
Unfortunately this parser has one major drawback: it has to read the whole
document into memory for parsing, and so carries an inherent OutOfMemoryError
(OOME) risk. This risk can be reduced or eliminated by limiting the size of
documents to be parsed (e.g. using NotExceedsDocumentLengthTresholdDecideRule).
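The size limit itself is configured through a decide rule in the crawl settings, not in code. As a rough illustration of the underlying idea only, the sketch below checks a document's length before it would be handed to a whole-document parser. The class SizeGuardSketch, the method smallEnoughToParse, and the limit value are assumptions made for this example and are not Heritrix defaults or API.

```java
// Illustrative only: not the decide rule's implementation.
class SizeGuardSketch {

    // Assumed limit for the example; choose a value that fits the heap.
    static final long MAX_PARSE_CHARS = 1_000_000L;

    // Returns true when the content is short enough to parse in memory.
    static boolean smallEnoughToParse(CharSequence cs, long maxChars) {
        return cs.length() <= maxChars;
    }

    public static void main(String[] args) {
        CharSequence doc = "<html><body><a href=\"/a\">a</a></body></html>";
        // Under the assumed limit: the document would be parsed.
        System.out.println(smallEnoughToParse(doc, MAX_PARSE_CHARS));
        // With a deliberately tiny limit: the document would be skipped.
        System.out.println(smallEnoughToParse(doc, 10));
    }
}
```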
Also note that this extractor seems to have lower overall memory
consumption than ExtractorHTML (still to be confirmed on a
larger-scale crawl).
- Version:
- $Date: 2010-04-21 23:39:57 +0000 (Wed, 21 Apr 2010) $ $Revision: 6830 $
- Author:
- Olaf Freyer
- See Also:
- Serialized Form
Fields inherited from class org.archive.crawler.extractor.ExtractorHTML |
APPLET, ATTR_EXTRACT_JAVASCRIPT, ATTR_EXTRACT_ONLY_FORM_GETS, ATTR_IGNORE_FORM_ACTION_URLS, ATTR_IGNORE_UNEXPECTED_HTML, ATTR_TREAT_FRAMES_AS_EMBED_LINKS, BASE, CLASSEXT, EACH_ATTRIBUTE_EXTRACTOR, EXTRACT_VALUE_ATTRIBUTES, FRAME, IFRAME, JAVASCRIPT, LINK, MAX_ATTR_VAL_LENGTH, NON_HTML_PATH_EXTENSION, numberOfCURIsHandled, numberOfLinksExtracted, RELEVANT_TAG_EXTRACTOR, WHITESPACE |
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants |
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX |
Method Summary |
(package private) void |
extract(CrawlURI curi,
java.lang.CharSequence cs)
Run extractor. |
protected void |
processForm(CrawlURI curi,
au.id.jericho.lib.html.Element element)
|
protected void |
processGeneralTag(CrawlURI curi,
au.id.jericho.lib.html.Element element,
au.id.jericho.lib.html.Attributes attributes)
|
protected boolean |
processMeta(CrawlURI curi,
au.id.jericho.lib.html.Element element)
|
protected void |
processScript(CrawlURI curi,
au.id.jericho.lib.html.Element element)
|
protected void |
processStyle(CrawlURI curi,
au.id.jericho.lib.html.Element element)
|
java.lang.String |
report()
Compiles and returns a report (in human readable form) about the status
of the processor. |
Methods inherited from class org.archive.crawler.extractor.ExtractorHTML |
addLinkFromString, considerIfLikelyUri, considerQueryStringValues, extract, isHtmlExpectedHere, processEmbed, processEmbed, processGeneralTag, processLink, processMeta, processScript, processScriptCode, processStyle |
Methods inherited from class org.archive.crawler.framework.Processor |
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, kickUpdate, process, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn |
Methods inherited from class org.archive.crawler.settings.ComplexType |
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute |
Methods inherited from class org.archive.crawler.settings.Type |
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
Methods inherited from class javax.management.Attribute |
getName, hashCode |
Methods inherited from class java.lang.Object |
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
numberOfFormsProcessed
protected long numberOfFormsProcessed
JerichoExtractorHTML
public JerichoExtractorHTML(java.lang.String name)
JerichoExtractorHTML
public JerichoExtractorHTML(java.lang.String name,
java.lang.String description)
processGeneralTag
protected void processGeneralTag(CrawlURI curi,
au.id.jericho.lib.html.Element element,
au.id.jericho.lib.html.Attributes attributes)
processMeta
protected boolean processMeta(CrawlURI curi,
au.id.jericho.lib.html.Element element)
processScript
protected void processScript(CrawlURI curi,
au.id.jericho.lib.html.Element element)
processStyle
protected void processStyle(CrawlURI curi,
au.id.jericho.lib.html.Element element)
processForm
protected void processForm(CrawlURI curi,
au.id.jericho.lib.html.Element element)
extract
void extract(CrawlURI curi,
java.lang.CharSequence cs)
- Run extractor. This method is package visible to ease testing.
- Overrides:
extract
in class ExtractorHTML
- Parameters:
curi
- CrawlURI we're processing.
cs
- Sequence from underlying ReplayCharSequence.
report
public java.lang.String report()
- Description copied from class:
Processor
- Compiles and returns a report (in human readable form) about the status
of the processor. The processor's name (of implementing class) should
always be included.
Examples of stats declared would include:
* Number of CrawlURIs handled.
* Number of links extracted (for link extractors)
etc.
- Overrides:
report
in class ExtractorHTML
- Returns:
- A human readable report on the processor's state.
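A report of the kind described above could look roughly like the following sketch. The counters mirror the documented fields numberOfCURIsHandled, numberOfLinksExtracted and numberOfFormsProcessed, but the class ReportSketch, the method signature, and the exact layout of the text are assumptions; the real output format is not specified here.

```java
// Illustrative only: the actual report layout may differ.
class ReportSketch {

    // Builds a human-readable status report that names the processor
    // and lists its counters, one per line.
    static String report(String processorName, long curisHandled,
                         long linksExtracted, long formsProcessed) {
        StringBuilder sb = new StringBuilder();
        sb.append("Processor: ").append(processorName).append('\n');
        sb.append("  CrawlURIs handled: ").append(curisHandled).append('\n');
        sb.append("  Forms processed:   ").append(formsProcessed).append('\n');
        sb.append("  Links extracted:   ").append(linksExtracted).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample numbers are made up for the example.
        System.out.print(report(
                "org.archive.crawler.extractor.JerichoExtractorHTML",
                120, 3400, 15));
    }
}
```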
Copyright © 2003-2011 Internet Archive. All Rights Reserved.