org.archive.crawler.extractor
Class AggressiveExtractorHTML
java.lang.Object
  javax.management.Attribute
    org.archive.crawler.settings.Type
      org.archive.crawler.settings.ComplexType
        org.archive.crawler.settings.ModuleType
          org.archive.crawler.framework.Processor
            org.archive.crawler.extractor.Extractor
              org.archive.crawler.extractor.ExtractorHTML
                org.archive.crawler.extractor.AggressiveExtractorHTML
- All Implemented Interfaces:
- java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants
public class AggressiveExtractorHTML
extends ExtractorHTML
Extended version of ExtractorHTML with more aggressive JavaScript link
extraction: JavaScript code is parsed first with the general HTML tag
regexp, and then with the speculative JavaScript link regexp.
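The two-pass idea described above can be illustrated with a minimal, stand-alone sketch. It is not the Heritrix implementation: the class name TwoPassScriptLinkSketch and both regular expressions are assumptions made up for illustration, and the real extractor's patterns are considerably more involved. The sketch only shows the order of the passes: script text is first scanned as if it were HTML, and then scanned again for string literals that look like links.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not part of Heritrix.
class TwoPassScriptLinkSketch {

    // Pass 1 (assumed pattern): href/src attribute values of HTML tags
    // that appear inside script text, e.g. in document.write(...) calls.
    static final Pattern TAG_ATTRIBUTE = Pattern.compile(
            "(?i)<[a-z][^>]*?\\b(?:href|src)\\s*=\\s*\\\\?[\"']([^\"'\\\\>]+)");

    // Pass 2 (assumed pattern): quoted string literals that look like
    // absolute URLs or root-relative paths.
    static final Pattern SPECULATIVE_LINK = Pattern.compile(
            "[\"']((?:https?://|/)[^\"'\\s<>]+)[\"']");

    // Runs both passes in order and returns the candidates without duplicates.
    static Set<String> extract(CharSequence script) {
        Set<String> links = new LinkedHashSet<>();
        collect(TAG_ATTRIBUTE, script, links);
        collect(SPECULATIVE_LINK, script, links);
        return links;
    }

    // Adds capture group 1 of every match of the pattern to the output set.
    private static void collect(Pattern pattern, CharSequence text, Set<String> out) {
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            out.add(matcher.group(1));
        }
    }

    public static void main(String[] args) {
        String script = "document.write(\"<a href='inner.html'>x</a>\");"
                + " var next = '/images/logo.png';";
        System.out.println(extract(script));
    }
}
```

In this sketch the first pass finds inner.html and the second pass finds /images/logo.png. Scanning script text as HTML is likely to produce more false positives than a plain HTML extractor would, which is the trade-off the word "aggressive" suggests.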
- Author:
- Igor Ranitovic
- See Also:
- Serialized Form
Field Summary |
(package private) static java.util.logging.Logger | logger |
Fields inherited from class org.archive.crawler.extractor.ExtractorHTML |
APPLET, ATTR_EXTRACT_JAVASCRIPT, ATTR_EXTRACT_ONLY_FORM_GETS, ATTR_IGNORE_FORM_ACTION_URLS, ATTR_IGNORE_UNEXPECTED_HTML, ATTR_TREAT_FRAMES_AS_EMBED_LINKS, BASE, CLASSEXT, EACH_ATTRIBUTE_EXTRACTOR, EXTRACT_VALUE_ATTRIBUTES, FRAME, IFRAME, JAVASCRIPT, LINK, MAX_ATTR_VAL_LENGTH, NON_HTML_PATH_EXTENSION, numberOfCURIsHandled, numberOfLinksExtracted, RELEVANT_TAG_EXTRACTOR, WHITESPACE |
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants |
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX |
Method Summary |
protected void | processScript(CrawlURI curi, java.lang.CharSequence sequence, int endOfOpenTag) |
java.lang.String | report(): Compiles and returns a report (in human readable form) about the status of the processor. |
Methods inherited from class org.archive.crawler.extractor.ExtractorHTML |
addLinkFromString, considerIfLikelyUri, considerQueryStringValues, extract, extract, isHtmlExpectedHere, processEmbed, processEmbed, processGeneralTag, processLink, processMeta, processScriptCode, processStyle |
Methods inherited from class org.archive.crawler.framework.Processor |
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, kickUpdate, process, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn |
Methods inherited from class org.archive.crawler.settings.ComplexType |
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute |
Methods inherited from class org.archive.crawler.settings.Type |
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
Methods inherited from class javax.management.Attribute |
getName, hashCode |
Methods inherited from class java.lang.Object |
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
logger
static java.util.logging.Logger logger
AggressiveExtractorHTML
public AggressiveExtractorHTML(java.lang.String name)
processScript
protected void processScript(CrawlURI curi,
java.lang.CharSequence sequence,
int endOfOpenTag)
- Overrides:
- processScript in class ExtractorHTML
report
public java.lang.String report()
- Description copied from class: Processor
- Compiles and returns a human-readable report on the status of the
processor. The report should always include the processor's name (the
name of the implementing class).
Examples of statistics to report include:
* Number of CrawlURIs handled.
* Number of links extracted (for link extractors).
- Overrides:
- report in class ExtractorHTML
- Returns:
- A human readable report on the processor's state.
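As a rough illustration of the contract described for report(), the sketch below builds a report string from two counters. It is a stand-alone example under stated assumptions: the class ReportSketch, its recordUri helper, and the exact text layout are hypothetical and do not reproduce the output of the real class. Only the counter names mirror the inherited fields numberOfCURIsHandled and numberOfLinksExtracted.

```java
// Hypothetical stand-alone sketch; it does not extend the real Processor class.
class ReportSketch {
    private final String name;
    private long numberOfCURIsHandled;
    private long numberOfLinksExtracted;

    ReportSketch(String name) {
        this.name = name;
    }

    // Records one processed URI and the number of links found in it.
    void recordUri(int linksFound) {
        numberOfCURIsHandled++;
        numberOfLinksExtracted += linksFound;
    }

    // Builds a human-readable report that names the implementing class
    // and lists the two counters (the layout is an assumption).
    String report() {
        StringBuilder sb = new StringBuilder();
        sb.append("Processor: ").append(getClass().getSimpleName())
          .append(" (").append(name).append(")\n");
        sb.append("  CrawlURIs handled: ").append(numberOfCURIsHandled).append('\n');
        sb.append("  Links extracted: ").append(numberOfLinksExtracted).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        ReportSketch sketch = new ReportSketch("aggressive-html");
        sketch.recordUri(3);
        sketch.recordUri(2);
        System.out.print(sketch.report());
    }
}
```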
Copyright © 2003-2011 Internet Archive. All Rights Reserved.