org.archive.crawler.extractor
Class AggressiveExtractorHTML

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.extractor.Extractor
                          extended by org.archive.crawler.extractor.ExtractorHTML
                              extended by org.archive.crawler.extractor.AggressiveExtractorHTML
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants

public class AggressiveExtractorHTML
extends ExtractorHTML

Extended version of ExtractorHTML with more aggressive JavaScript link extraction: JavaScript code is parsed first with the general HTML tags regexp, and then with the JavaScript speculative link regexp.
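
The two-pass idea described above can be illustrated with a small standalone sketch. This is not the Heritrix implementation: the class name TwoPassScriptSketch, the method extract, and both regular expressions are hypothetical stand-ins chosen for illustration, and the real patterns used by ExtractorHTML are likely more elaborate.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; names and patterns are assumptions, not Heritrix code. */
final class TwoPassScriptSketch {

    // Pass 1 (assumed pattern): href/src attribute values of HTML tags embedded in script text.
    private static final Pattern TAG_ATTR = Pattern.compile(
        "<[a-zA-Z][^>]*?\\s(?:href|src)\\s*=\\s*[\"']([^\"'>]+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Pass 2 (assumed pattern): quoted strings that look like absolute URLs or rooted paths.
    private static final Pattern SPECULATIVE = Pattern.compile(
        "[\"']((?:https?://|/)[^\"'\\s<>]+)[\"']");

    /** Runs both passes over the script text and returns candidate links in discovery order. */
    static List<String> extract(CharSequence script) {
        Set<String> links = new LinkedHashSet<>(); // keeps order, drops duplicates
        collect(TAG_ATTR, script, links);          // first: general HTML tag pass
        collect(SPECULATIVE, script, links);       // then: speculative string pass
        return new ArrayList<>(links);
    }

    private static void collect(Pattern pattern, CharSequence text, Set<String> out) {
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
    }

    public static void main(String[] args) {
        String script =
            "document.write('<a href=\"/about.html\">About</a>'); var next = \"/js/page2.html\";";
        for (String link : extract(script)) {
            System.out.println(link);
        }
    }
}
```

In this sketch the tag pass finds links written into markup by the script, while the speculative pass picks up plain string literals that merely resemble URLs, so the second pass can produce false positives that a crawler would need to tolerate.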

Author:
Igor Ranitovic
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
(package private) static java.util.logging.Logger logger
           
 
Fields inherited from class org.archive.crawler.extractor.ExtractorHTML
APPLET, ATTR_EXTRACT_JAVASCRIPT, ATTR_EXTRACT_ONLY_FORM_GETS, ATTR_IGNORE_FORM_ACTION_URLS, ATTR_IGNORE_UNEXPECTED_HTML, ATTR_TREAT_FRAMES_AS_EMBED_LINKS, BASE, CLASSEXT, EACH_ATTRIBUTE_EXTRACTOR, EXTRACT_VALUE_ATTRIBUTES, FRAME, IFRAME, JAVASCRIPT, LINK, MAX_ATTR_VAL_LENGTH, NON_HTML_PATH_EXTENSION, numberOfCURIsHandled, numberOfLinksExtracted, RELEVANT_TAG_EXTRACTOR, WHITESPACE
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Constructor Summary
AggressiveExtractorHTML(java.lang.String name)
           
 
Method Summary
protected  void processScript(CrawlURI curi, java.lang.CharSequence sequence, int endOfOpenTag)
           
 java.lang.String report()
          Compiles and returns a report (in human readable form) about the status of the processor.
 
Methods inherited from class org.archive.crawler.extractor.ExtractorHTML
addLinkFromString, considerIfLikelyUri, considerQueryStringValues, extract, extract, isHtmlExpectedHere, processEmbed, processEmbed, processGeneralTag, processLink, processMeta, processScriptCode, processStyle
 
Methods inherited from class org.archive.crawler.extractor.Extractor
innerProcess, isHttpTransactionContentToProcess, isIndependentExtractors
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, kickUpdate, process, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

logger

static java.util.logging.Logger logger
Constructor Detail

AggressiveExtractorHTML

public AggressiveExtractorHTML(java.lang.String name)
Method Detail

processScript

protected void processScript(CrawlURI curi,
                             java.lang.CharSequence sequence,
                             int endOfOpenTag)
Overrides:
processScript in class ExtractorHTML

report

public java.lang.String report()
Description copied from class: Processor
Compiles and returns a human-readable report on the status of the processor. The report should always include the processor's name (the name of the implementing class).

Examples of statistics to report include:
* Number of CrawlURIs handled.
* Number of links extracted (for link extractors).
etc.

Overrides:
report in class ExtractorHTML
Returns:
A human readable report on the processor's state.
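
As a rough illustration of the kind of report described above, the following standalone sketch builds a text report from two counters. The class ReportSketch, its record method, and the exact report layout are hypothetical; only the counter names mirror the numberOfCURIsHandled and numberOfLinksExtracted fields listed for ExtractorHTML, and the actual output format of report() may differ.

```java
/** Illustrative sketch only; layout and helper names are assumptions, not Heritrix code. */
final class ReportSketch {
    private final String processorName;
    private long numberOfCURIsHandled;
    private long numberOfLinksExtracted;

    ReportSketch(String processorName) {
        this.processorName = processorName;
    }

    /** Hypothetical hook: call once per processed URI with the number of links found. */
    void record(int linksFound) {
        numberOfCURIsHandled++;
        numberOfLinksExtracted += linksFound;
    }

    /** Builds a human-readable report that always starts with the processor's name. */
    String report() {
        StringBuilder sb = new StringBuilder();
        sb.append("Processor: ").append(processorName).append('\n');
        sb.append("  CrawlURIs handled: ").append(numberOfCURIsHandled).append('\n');
        sb.append("  Links extracted:   ").append(numberOfLinksExtracted).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        ReportSketch sketch = new ReportSketch("demo.ReportSketch");
        sketch.record(3);
        sketch.record(2);
        System.out.print(sketch.report());
    }
}
```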


Copyright © 2003-2011 Internet Archive. All Rights Reserved.