org.archive.crawler.extractor
Class Extractor

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.extractor.Extractor
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean
Direct Known Subclasses:
ExtractorCSS, ExtractorDOC, ExtractorHTML, ExtractorImpliedURI, ExtractorJS, ExtractorPDF, ExtractorSWF, ExtractorUniversal, ExtractorURI, ExtractorXML, TrapSuppressExtractor

public abstract class Extractor
extends Processor

Convenience shared superclass for Extractor Processors. Currently it only wraps the Extractor-specific extract() action with a StackOverflowError catch/log/proceed handler, so that an extractor that recurses too deeply on problematic input suffers only a local error and normal processing of other CrawlURIs can continue.

See: [ 1122836 ] Localize StackOverflowError in Extractors
http://sourceforge.net/tracker/index.php?func=detail&aid=1122836&group_id=73833&atid=539099

This class could also become home to common utility features of extractors, such as a running tally of the URIs examined/discovered.
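
The catch/log/proceed handler described above can be illustrated with a minimal, self-contained sketch. It does not use the Heritrix classes: Uri, GuardedExtractor, RecursiveExtractor, the nesting field and the localErrors list are hypothetical stand-ins chosen for illustration, and the real innerProcess() may log or annotate differently.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: every type here is a hypothetical stand-in, not a Heritrix class.
public class ExtractorSketch {

    /** Stand-in for a URI under processing; records errors local to it. */
    static class Uri {
        final String name;
        final int nesting; // how deeply the (pretend) content is nested
        final List<String> localErrors = new ArrayList<>();

        Uri(String name, int nesting) {
            this.name = name;
            this.nesting = nesting;
        }
    }

    /** Stand-in for the shared superclass: guards extract() against stack overflow. */
    static abstract class GuardedExtractor {
        protected abstract void extract(Uri uri);

        public void innerProcess(Uri uri) {
            try {
                extract(uri);
            } catch (StackOverflowError soe) {
                // Record the failure against this URI only, then proceed.
                uri.localErrors.add(getClass().getSimpleName() + ": " + soe.getClass().getName());
            }
        }
    }

    /** Recurses once per nesting level, so very deep input overflows the stack. */
    static class RecursiveExtractor extends GuardedExtractor {
        int examined = 0;

        @Override
        protected void extract(Uri uri) {
            examined += depth(uri.nesting);
        }

        private int depth(int remaining) {
            return remaining <= 0 ? 0 : 1 + depth(remaining - 1);
        }
    }

    public static void main(String[] args) {
        RecursiveExtractor ex = new RecursiveExtractor();
        Uri deep = new Uri("deep", 50_000_000);
        Uri shallow = new Uri("shallow", 3);
        ex.innerProcess(deep);    // overflows; the error stays local to 'deep'
        ex.innerProcess(shallow); // still processed normally afterwards
        // prints "1 0 3"
        System.out.println(deep.localErrors.size() + " " + shallow.localErrors.size() + " " + ex.examined);
    }
}
```

Placing the guard in the superclass means each concrete extractor only implements extract() and gets the same containment behaviour without repeating the try/catch.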

Author:
gojomo
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Constructor Summary
Extractor(java.lang.String name, java.lang.String description)
          Passthrough constructor.
 
Method Summary
protected abstract  void extract(CrawlURI curi)
           
 void innerProcess(CrawlURI curi)
          Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.
protected  boolean isHttpTransactionContentToProcess(CrawlURI curi)
           
protected  boolean isIndependentExtractors()
           
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

Extractor

public Extractor(java.lang.String name,
                 java.lang.String description)
Passthrough constructor.

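Parameters:
name - Name of this module, passed through to the Processor constructor.
description - Description of this module, passed through to the Processor constructor.

Method Detail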

innerProcess

public void innerProcess(CrawlURI curi)
Description copied from class: Processor
Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.

Overrides:
innerProcess in class Processor
Parameters:
curi - The CrawlURI being processed.

isIndependentExtractors

protected boolean isIndependentExtractors()

isHttpTransactionContentToProcess

protected boolean isHttpTransactionContentToProcess(CrawlURI curi)
Overrides:
isHttpTransactionContentToProcess in class Processor
Parameters:
curi - CrawlURI to examine.
Returns:
true if the setting CrawlOrder.ATTR_INDEPENDENT_EXTRACTORS is disabled or CrawlURI.hasBeenLinkExtracted() is false, and Processor.isHttpTransactionContentToProcess(CrawlURI) is true.

extract

protected abstract void extract(CrawlURI curi)


Copyright © 2003-2011 Internet Archive. All Rights Reserved.