org.archive.crawler.extractor
Class Extractor
java.lang.Object
  javax.management.Attribute
    org.archive.crawler.settings.Type
      org.archive.crawler.settings.ComplexType
        org.archive.crawler.settings.ModuleType
          org.archive.crawler.framework.Processor
            org.archive.crawler.extractor.Extractor
- All Implemented Interfaces:
- java.io.Serializable, javax.management.DynamicMBean
- Direct Known Subclasses:
- ExtractorCSS, ExtractorDOC, ExtractorHTML, ExtractorImpliedURI, ExtractorJS, ExtractorPDF, ExtractorSWF, ExtractorUniversal, ExtractorURI, ExtractorXML, TrapSuppressExtractor
public abstract class Extractor
- extends Processor
Convenience shared superclass for Extractor Processors.
Currently it only wraps the Extractor-specific extract() action with
a StackOverflowError catch/log/proceed handler, so that any
extractor that recurses too deeply on problematic input suffers
only a local error, while normal processing of the CrawlURI and
the rest of the crawl can continue. See:
[ 1122836 ] Localize StackOverflowError in Extractors
http://sourceforge.net/tracker/index.php?func=detail&aid=1122836&group_id=73833&atid=539099
This class could also become home to common utility features
of extractors, like a running tally of the URIs examined/discovered,
etc.
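The catch/log/proceed wrapping described above can be sketched roughly as follows. This is a minimal standalone sketch, not the actual Heritrix source: CrawlURI is stubbed, and the logging details and names are assumptions based on the description.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Minimal sketch of the catch/log/proceed wrapping described above.
// CrawlURI is stubbed and names are illustrative, not copied from
// the actual Heritrix source.
abstract class ExtractorSketch {
    private static final Logger logger =
        Logger.getLogger(ExtractorSketch.class.getName());

    // Stand-in for org.archive.crawler.datamodel.CrawlURI.
    static class CrawlURI {
        private final String uri;
        CrawlURI(String uri) { this.uri = uri; }
        @Override public String toString() { return uri; }
    }

    /**
     * Wraps the extractor-specific extract() so that a
     * StackOverflowError on one URI is logged as a local error and
     * processing of other URIs can continue.
     */
    public void innerProcess(CrawlURI curi) {
        try {
            extract(curi);
        } catch (StackOverflowError err) {
            // Log and proceed: only this URI's link extraction is lost.
            logger.log(Level.WARNING,
                "StackOverflowError while extracting " + curi, err);
        }
    }

    /** Extractor-specific link extraction, supplied by subclasses. */
    protected abstract void extract(CrawlURI curi);
}
```

The point of catching an Error here (normally avoided) is containment: a runaway recursion in one extractor should cost only that URI's extraction, not the crawler thread.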
- Author:
- gojomo
- See Also:
- Serialized Form
Constructor Summary
Extractor(java.lang.String name, java.lang.String description)
    Passthrough constructor.
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn

Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute

Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient

Methods inherited from class javax.management.Attribute
getName, hashCode

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Extractor
public Extractor(java.lang.String name,
java.lang.String description)
- Passthrough constructor.
- Parameters:
  - name
  - description
innerProcess
public void innerProcess(CrawlURI curi)
- Description copied from class: Processor
- Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.
- Overrides: innerProcess in class Processor
- Parameters:
  - curi - The CrawlURI being processed.
isIndependentExtractors
protected boolean isIndependentExtractors()
isHttpTransactionContentToProcess
protected boolean isHttpTransactionContentToProcess(CrawlURI curi)
- Overrides: isHttpTransactionContentToProcess in class Processor
- Parameters:
  - curi - CrawlURI to examine.
- Returns: true if the setting CrawlOrder.ATTR_INDEPENDENT_EXTRACTORS is disabled or CrawlURI.hasBeenLinkExtracted() is false, and Processor.isHttpTransactionContentToProcess(CrawlURI) is true.
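The Returns clause amounts to a simple boolean gate, which can be sketched as below. The class, method, and parameter names here are illustrative stand-ins, not the real Heritrix settings API; the logic mirrors the documented return condition.

```java
// Sketch of the boolean gate in the Returns clause above. Names are
// illustrative stand-ins, not the real Heritrix settings API.
class ExtractGate {
    /**
     * @param independentExtractorsEnabled stand-in for the
     *        CrawlOrder.ATTR_INDEPENDENT_EXTRACTORS setting
     * @param alreadyLinkExtracted stand-in for
     *        CrawlURI.hasBeenLinkExtracted()
     * @param httpContentToProcess stand-in for
     *        Processor.isHttpTransactionContentToProcess(CrawlURI)
     */
    static boolean shouldExtract(boolean independentExtractorsEnabled,
                                 boolean alreadyLinkExtracted,
                                 boolean httpContentToProcess) {
        // (setting disabled OR links not yet extracted) AND there is
        // HTTP transaction content worth processing.
        return (!independentExtractorsEnabled || !alreadyLinkExtracted)
                && httpContentToProcess;
    }
}
```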
extract
protected abstract void extract(CrawlURI curi)
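Concrete subclasses supply the link-discovery logic by implementing extract(). A hypothetical minimal subclass might look like the following; the base class and CrawlURI are stubbed so the sketch is self-contained, and ExtractorPlainText with its URL regex is invented for illustration, not an actual Heritrix extractor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in for the abstract Extractor base class.
abstract class ExtractorBase {
    protected abstract void extract(CrawlURIStub curi);
    public void innerProcess(CrawlURIStub curi) { extract(curi); }
}

// Stand-in for CrawlURI, holding fetched content and discovered links.
class CrawlURIStub {
    final String content;
    final List<String> outLinks = new ArrayList<>();
    CrawlURIStub(String content) { this.content = content; }
}

// Hypothetical extractor: scans plain text for absolute URLs and
// records each one as a discovered out-link.
class ExtractorPlainText extends ExtractorBase {
    private static final Pattern URL =
        Pattern.compile("https?://[^\\s\"<>]+");

    @Override
    protected void extract(CrawlURIStub curi) {
        Matcher m = URL.matcher(curi.content);
        while (m.find()) {
            curi.outLinks.add(m.group());
        }
    }
}
```

Real subclasses such as ExtractorHTML or ExtractorJS follow the same shape: parse the fetched content for their format and add discovered URIs, leaving error containment to the innerProcess() wrapper in Extractor.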
Copyright © 2003-2011 Internet Archive. All Rights Reserved.