org.archive.crawler.extractor
Class Extractor
java.lang.Object
  javax.management.Attribute
    org.archive.crawler.settings.Type
      org.archive.crawler.settings.ComplexType
        org.archive.crawler.settings.ModuleType
          org.archive.crawler.framework.Processor
            org.archive.crawler.extractor.Extractor
- All Implemented Interfaces:
- java.io.Serializable, javax.management.DynamicMBean
- Direct Known Subclasses:
- ExtractorCSS, ExtractorDOC, ExtractorHTML, ExtractorImpliedURI, ExtractorJS, ExtractorPDF, ExtractorSWF, ExtractorUniversal, ExtractorURI, ExtractorXML, TrapSuppressExtractor
public abstract class Extractor
- extends Processor
Convenience shared superclass for Extractor Processors.
Currently it only wraps the Extractor-specific extract() action with
a StackOverflowError catch/log/proceed handler, so that any
extractor that recurses too deeply on problematic input suffers
only a local error, while normal processing of the CrawlURI and
the rest of the crawl can continue. See:
[ 1122836 ] Localize StackOverflowError in Extractors
http://sourceforge.net/tracker/index.php?func=detail&aid=1122836&group_id=73833&atid=539099
This class could also become home to common utility features
of extractors, like a running tally of the URIs examined/discovered,
etc.
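The catch/log/proceed wrapping described above can be sketched roughly as follows. This is a minimal standalone sketch, not the actual Heritrix source: CrawlURI is stubbed, and the logging details and names are assumptions based on the description.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Minimal sketch of the catch/log/proceed wrapping described above.
// CrawlURI is stubbed and names are illustrative, not copied from
// the actual Heritrix source.
abstract class ExtractorSketch {
    private static final Logger logger =
        Logger.getLogger(ExtractorSketch.class.getName());

    // Stand-in for org.archive.crawler.datamodel.CrawlURI.
    static class CrawlURI {
        private final String uri;
        CrawlURI(String uri) { this.uri = uri; }
        @Override public String toString() { return uri; }
    }

    /**
     * Wraps the extractor-specific extract() so that a
     * StackOverflowError on one URI is logged as a local error and
     * processing of other URIs can continue.
     */
    public void innerProcess(CrawlURI curi) {
        try {
            extract(curi);
        } catch (StackOverflowError err) {
            // Log and proceed: only this URI's link extraction is lost.
            logger.log(Level.WARNING,
                "StackOverflowError while extracting " + curi, err);
        }
    }

    /** Extractor-specific link extraction, supplied by subclasses. */
    protected abstract void extract(CrawlURI curi);
}
```

The point of catching an Error here (normally avoided) is containment: a runaway recursion in one extractor should cost only that URI's extraction, not the crawler thread.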
- Author:
- gojomo
- See Also:
- Serialized Form
Constructor Summary
Extractor(java.lang.String name, java.lang.String description)
    Passthrough constructor.
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn

Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute

Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient

Methods inherited from class javax.management.Attribute
getName, hashCode

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Extractor
public Extractor(java.lang.String name,
java.lang.String description)
- Passthrough constructor.
- Parameters:
  - name
  - description
innerProcess
public void innerProcess(CrawlURI curi)
- Description copied from class: Processor
- Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.
- Overrides: innerProcess in class Processor
- Parameters:
  - curi - The CrawlURI being processed.
isIndependentExtractors
protected boolean isIndependentExtractors()
isHttpTransactionContentToProcess
protected boolean isHttpTransactionContentToProcess(CrawlURI curi)
- Overrides: isHttpTransactionContentToProcess in class Processor
- Parameters:
  - curi - CrawlURI to examine.
- Returns: true if the setting CrawlOrder.ATTR_INDEPENDENT_EXTRACTORS is disabled or CrawlURI.hasBeenLinkExtracted() is false, and Processor.isHttpTransactionContentToProcess(CrawlURI) is true.
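The Returns clause amounts to a simple boolean gate, which can be sketched as below. The class, method, and parameter names here are illustrative stand-ins, not the real Heritrix settings API; the logic mirrors the documented return condition.

```java
// Sketch of the boolean gate in the Returns clause above. Names are
// illustrative stand-ins, not the real Heritrix settings API.
class ExtractGate {
    /**
     * @param independentExtractorsEnabled stand-in for the
     *        CrawlOrder.ATTR_INDEPENDENT_EXTRACTORS setting
     * @param alreadyLinkExtracted stand-in for
     *        CrawlURI.hasBeenLinkExtracted()
     * @param httpContentToProcess stand-in for
     *        Processor.isHttpTransactionContentToProcess(CrawlURI)
     */
    static boolean shouldExtract(boolean independentExtractorsEnabled,
                                 boolean alreadyLinkExtracted,
                                 boolean httpContentToProcess) {
        // (setting disabled OR links not yet extracted) AND there is
        // HTTP transaction content worth processing.
        return (!independentExtractorsEnabled || !alreadyLinkExtracted)
                && httpContentToProcess;
    }
}
```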
extract
protected abstract void extract(CrawlURI curi)
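Concrete subclasses supply the link-discovery logic by implementing extract(). A hypothetical minimal subclass might look like the following; the base class and CrawlURI are stubbed so the sketch is self-contained, and ExtractorPlainText with its URL regex is invented for illustration, not an actual Heritrix extractor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in for the abstract Extractor base class.
abstract class ExtractorBase {
    protected abstract void extract(CrawlURIStub curi);
    public void innerProcess(CrawlURIStub curi) { extract(curi); }
}

// Stand-in for CrawlURI, holding fetched content and discovered links.
class CrawlURIStub {
    final String content;
    final List<String> outLinks = new ArrayList<>();
    CrawlURIStub(String content) { this.content = content; }
}

// Hypothetical extractor: scans plain text for absolute URLs and
// records each one as a discovered out-link.
class ExtractorPlainText extends ExtractorBase {
    private static final Pattern URL =
        Pattern.compile("https?://[^\\s\"<>]+");

    @Override
    protected void extract(CrawlURIStub curi) {
        Matcher m = URL.matcher(curi.content);
        while (m.find()) {
            curi.outLinks.add(m.group());
        }
    }
}
```

Real subclasses such as ExtractorHTML or ExtractorJS follow the same shape: parse the fetched content for their format and add discovered URIs, leaving error containment to the innerProcess() wrapper in Extractor.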
Copyright © 2003-2011 Internet Archive. All Rights Reserved.