org.archive.crawler.framework
Class Processor

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean
Direct Known Subclasses:
AcceptRevisitProcessor, BeanShellProcessor, ChangeEvaluator, CrawlMapper, CrawlStateUpdater, Extractor, ExtractorHTTP, FetchDNS, FetchFTP, FetchHistoryProcessor, FetchHTTP, FrontierScheduler, HTTPContentDigest, Kw3WriterProcessor, LowDiskPauseProcessor, MirrorWriterProcessor, PersistProcessor, PreconditionEnforcer, QuotaEnforcer, RejectRevisitProcessor, RuntimeLimitEnforcer, Scoper, WaitEvaluator, WriterPoolProcessor

public class Processor
extends ModuleType

Base class for URI processing classes.

Each URI is processed by a user-defined series of processors. This class provides the basic infrastructure for them but performs no processing itself. New processors are created by subclassing this class.

Subclasses should not trap InterruptedException. It should be allowed to propagate to the ToeThread executing the processor. Subclasses should also exit their main method (innerProcess()) immediately if the interrupted flag is set.
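
The interrupt-handling rule above can be illustrated with a minimal, self-contained sketch. It does not use the Heritrix classes: InterruptAwareSketch and its String[] work list are hypothetical stand-ins, and the body of checkForInterrupt() shown here is an assumption about how such a check could be written, not the documented implementation.

```java
// Hypothetical stand-in for a Processor subclass; not part of Heritrix.
class InterruptAwareSketch {
    /** Number of work items completed before any interrupt was noticed. */
    int handled = 0;

    /** Assumed shape of an interrupt check: throw if the thread's flag is set. */
    static void checkForInterrupt() throws InterruptedException {
        // Thread.interrupted() reads and clears the current thread's flag.
        if (Thread.interrupted()) {
            throw new InterruptedException("interrupted");
        }
    }

    /**
     * Work loop that checks the flag before each unit of work and lets
     * InterruptedException propagate to the caller instead of catching it.
     */
    void innerProcess(String[] items) throws InterruptedException {
        for (String item : items) {
            checkForInterrupt();
            if (item != null) {
                handled++;
            }
        }
    }
}
```

Because the exception is declared rather than caught, the calling thread (the ToeThread, in the real crawler) decides how to react to the interrupt.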

Author:
Gordon Mohr
See Also:
ToeThread, Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_DECIDE_RULES
          Key to use when asking settings for the decide-rules value.
static java.lang.String ATTR_ENABLED
          Key to use when asking settings for the enabled value.
protected  java.lang.String attrDecideRules
          Local name for decide-rules.
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Constructor Summary
Processor(java.lang.String name, java.lang.String description)
           
 
Method Summary
protected  void checkForInterrupt()
           
protected  void finalTasks()
          Subclasses should override this method to perform processor-specific actions.
 CrawlController getController()
          Get the controller object.
protected  DecideRule getDecideRule(java.lang.Object o)
           
 Processor getDefaultNextProcessor(CrawlURI curi)
          Returns the next processor for the given CrawlURI in the processor chain.
protected  void initialTasks()
          Subclasses should override this method to perform processor-specific actions.
protected  void innerProcess(CrawlURI curi)
          Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.
protected  void innerRejectProcess(CrawlURI curi)
           
protected  boolean isContentToProcess(CrawlURI curi)
           
 boolean isEnabled()
           
protected  boolean isExpectedMimeType(java.lang.String contentType, java.lang.String expectedPrefix)
           
protected  boolean isHttpTransactionContentToProcess(CrawlURI curi)
           
 void kickUpdate()
           
 void process(CrawlURI curi)
          Perform processing on the given CrawlURI.
 java.lang.String report()
          Compiles and returns a human-readable report on the status of the processor.
protected  boolean rulesAccept(DecideRule rule, java.lang.Object o)
           
protected  boolean rulesAccept(java.lang.Object o)
           
 void setDefaultNextProcessor(Processor nextProcessor)
          Set the default next processor in the chain.
 Processor spawn(int serialNum)
           
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

ATTR_DECIDE_RULES

public static final java.lang.String ATTR_DECIDE_RULES
Key to use when asking settings for the decide-rules value.

See Also:
Constant Field Values

attrDecideRules

protected java.lang.String attrDecideRules
Local name for decide-rules.


ATTR_ENABLED

public static final java.lang.String ATTR_ENABLED
Key to use when asking settings for the enabled value.

See Also:
Constant Field Values
Constructor Detail

Processor

public Processor(java.lang.String name,
                 java.lang.String description)
Parameters:
name -
description -
Method Detail

process

public final void process(CrawlURI curi)
                   throws java.lang.InterruptedException
Perform processing on the given CrawlURI.

Parameters:
curi -
Throws:
java.lang.InterruptedException
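
This page does not spell out how process() relates to innerProcess(), innerRejectProcess(), isEnabled() and rulesAccept(). The sketch below shows one plausible template-method arrangement, under the assumption that a disabled processor does nothing and that the rule check chooses between the two inner methods. DispatchSketch, its String URIs and its prefix-based rulesAccept() are hypothetical stand-ins, not Heritrix code.

```java
// Hypothetical stand-in illustrating a final process() that dispatches
// to overridable hooks; not the Heritrix implementation.
class DispatchSketch {
    /** Stand-in for the "enabled" setting. */
    boolean enabled = true;

    /** Records what happened to each URI, for demonstration only. */
    final java.util.List<String> log = new java.util.ArrayList<>();

    /** Stand-in for the decide-rule check: accept only http(s) URIs. */
    boolean rulesAccept(String uri) {
        return uri.startsWith("http");
    }

    /** Final entry point: subclasses customize the inner methods instead. */
    final void process(String uri) {
        if (!enabled) {
            return; // assumed: a disabled processor skips the URI entirely
        }
        if (rulesAccept(uri)) {
            innerProcess(uri);
        } else {
            innerRejectProcess(uri);
        }
    }

    /** Hook for accepted URIs. */
    void innerProcess(String uri) {
        log.add("processed " + uri);
    }

    /** Hook for URIs the rules did not accept. */
    void innerRejectProcess(String uri) {
        log.add("rejected " + uri);
    }
}
```

Declaring the entry point final keeps the enabled and rule checks in one place, so subclasses only supply the per-URI behavior.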

checkForInterrupt

protected void checkForInterrupt()
                          throws java.lang.InterruptedException
Throws:
java.lang.InterruptedException

innerRejectProcess

protected void innerRejectProcess(CrawlURI curi)
                           throws java.lang.InterruptedException
Parameters:
curi - CrawlURI instance.
Throws:
java.lang.InterruptedException

innerProcess

protected void innerProcess(CrawlURI curi)
                     throws java.lang.InterruptedException
Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.

Parameters:
curi - The CrawlURI being processed.
Throws:
java.lang.InterruptedException

initialTasks

protected void initialTasks()
Subclasses should override this method to perform processor-specific actions.

This method is guaranteed to be called after the crawl is set up, but before any URI processing has occurred.


finalTasks

protected void finalTasks()
Subclasses should override this method to perform processor-specific actions.


getDecideRule

protected DecideRule getDecideRule(java.lang.Object o)

rulesAccept

protected boolean rulesAccept(java.lang.Object o)

rulesAccept

protected boolean rulesAccept(DecideRule rule,
                              java.lang.Object o)

getDefaultNextProcessor

public Processor getDefaultNextProcessor(CrawlURI curi)
Returns the next processor for the given CrawlURI in the processor chain.

Parameters:
curi - The CrawlURI that we want to find the next processor for.
Returns:
The next processor for the given CrawlURI in the processor chain.

setDefaultNextProcessor

public void setDefaultNextProcessor(Processor nextProcessor)
Set the default next processor in the chain.

Parameters:
nextProcessor - the default next processor in the chain.
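
The default-next-processor pair describes a singly linked chain. A minimal sketch of that idea follows, assuming each step holds one reference to its successor. ChainSketch and Step are hypothetical stand-ins, and the getter here takes no CrawlURI argument, unlike the real getDefaultNextProcessor(CrawlURI).

```java
// Hypothetical stand-in for a linked chain of processors; not Heritrix code.
class ChainSketch {
    static class Step {
        final String name;
        private Step next; // null marks the end of the chain

        Step(String name) {
            this.name = name;
        }

        void setDefaultNextProcessor(Step nextProcessor) {
            this.next = nextProcessor;
        }

        Step getDefaultNextProcessor() {
            return next;
        }
    }

    /** Follows the default-next links from the first step and collects names. */
    static java.util.List<String> walk(Step first) {
        java.util.List<String> visited = new java.util.ArrayList<>();
        for (Step s = first; s != null; s = s.getDefaultNextProcessor()) {
            visited.add(s.name);
        }
        return visited;
    }
}
```

For example, linking steps named "fetch", "extract" and "write" in that order and calling walk() on the first visits them in the same order.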

getController

public CrawlController getController()
Get the controller object.

Returns:
the controller object.

spawn

public Processor spawn(int serialNum)

report

public java.lang.String report()
Compiles and returns a human-readable report on the status of the processor. The processor's name (that of the implementing class) should always be included.

Examples of stats to report include:
* Number of CrawlURIs handled.
* Number of links extracted (for link extractors).

Returns:
A human readable report on the processor's state.
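
A report along these lines could be assembled as in the sketch below. ReportSketch, its two counters and the exact line layout are hypothetical choices for illustration; the only requirement taken from the description above is that the implementing class name appears in the output.

```java
// Hypothetical stand-in showing one way to build a report string.
class ReportSketch {
    private long urisHandled = 0;
    private long linksExtracted = 0;

    /** Records one handled URI and the number of links found in it. */
    void record(int links) {
        urisHandled++;
        linksExtracted += links;
    }

    /** Builds a human-readable report that starts with the class name. */
    String report() {
        StringBuilder sb = new StringBuilder();
        sb.append("Processor: ").append(getClass().getName()).append('\n');
        sb.append("  CrawlURIs handled: ").append(urisHandled).append('\n');
        sb.append("  Links extracted: ").append(linksExtracted).append('\n');
        return sb.toString();
    }
}
```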

isContentToProcess

protected boolean isContentToProcess(CrawlURI curi)
Parameters:
curi - CrawlURI to examine.
Returns:
True if there is content to process, that is, if the content length is greater than 0.

isHttpTransactionContentToProcess

protected boolean isHttpTransactionContentToProcess(CrawlURI curi)
Parameters:
curi - CrawlURI to examine.
Returns:
True if isContentToProcess(CrawlURI) is true and the CrawlURI represents a successful HTTP transaction.

isExpectedMimeType

protected boolean isExpectedMimeType(java.lang.String contentType,
                                     java.lang.String expectedPrefix)
Parameters:
contentType - Found content type.
expectedPrefix - String to find at the start of the content type, e.g. text/html.
Returns:
True if the passed content type begins with the expected MIME type prefix.
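
The prefix test described above can be sketched as follows. The null check and the lower-casing of the found content type are assumptions made for this illustration, since the page states neither how a missing header nor how letter case is handled. MimeTypeSketch is a hypothetical stand-in class.

```java
// Hypothetical stand-in for the prefix check; not the Heritrix implementation.
class MimeTypeSketch {
    /**
     * Returns true when contentType starts with expectedPrefix.
     * Assumptions: a null contentType never matches, and the found
     * content type is lower-cased before comparison, so expectedPrefix
     * is expected to be given in lower case.
     */
    static boolean isExpectedMimeType(String contentType, String expectedPrefix) {
        return contentType != null
            && contentType.toLowerCase(java.util.Locale.ROOT).startsWith(expectedPrefix);
    }
}
```

Matching on a prefix rather than the whole string lets a header such as "text/html; charset=UTF-8" satisfy the expected prefix "text/html".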

kickUpdate

public void kickUpdate()

isEnabled

public boolean isEnabled()


Copyright © 2003-2011 Internet Archive. All Rights Reserved.