org.archive.crawler.processor
Class CrawlMapper

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.processor.CrawlMapper
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, FetchStatusCodes
Direct Known Subclasses:
HashCrawlMapper, LexicalCrawlMapper

public abstract class CrawlMapper
extends Processor
implements FetchStatusCodes

A simple crawl splitter/mapper, dividing up CandidateURIs/CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers). May operate on a CrawlURI (typically early in the processing chain) or its CandidateURI outlinks (late in the processing chain, after LinksScoper), or both (if inserted and configured in both places).

Applies a map() method, supplied by a concrete subclass, to classKeys to map URIs to crawlers by name.

One crawler name is distinguished as the 'local name'; URIs mapped to this name are not diverted, but continue to be processed normally.

When using the JMX importUris operation to import URIs diverted by a CrawlMapper instance, use the recoveryLog style.
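
The mapping-and-diversion behavior described above can be illustrated with a minimal, self-contained sketch. It does not use the Heritrix classes: the class DiversionSketch, its consider() method, and the in-memory lists standing in for diversion log files are all hypothetical names chosen for illustration, and the real processor works on CandidateURI/CrawlURI objects rather than plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative only: not the CrawlMapper implementation. */
class DiversionSketch {
    /** Stand-in for the subclass-supplied map(): class key to crawler name. */
    private final Function<String, String> mapper;
    /** The distinguished 'local name'; URIs mapped here are not diverted. */
    private final String localName;
    /** Stand-in for per-target diversion logs (the real class writes files). */
    private final Map<String, List<String>> diverted = new LinkedHashMap<>();

    DiversionSketch(String localName, Function<String, String> mapper) {
        this.localName = localName;
        this.mapper = mapper;
    }

    /** Returns true if the URI stays local, false if it was diverted. */
    boolean consider(String classKey, String uri) {
        String target = mapper.apply(classKey);
        if (localName.equals(target)) {
            return true; // continue normal processing
        }
        diverted.computeIfAbsent(target, k -> new ArrayList<>()).add(uri);
        return false;
    }

    Map<String, List<String>> diverted() {
        return diverted;
    }

    public static void main(String[] args) {
        // Hypothetical rule: keys sorting before "m" are local, the rest go elsewhere.
        DiversionSketch s = new DiversionSketch("crawler-a",
                key -> key.compareTo("m") < 0 ? "crawler-a" : "crawler-b");
        System.out.println(s.consider("example.com", "http://example.com/"));
        System.out.println(s.consider("zebra.example", "http://zebra.example/"));
        System.out.println(s.diverted());
    }
}
```

In this sketch a diverted URI is simply recorded under its target name; importing those records into the other crawler is outside its scope.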

Version:
$Date: 2007-06-07 21:34:56 +0000 (Thu, 07 Jun 2007) $, $Revision: 5199 $
Author:
gojomo
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_CHECK_OUTLINKS
          whether to map CrawlURI's outlinks (if CandidateURIs)
static java.lang.String ATTR_CHECK_URI
          whether to map CrawlURI itself (if status nonpositive)
static java.lang.String ATTR_DIVERSION_DIR
          where to log diversions
static java.lang.String ATTR_LOCAL_NAME
          name of local crawler (URIs mapped to here are not diverted)
static java.lang.String ATTR_MAP_OUTLINK_DECIDE_RULES
          decide rules to determine if an outlink is subject to mapping
static java.lang.String ATTR_ROTATION_DIGITS
          rotate logs when a change occurs within this number of leading timestamp digits
protected  ArrayLongFPCache cache
           
static java.lang.Boolean DEFAULT_CHECK_OUTLINKS
           
static java.lang.Boolean DEFAULT_CHECK_URI
           
static java.lang.String DEFAULT_DIVERSION_DIR
           
static java.lang.String DEFAULT_LOCAL_NAME
           
static java.lang.Integer DEFAULT_ROTATION_DIGITS
           
(package private)  java.util.HashMap<java.lang.String,java.io.PrintWriter> diversionLogs
          Mapping of target crawlers to logs (PrintWriters)
protected  java.lang.String localName
          name of the enclosing crawler (URIs mapped here stay put)
(package private)  java.lang.String logGeneration
          Truncated timestamp prefix for diversion logs; when current time doesn't match, it's time to close all current logs.
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
 
Constructor Summary
CrawlMapper(java.lang.String name, java.lang.String description)
          Constructor.
 
Method Summary
protected  boolean decideToMapOutlink(CandidateURI cauri)
           
protected  void divertLog(CandidateURI cauri, java.lang.String target)
          Note the given CandidateURI in the appropriate diversion log.
protected  java.io.PrintWriter getDiversionLog(java.lang.String target)
          Get the diversion log for a given target crawler node.
protected  DecideRule getMapOutlinkDecideRule(java.lang.Object o)
           
protected  void initialTasks()
          Classes subclassing this one should override this method to perform processor specific actions.
protected  void innerProcess(CrawlURI curi)
          Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.
protected abstract  java.lang.String map(CandidateURI cauri)
          Look up the crawler node name to which the given CandidateURI should be mapped.
protected  void updateGeneration(java.lang.String nowGeneration)
          Close and mark as finished all existing diversion logs, and arrange for new logs to use the new generation prefix.
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

ATTR_CHECK_URI

public static final java.lang.String ATTR_CHECK_URI
whether to map CrawlURI itself (if status nonpositive)

See Also:
Constant Field Values

DEFAULT_CHECK_URI

public static final java.lang.Boolean DEFAULT_CHECK_URI

ATTR_CHECK_OUTLINKS

public static final java.lang.String ATTR_CHECK_OUTLINKS
whether to map CrawlURI's outlinks (if CandidateURIs)

See Also:
Constant Field Values

DEFAULT_CHECK_OUTLINKS

public static final java.lang.Boolean DEFAULT_CHECK_OUTLINKS

ATTR_MAP_OUTLINK_DECIDE_RULES

public static final java.lang.String ATTR_MAP_OUTLINK_DECIDE_RULES
decide rules to determine if an outlink is subject to mapping

See Also:
Constant Field Values

ATTR_LOCAL_NAME

public static final java.lang.String ATTR_LOCAL_NAME
name of local crawler (URIs mapped to here are not diverted)

See Also:
Constant Field Values

DEFAULT_LOCAL_NAME

public static final java.lang.String DEFAULT_LOCAL_NAME
See Also:
Constant Field Values

ATTR_DIVERSION_DIR

public static final java.lang.String ATTR_DIVERSION_DIR
where to log diversions

See Also:
Constant Field Values

DEFAULT_DIVERSION_DIR

public static final java.lang.String DEFAULT_DIVERSION_DIR
See Also:
Constant Field Values

ATTR_ROTATION_DIGITS

public static final java.lang.String ATTR_ROTATION_DIGITS
rotate logs when a change occurs within this number of leading timestamp digits

See Also:
Constant Field Values

DEFAULT_ROTATION_DIGITS

public static final java.lang.Integer DEFAULT_ROTATION_DIGITS

diversionLogs

java.util.HashMap<java.lang.String,java.io.PrintWriter> diversionLogs
Mapping of target crawlers to logs (PrintWriters)


logGeneration

java.lang.String logGeneration
Truncated timestamp prefix for diversion logs; when current time doesn't match, it's time to close all current logs.


localName

protected java.lang.String localName
name of the enclosing crawler (URIs mapped here stay put)


cache

protected ArrayLongFPCache cache

Constructor Detail

CrawlMapper

public CrawlMapper(java.lang.String name,
                   java.lang.String description)
Constructor.

Parameters:
name - Name of this processor.
description - Description of this processor.

Method Detail

innerProcess

protected void innerProcess(CrawlURI curi)
Description copied from class: Processor
Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.

Overrides:
innerProcess in class Processor
Parameters:
curi - The CrawlURI being processed.

decideToMapOutlink

protected boolean decideToMapOutlink(CandidateURI cauri)

getMapOutlinkDecideRule

protected DecideRule getMapOutlinkDecideRule(java.lang.Object o)

updateGeneration

protected void updateGeneration(java.lang.String nowGeneration)
Close and mark as finished all existing diversion logs, and arrange for new logs to use the new generation prefix.

Parameters:
nowGeneration - new generation (timestamp prefix) to use
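
As a rough illustration of how a truncated timestamp prefix can drive log rotation, the sketch below derives a generation string from the leading digits of a timestamp and reports when it changes. The 14-digit yyyyMMddHHmmss format, the class GenerationSketch, and its method names are assumptions made for this example, not details confirmed by this page.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Illustrative only: not the CrawlMapper implementation. */
class GenerationSketch {
    // Assumed timestamp layout: 14 digits, yyyyMMddHHmmss.
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    private final int rotationDigits;
    private String logGeneration = "";

    GenerationSketch(int rotationDigits) {
        if (rotationDigits < 1 || rotationDigits > 14) {
            throw new IllegalArgumentException("rotationDigits must be 1..14");
        }
        this.rotationDigits = rotationDigits;
    }

    /** Truncate the timestamp to its leading digits. */
    static String generationOf(LocalDateTime t, int digits) {
        return TS.format(t).substring(0, digits);
    }

    /**
     * Returns true when the generation changed, which is the point at
     * which existing logs would be closed and new ones started.
     * The first call always reports a change.
     */
    boolean observe(LocalDateTime now) {
        String nowGeneration = generationOf(now, rotationDigits);
        if (nowGeneration.equals(logGeneration)) {
            return false;
        }
        logGeneration = nowGeneration;
        return true;
    }

    public static void main(String[] args) {
        // With 10 digits the prefix covers year through hour.
        GenerationSketch g = new GenerationSketch(10);
        System.out.println(g.observe(LocalDateTime.now()));
        System.out.println(g.observe(LocalDateTime.now()));
    }
}
```

Under the assumed format, keeping 10 digits makes the prefix change once per hour and keeping 8 digits once per day; the value 10 here is only an example, not the documented default.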

map

protected abstract java.lang.String map(CandidateURI cauri)
Look up the crawler node name to which the given CandidateURI should be mapped.

Parameters:
cauri - CandidateURI to consider
Returns:
String node name which should handle URI
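
One plausible shape for a map() strategy is to hash a class key onto a fixed number of crawler names. The sketch below is a hypothetical illustration only: it is not the HashCrawlMapper implementation, it takes a plain String key instead of a CandidateURI, and the "crawler-N" naming scheme is invented for the example.

```java
/** Illustrative only: a hash-based key-to-crawler mapping. */
class HashMapSketch {
    private final int crawlerCount;

    HashMapSketch(int crawlerCount) {
        if (crawlerCount < 1) {
            throw new IllegalArgumentException("crawlerCount must be positive");
        }
        this.crawlerCount = crawlerCount;
    }

    /** Map a class key to a crawler name; the same key always yields the same name. */
    String map(String classKey) {
        // floorMod keeps the bucket non-negative even for negative hash codes.
        int bucket = Math.floorMod(classKey.hashCode(), crawlerCount);
        return "crawler-" + bucket; // hypothetical naming scheme
    }

    public static void main(String[] args) {
        HashMapSketch m = new HashMapSketch(4);
        System.out.println(m.map("example.com"));
    }
}
```

Because every crawler applies the same deterministic function, each one can decide independently whether a URI is its own or belongs to a peer.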

divertLog

protected void divertLog(CandidateURI cauri,
                         java.lang.String target)
Note the given CandidateURI in the appropriate diversion log.

Parameters:
cauri - CandidateURI to append to a diversion log
target - String node name (log name) to receive URI

getDiversionLog

protected java.io.PrintWriter getDiversionLog(java.lang.String target)
Get the diversion log for a given target crawler node.

Parameters:
target - crawler node name of requested log
Returns:
PrintWriter open on an appropriately-named log file
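
A lazily populated map of PrintWriters, keyed by target name, might look like the following sketch. The class DiversionLogSketch and the file naming scheme (generation prefix, "-to-", target, ".divert") are assumptions for illustration; the page does not state how log files are actually named.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: not the CrawlMapper implementation. */
class DiversionLogSketch implements AutoCloseable {
    private final Path dir;
    private final String generation;
    /** Target crawler name to open log writer. */
    private final Map<String, PrintWriter> logs = new HashMap<>();

    DiversionLogSketch(Path dir, String generation) {
        this.dir = dir;
        this.generation = generation;
    }

    /** Hypothetical file naming scheme. */
    Path pathFor(String target) {
        return dir.resolve(generation + "-to-" + target + ".divert");
    }

    /** Return the writer for a target, opening its file on first use. */
    PrintWriter getDiversionLog(String target) {
        return logs.computeIfAbsent(target, t -> {
            try {
                return new PrintWriter(
                        Files.newBufferedWriter(pathFor(t), StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    /** Append one URI to the log for the given target. */
    void divert(String uri, String target) {
        getDiversionLog(target).println(uri);
    }

    /** Close all open logs, as would happen when the generation changes. */
    @Override
    public void close() {
        logs.values().forEach(PrintWriter::close);
        logs.clear();
    }
}
```

Opening each file only on first use means no empty logs are created for targets that never receive a URI.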

initialTasks

protected void initialTasks()
Description copied from class: Processor
Classes subclassing this one should override this method to perform processor specific actions.

This method is guaranteed to be called after the crawl is set up, but before any URI processing has occurred.

Overrides:
initialTasks in class Processor


Copyright © 2003-2011 Internet Archive. All Rights Reserved.