org.archive.crawler.processor
Class LexicalCrawlMapper

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.processor.CrawlMapper
                          extended by org.archive.crawler.processor.LexicalCrawlMapper
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, FetchStatusCodes

public class LexicalCrawlMapper
extends CrawlMapper

A simple crawl splitter/mapper, dividing up CandidateURIs/CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers). May operate on a CrawlURI (typically early in the processing chain) or its CandidateURI outlinks (late in the processing chain, after LinksScoper), or both (if inserted and configured in both places).

Uses lexical comparisons of classKeys to map URIs to crawlers. The 'map' is specified via either a local or HTTP-fetchable file. Each line of this file should contain two space-separated tokens, the first a key and the second a crawler node name (which should be legal as part of a filename). Each URI is mapped to the crawler node name associated with the nearest mapping key equal to or after the URI's own classKey. If no mapping key is equal to or after the classKey, the mapping 'wraps around' to the first mapping key.

One crawler name is distinguished as the 'local name'; URIs mapped to this name are not diverted, but continue to be processed normally.

For example, assume a SurtAuthorityQueueAssignmentPolicy and a simple mapping file:

  d crawlerA
  ~ crawlerB
 

All URIs with "com," classKeys will find the 'd' key as the nearest subsequent mapping key, and thus be mapped to 'crawlerA'. If that's the 'local name', the URIs will be processed normally; otherwise, they will be written to a diversion log intended for 'crawlerA'.
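
The lookup rule described above can be sketched with a standalone TreeMap. This is an illustrative sketch only, not the class's actual source: the class name LexicalLookupSketch, the helper lookup, and the sample classKeys are hypothetical, and the real map(CandidateURI) method may be implemented differently.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the nearest-key-with-wraparound rule; not Heritrix code.
class LexicalLookupSketch {

    /**
     * Returns the node name for the nearest mapping key equal to or after
     * classKey, wrapping around to the first key when none qualifies.
     */
    static String lookup(TreeMap<String, String> map, String classKey) {
        if (map.isEmpty()) {
            throw new IllegalStateException("mapping must contain at least one key");
        }
        // ceilingEntry: least key greater than or equal to classKey, or null.
        Map.Entry<String, String> entry = map.ceilingEntry(classKey);
        if (entry == null) {
            entry = map.firstEntry(); // wrap around to the first mapping key
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        // The two-line mapping file from the example above.
        TreeMap<String, String> map = new TreeMap<>();
        map.put("d", "crawlerA");
        map.put("~", "crawlerB");

        String localName = "crawlerA"; // assumed 'local name' for this illustration

        // Sample classKeys (assumed values); "~zzz" sorts after every mapping key.
        for (String classKey : new String[] {"com,example,", "org,archive,", "~zzz"}) {
            String node = lookup(map, classKey);
            String action = node.equals(localName) ? "process normally" : "divert";
            System.out.println(classKey + " -> " + node + " (" + action + ")");
        }
    }
}
```

Here "com,example," sorts before 'd' and resolves to crawlerA, "org,archive," falls between 'd' and '~' and resolves to crawlerB, and "~zzz" has no key at or after it, so it wraps around to crawlerA.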

When using the JMX importUris operation to import URLs dropped by a LexicalCrawlMapper instance, use the recoveryLog style.

Version:
$Date: 2006-09-26 20:38:48 +0000 (Tue, 26 Sep 2006) $, $Revision: 4667 $
Author:
gojomo
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_MAP_SOURCE
          where to load map from
static java.lang.String DEFAULT_MAP_SOURCE
           
(package private)  java.util.TreeMap<java.lang.String,java.lang.String> map
          Mapping of classKey ranges (as represented by their start) to crawlers (by abstract name/filename)
 
Fields inherited from class org.archive.crawler.processor.CrawlMapper
ATTR_CHECK_OUTLINKS, ATTR_CHECK_URI, ATTR_DIVERSION_DIR, ATTR_LOCAL_NAME, ATTR_MAP_OUTLINK_DECIDE_RULES, ATTR_ROTATION_DIGITS, cache, DEFAULT_CHECK_OUTLINKS, DEFAULT_CHECK_URI, DEFAULT_DIVERSION_DIR, DEFAULT_LOCAL_NAME, DEFAULT_ROTATION_DIGITS, diversionLogs, localName, logGeneration
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
 
Constructor Summary
LexicalCrawlMapper(java.lang.String name)
          Constructor.
 
Method Summary
protected  void initialTasks()
          Classes subclassing this one should override this method to perform processor specific actions.
protected  void loadMap()
          Retrieve and parse the mapping specification from a local path or HTTP URL.
protected  java.lang.String map(CandidateURI cauri)
          Look up the crawler node name to which the given CandidateURI should be mapped.
 
Methods inherited from class org.archive.crawler.processor.CrawlMapper
decideToMapOutlink, divertLog, getDiversionLog, getMapOutlinkDecideRule, innerProcess, updateGeneration
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

ATTR_MAP_SOURCE

public static final java.lang.String ATTR_MAP_SOURCE
where to load map from

See Also:
Constant Field Values

DEFAULT_MAP_SOURCE

public static final java.lang.String DEFAULT_MAP_SOURCE
See Also:
Constant Field Values

map

java.util.TreeMap<java.lang.String,java.lang.String> map
Mapping of classKey ranges (as represented by their start) to crawlers (by abstract name/filename)

Constructor Detail

LexicalCrawlMapper

public LexicalCrawlMapper(java.lang.String name)
Constructor.

Parameters:
name - Name of this processor.
Method Detail

map

protected java.lang.String map(CandidateURI cauri)
Look up the crawler node name to which the given CandidateURI should be mapped.

Specified by:
map in class CrawlMapper
Parameters:
cauri - CandidateURI to consider
Returns:
String node name which should handle URI

initialTasks

protected void initialTasks()
Description copied from class: Processor
Classes subclassing this one should override this method to perform processor specific actions.

This method is guaranteed to be called after the crawl is set up, but before any URI-processing has occurred.

Overrides:
initialTasks in class CrawlMapper

loadMap

protected void loadMap()
                throws java.io.IOException
Retrieve and parse the mapping specification from a local path or HTTP URL.

Throws:
java.io.IOException
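
As a rough illustration of the parsing step, the sketch below reads a mapping specification (one "key nodeName" pair per line) from any Reader into a TreeMap. It is not the actual loadMap() source. The class name MapSpecParserSketch and the method parse are hypothetical, and skipping blank lines, splitting on any run of whitespace, and rejecting malformed lines with an IOException are assumptions made for this example. Fetching the specification from a local path or HTTP URL is left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.TreeMap;

// Hypothetical sketch of parsing a "key nodeName" mapping file; not Heritrix code.
class MapSpecParserSketch {

    /** Parses lines of two whitespace-separated tokens into a sorted map. */
    static TreeMap<String, String> parse(Reader source) throws IOException {
        TreeMap<String, String> map = new TreeMap<>();
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // assumption: blank lines are ignored
            }
            String[] tokens = trimmed.split("\\s+");
            if (tokens.length != 2) {
                // assumption: malformed lines are treated as errors
                throw new IOException("Expected 'key nodeName' but got: " + line);
            }
            map.put(tokens[0], tokens[1]); // first token is the key, second the node name
        }
        return map;
    }

    public static void main(String[] args) throws IOException {
        // The sample mapping from the class description, supplied inline.
        String spec = "d crawlerA\n~ crawlerB\n";
        TreeMap<String, String> map = parse(new StringReader(spec));
        System.out.println(map);
    }
}
```

Because the result is a TreeMap, the keys are kept in lexical order no matter how the lines are ordered in the file, which is what the nearest-key lookup relies on.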


Copyright © 2003-2011 Internet Archive. All Rights Reserved.