LexicalCrawlMapper (Heritrix 1.15.5-201106092337)

org.archive.crawler.processor
Class LexicalCrawlMapper

java.lang.Object
  javax.management.Attribute
      org.archive.crawler.settings.Type
          org.archive.crawler.settings.ComplexType
              org.archive.crawler.settings.ModuleType
                  org.archive.crawler.framework.Processor
                      org.archive.crawler.processor.CrawlMapper
                          org.archive.crawler.processor.LexicalCrawlMapper

All Implemented Interfaces: java.io.Serializable, javax.management.DynamicMBean, FetchStatusCodes

public class LexicalCrawlMapper
extends CrawlMapper

A simple crawl splitter/mapper, dividing up CandidateURIs/CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers). May operate on a CrawlURI (typically early in the processing chain) or its CandidateURI outlinks (late in the processing chain, after LinksScoper), or both (if inserted and configured in both places).

Uses lexical comparisons of classKeys to map URIs to crawlers. The 'map' is specified via either a local or HTTP-fetchable file. Each line of this file should contain two space-separated tokens, the first a key and the second a crawler node name (which should be legal as part of a filename). Each URI is mapped to the crawler node name associated with the nearest mapping key equal to or subsequent to the URI's own classKey. If no mapping key is equal to or after the classKey, the mapping 'wraps around' to the first mapping key.
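The map file format described above can be sketched as follows. This is a minimal illustration, not the actual loadMap() implementation: the class name MapFileSketch, the parse helper, and the handling of blank or malformed lines are all assumptions made for the example.

```java
import java.util.TreeMap;

// Hypothetical helper illustrating the two-token map file format.
// Not part of Heritrix; names and error handling are assumptions.
class MapFileSketch {

    /** Parse lines of the form "key nodeName" into a lexically sorted map. */
    static TreeMap<String, String> parse(String text) {
        TreeMap<String, String> map = new TreeMap<>();
        for (String rawLine : text.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue; // assumption: blank lines are skipped
            }
            String[] tokens = line.split("\\s+");
            if (tokens.length != 2) {
                // assumption: anything other than two tokens is rejected
                throw new IllegalArgumentException(
                        "expected 'key nodeName', got: " + line);
            }
            map.put(tokens[0], tokens[1]); // key -> crawler node name
        }
        return map;
    }

    public static void main(String[] args) {
        TreeMap<String, String> map = parse("d crawlerA\n~ crawlerB\n");
        System.out.println(map); // prints {d=crawlerA, ~=crawlerB}
    }
}
```

A TreeMap keeps the keys in lexical order, which is what makes the nearest-subsequent-key lookup cheap.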

One crawler name is distinguished as the 'local name'; URIs mapped to this name are not diverted, but continue to be processed normally.

For example, assume a SurtAuthorityQueueAssignmentPolicy and a simple mapping file:

  d crawlerA
  ~ crawlerB

All URIs with "com," classKeys will find the 'd' key as the nearest subsequent mapping key, and thus be mapped to 'crawlerA'. If that's the 'local name', the URIs will be processed normally; otherwise, the URI will be written to a diversion log aimed for 'crawlerA'.
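The lookup rule in this example can be sketched with a plain java.util.TreeMap. This is an illustrative reconstruction of the documented behavior rather than the class's actual map() method; the class name LexicalLookupSketch, the lookup helper, and the sample classKeys are assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the documented lookup rule; not Heritrix code.
class LexicalLookupSketch {

    /**
     * Return the node name for the nearest mapping key that is equal to
     * or after classKey, wrapping around to the first key if none is.
     */
    static String lookup(TreeMap<String, String> map, String classKey) {
        if (map.isEmpty()) {
            throw new IllegalStateException("mapping is empty");
        }
        Map.Entry<String, String> entry = map.ceilingEntry(classKey);
        if (entry == null) {
            entry = map.firstEntry(); // no key at or after classKey: wrap around
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        TreeMap<String, String> map = new TreeMap<>();
        map.put("d", "crawlerA");
        map.put("~", "crawlerB");

        // Sample classKeys are illustrative SURT-style authorities.
        System.out.println(lookup(map, "com,example,")); // prints crawlerA
        System.out.println(lookup(map, "org,archive,")); // prints crawlerB
        System.out.println(lookup(map, "~~"));           // prints crawlerA (wraps around)
    }
}
```

With this mapping, keys up to and including "d" go to crawlerA, keys after "d" up to and including "~" go to crawlerB, and anything sorting after "~" wraps back to crawlerA.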

When using the JMX importUris operation to import URIs diverted by a LexicalCrawlMapper instance, use the recoveryLog style.

Version: $Date: 2006-09-26 20:38:48 +0000 (Tue, 26 Sep 2006) $, $Revision: 4667 $
Author: gojomo
See Also: Serialized Form

Nested Class Summary

Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
`ComplexType.MBeanAttributeInfoIterator`

Field Summary
`static java.lang.String`	`ATTR_MAP_SOURCE` where to load map from
`static java.lang.String`	`DEFAULT_MAP_SOURCE`
`(package private) java.util.TreeMap<java.lang.String,java.lang.String>`	`map` Mapping of classKey ranges (as represented by their start) to crawlers (by abstract name/filename)

Fields inherited from class org.archive.crawler.processor.CrawlMapper
`ATTR_CHECK_OUTLINKS, ATTR_CHECK_URI, ATTR_DIVERSION_DIR, ATTR_LOCAL_NAME, ATTR_MAP_OUTLINK_DECIDE_RULES, ATTR_ROTATION_DIGITS, cache, DEFAULT_CHECK_OUTLINKS, DEFAULT_CHECK_URI, DEFAULT_DIVERSION_DIR, DEFAULT_LOCAL_NAME, DEFAULT_ROTATION_DIGITS, diversionLogs, localName, logGeneration`

Fields inherited from class org.archive.crawler.framework.Processor
`ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules`

Fields inherited from class org.archive.crawler.settings.ComplexType
`definition, definitionMap`

Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
`S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE`

Constructor Summary
`LexicalCrawlMapper(java.lang.String name)` Constructor.

Method Summary
`protected void`	`initialTasks()` Classes subclassing this one should override this method to perform processor-specific actions.
`protected void`	`loadMap()` Retrieve and parse the mapping specification from a local path or HTTP URL.
`protected java.lang.String`	`map(CandidateURI cauri)` Look up the crawler node name to which the given CandidateURI should be mapped.

Methods inherited from class org.archive.crawler.processor.CrawlMapper
`decideToMapOutlink, divertLog, getDiversionLog, getMapOutlinkDecideRule, innerProcess, updateGeneration`

Methods inherited from class org.archive.crawler.framework.Processor
`checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn`

Methods inherited from class org.archive.crawler.settings.ModuleType
`addElement, listUsedFiles`

Methods inherited from class org.archive.crawler.settings.ComplexType
`addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute`

Methods inherited from class org.archive.crawler.settings.Type
`addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient`

Methods inherited from class javax.management.Attribute
`getName, hashCode`

Methods inherited from class java.lang.Object
`clone, finalize, getClass, notify, notifyAll, wait, wait, wait`

Field Detail

ATTR_MAP_SOURCE

public static final java.lang.String ATTR_MAP_SOURCE

where to load map from

See Also: Constant Field Values

DEFAULT_MAP_SOURCE

public static final java.lang.String DEFAULT_MAP_SOURCE

See Also: Constant Field Values

map

java.util.TreeMap<java.lang.String,java.lang.String> map

Mapping of classKey ranges (as represented by their start) to crawlers (by abstract name/filename)

Constructor Detail