java.lang.Object
  javax.management.Attribute
    org.archive.crawler.settings.Type
      org.archive.crawler.settings.ComplexType
        org.archive.crawler.settings.ModuleType
          org.archive.crawler.framework.Processor
            org.archive.crawler.processor.CrawlMapper
              org.archive.crawler.processor.LexicalCrawlMapper
public class LexicalCrawlMapper
A simple crawl splitter/mapper, dividing up CandidateURIs/CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers). May operate on a CrawlURI (typically early in the processing chain) or its CandidateURI outlinks (late in the processing chain, after LinksScoper), or both (if inserted and configured in both places).
Uses lexical comparisons of classKeys to map URIs to crawlers. The 'map' is specified via either a local or HTTP-fetchable file. Each line of this file should contain two space-separated tokens, the first a key and the second a crawler node name (which should be legal as part of a filename). Each URI is mapped to the crawler node name associated with the nearest mapping key equal to or subsequent to the URI's own classKey. If no mapping key is equal to or after the classKey, the mapping 'wraps around' to the first mapping key.
One crawler name is distinguished as the 'local name'; URIs mapped to this name are not diverted, but continue to be processed normally.
For example, assume a SurtAuthorityQueueAssignmentPolicy and a simple mapping file:
  d crawlerA
  ~ crawlerB
All URIs with "com," classKeys will find the 'd' key as the nearest subsequent mapping key, and thus be mapped to 'crawlerA'. If that's the 'local name', the URIs will be processed normally; otherwise, they will be written to a diversion log intended for 'crawlerA'.
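The lookup rule described above (nearest key equal to or after the classKey, wrapping around to the first key) can be sketched with a plain java.util.TreeMap. This is a minimal illustration and not the class's actual source: the class name LexicalLookupSketch and the helper lookup are hypothetical, and the classKeys are assumed to look like the SURT-style strings in the example.

```java
import java.util.TreeMap;

// Hypothetical sketch of the lexical lookup rule; not the Heritrix implementation.
class LexicalLookupSketch {

    /**
     * Returns the crawler node name for the nearest mapping key that is
     * equal to or lexically after classKey, wrapping to the first key
     * when no such key exists.
     */
    static String lookup(TreeMap<String, String> map, String classKey) {
        if (map.isEmpty()) {
            throw new IllegalStateException("mapping is empty");
        }
        // ceilingKey: least key >= classKey, or null if there is none
        String key = map.ceilingKey(classKey);
        if (key == null) {
            key = map.firstKey(); // wrap around
        }
        return map.get(key);
    }

    public static void main(String[] args) {
        // The two-line mapping file from the example above
        TreeMap<String, String> map = new TreeMap<>();
        map.put("d", "crawlerA");
        map.put("~", "crawlerB");

        System.out.println(lookup(map, "com,example,")); // crawlerA
        System.out.println(lookup(map, "org,archive,")); // crawlerB
        System.out.println(lookup(map, "~~"));           // crawlerA (wraps around)
    }
}
```

A sorted map keeps the keys in the same lexical order the comparison relies on, so each lookup is a single ceiling query plus one fallback.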
When using the JMX importUris operation to import URIs dropped by a LexicalCrawlMapper instance, use the recoveryLog style.
Nested Class Summary |
---|
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType |
---|
ComplexType.MBeanAttributeInfoIterator |
Field Summary | |
---|---|
static java.lang.String | ATTR_MAP_SOURCE - where to load map from |
static java.lang.String | DEFAULT_MAP_SOURCE |
(package private) java.util.TreeMap<java.lang.String,java.lang.String> | map - Mapping of classKey ranges (as represented by their start) to crawlers (by abstract name/filename) |
Fields inherited from class org.archive.crawler.processor.CrawlMapper |
---|
ATTR_CHECK_OUTLINKS, ATTR_CHECK_URI, ATTR_DIVERSION_DIR, ATTR_LOCAL_NAME, ATTR_MAP_OUTLINK_DECIDE_RULES, ATTR_ROTATION_DIGITS, cache, DEFAULT_CHECK_OUTLINKS, DEFAULT_CHECK_URI, DEFAULT_DIVERSION_DIR, DEFAULT_LOCAL_NAME, DEFAULT_ROTATION_DIGITS, diversionLogs, localName, logGeneration |
Fields inherited from class org.archive.crawler.framework.Processor |
---|
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules |
Fields inherited from class org.archive.crawler.settings.ComplexType |
---|
definition, definitionMap |
Constructor Summary | |
---|---|
LexicalCrawlMapper(java.lang.String name) - Constructor. |
Method Summary | |
---|---|
protected void | initialTasks() - Classes subclassing this one should override this method to perform processor-specific actions. |
protected void | loadMap() - Retrieve and parse the mapping specification from a local path or HTTP URL. |
protected java.lang.String | map(CandidateURI cauri) - Look up the crawler node name to which the given CandidateURI should be mapped. |
Methods inherited from class org.archive.crawler.processor.CrawlMapper |
---|
decideToMapOutlink, divertLog, getDiversionLog, getMapOutlinkDecideRule, innerProcess, updateGeneration |
Methods inherited from class org.archive.crawler.framework.Processor |
---|
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn |
Methods inherited from class org.archive.crawler.settings.ModuleType |
---|
addElement, listUsedFiles |
Methods inherited from class org.archive.crawler.settings.Type |
---|
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
Methods inherited from class javax.management.Attribute |
---|
getName, hashCode |
Methods inherited from class java.lang.Object |
---|
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
public static final java.lang.String ATTR_MAP_SOURCE
public static final java.lang.String DEFAULT_MAP_SOURCE
java.util.TreeMap<java.lang.String,java.lang.String> map
Constructor Detail |
---|
public LexicalCrawlMapper(java.lang.String name)
name - Name of this processor.
Method Detail |
---|
protected java.lang.String map(CandidateURI cauri)
map in class CrawlMapper
cauri - CandidateURI to consider
protected void initialTasks()
Processor
This method is guaranteed to be called after the crawl is set up, but before any URI processing has occurred.
initialTasks in class CrawlMapper
protected void loadMap() throws java.io.IOException
java.io.IOException
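As an illustration of the parsing step that loadMap() is described as performing, the following sketch reads lines of two space-separated tokens (key, then crawler node name) into a TreeMap. It is an assumption-laden example and not the actual method body: the class MapFileSketch and the method parse are hypothetical, the source is taken as any java.io.Reader in place of the real local-path or HTTP retrieval, and skipping blank lines and rejecting malformed ones are choices made here, not behavior documented for this class.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.TreeMap;

// Hypothetical sketch of parsing a mapping specification; not the Heritrix implementation.
class MapFileSketch {

    /** Parses "key nodeName" lines into a sorted map of key to crawler node name. */
    static TreeMap<String, String> parse(Reader source) throws IOException {
        TreeMap<String, String> map = new TreeMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // assumption: blank lines are ignored
                }
                String[] tokens = line.split("\\s+");
                if (tokens.length != 2) {
                    // assumption: anything other than two tokens is an error
                    throw new IOException("Expected 'key nodeName', got: " + line);
                }
                map.put(tokens[0], tokens[1]);
            }
        }
        return map;
    }

    public static void main(String[] args) throws IOException {
        // The example mapping file from the class description
        TreeMap<String, String> map = parse(new StringReader("d crawlerA\n~ crawlerB\n"));
        System.out.println(map.get("d"));
        System.out.println(map.get("~"));
    }
}
```

Taking a Reader keeps the sketch independent of where the specification comes from; a caller could wrap a file or an HTTP response stream in the same way.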