org.archive.crawler.postprocessor
Class LinksScoper
java.lang.Object
javax.management.Attribute
org.archive.crawler.settings.Type
org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ModuleType
org.archive.crawler.framework.Processor
org.archive.crawler.framework.Scoper
org.archive.crawler.postprocessor.LinksScoper
- All Implemented Interfaces:
- java.io.Serializable, javax.management.DynamicMBean, FetchStatusCodes
public class LinksScoper
- extends Scoper
- implements FetchStatusCodes
Determine which extracted links are within scope.
TODO: Testing scope currently requires converting each Link to
a CandidateURI. Change this so that a Link can be tested for scope
without first creating a CandidateURI.
Since this scoper has to create CandidateURIs anyway, there is no sense in
discarding them: later in the processing chain, CandidateURIs rather
than Links are what is needed to schedule extracted links with the
Frontier (Frontier#schedule expects a CandidateURI, not a Link). This class
therefore replaces each Link in the CrawlURI with the CandidateURI that wraps it.
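The flow described above can be sketched roughly as follows. This is a minimal illustration, not the Heritrix implementation: the `Link` and `Candidate` classes, the `scopeLinks` method, and the prefix-based scope rule are hypothetical stand-ins, and the real Link, CandidateURI and CrawlURI classes have different APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class LinksScoperSketch {
    /** Hypothetical stand-in for an extracted link. */
    static final class Link {
        final String destination;
        Link(String destination) { this.destination = destination; }
    }

    /** Hypothetical stand-in for a CandidateURI that wraps a Link. */
    static final class Candidate {
        final Link link;
        Candidate(Link link) { this.link = link; }
        String uri() { return link.destination; }
    }

    /**
     * Wraps every link in a candidate, tests the candidate against the
     * scope, and returns only the in-scope candidates. A caller would
     * store the result in place of the original links.
     */
    static List<Candidate> scopeLinks(List<Link> links, Predicate<Candidate> inScope) {
        List<Candidate> kept = new ArrayList<>();
        for (Link link : links) {
            // The candidate is created only because the scope test needs it.
            Candidate candidate = new Candidate(link);
            if (inScope.test(candidate)) {
                // Keep it for later scheduling instead of discarding it.
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Link> links = Arrays.asList(
            new Link("http://example.org/a"),
            new Link("http://other.example.com/b"));
        // Assumed scope rule for this demo: keep only example.org URIs.
        List<Candidate> kept =
            scopeLinks(links, c -> c.uri().startsWith("http://example.org/"));
        for (Candidate c : kept) {
            System.out.println(c.uri());
        }
    }
}
```

Returning the candidates rather than a filtered list of links mirrors the reasoning in the class description: the wrapper has already been built, so it is reused downstream.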
- Author:
- gojomo, stack
- See Also:
- Serialized Form
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, getController, getDecideRule, getDefaultNextProcessor, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
Methods inherited from class javax.management.Attribute
getName, hashCode
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
ATTR_REJECTLOG_DECIDE_RULES
public static final java.lang.String ATTR_REJECTLOG_DECIDE_RULES
- See Also:
- Constant Field Values
ATTR_PREFERENCE_DEPTH_HOPS
public static final java.lang.String ATTR_PREFERENCE_DEPTH_HOPS
- See Also:
- Constant Field Values
LinksScoper
public LinksScoper(java.lang.String name)
- Parameters:
name - Name of this filter.
innerProcess
protected void innerProcess(CrawlURI curi)
- Description copied from class: Processor
- Classes subclassing this one should override this method to perform
their custom actions on the CrawlURI.
- Overrides:
innerProcess in class Processor
- Parameters:
curi - The CrawlURI being processed.
handlePrerequisite
protected void handlePrerequisite(CrawlURI curi)
- The CrawlURI has a prerequisite; apply scoping and convert its
Link to a CandidateURI in a manner analogous to outlink handling.
- Parameters:
curi - CrawlURI with a prerequisite to consider
outOfScope
protected void outOfScope(CandidateURI caUri)
- Description copied from class: Scoper
- Called when a CandidateURI is ruled out of scope.
Override if you don't want the log entries to appear as coming from this class.
- Overrides:
outOfScope in class Scoper
- Parameters:
caUri - CandidateURI that is out of scope.
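The override hook described for outOfScope can be illustrated with a small, self-contained sketch. Everything here is hypothetical: `BaseScoper`, `QuietScoper`, the `reject` helper and the in-memory `log` list merely stand in for Scoper, a subclass, and the real logging, whose details are not shown on this page.

```java
import java.util.ArrayList;
import java.util.List;

final class OutOfScopeHookSketch {
    /** Hypothetical stand-in for Scoper: the default hook logs under its own name. */
    static class BaseScoper {
        final List<String> log = new ArrayList<>();

        /** Hook invoked when a candidate is ruled out of scope. */
        protected void outOfScope(String candidateUri) {
            log.add("BaseScoper rejected " + candidateUri);
        }

        /** Hypothetical driver: runs the hook for a candidate that failed the scope test. */
        final void reject(String candidateUri) {
            outOfScope(candidateUri);
        }
    }

    /** Subclass that overrides the hook so rejections are attributed to itself. */
    static class QuietScoper extends BaseScoper {
        @Override
        protected void outOfScope(String candidateUri) {
            log.add("QuietScoper rejected " + candidateUri);
        }
    }

    public static void main(String[] args) {
        BaseScoper scoper = new QuietScoper();
        scoper.reject("http://example.com/x");
        System.out.println(scoper.log.get(0));
    }
}
```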
getRejectLogRules
protected DecideRule getRejectLogRules(java.lang.Object o)
getSchedulingFor
protected int getSchedulingFor(CrawlURI curi,
Link wref,
int preferenceDepthHops)
- Determine scheduling for the curi.
As with the LinksScoper in general, this only handles extracted links.
Seeds do not pass through here; they are given MEDIUM priority.
Imports into the frontier similarly do not pass through here;
they are given NORMAL priority.
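This page does not say how the priority is actually chosen, so the following is only a sketch of one way a decision based on preferenceDepthHops could be structured. The `Priority` enum, the `isRedirect` flag, the `hopsOfSource` counter and the rule itself are assumptions made for illustration; they are not taken from the Heritrix source.

```java
final class SchedulingSketch {
    /** Hypothetical priority levels; the names echo those mentioned in the text. */
    enum Priority { HIGH, MEDIUM, NORMAL }

    /**
     * Picks a priority for an extracted link.
     * Assumed rule (not from the source): redirects get MEDIUM; links that
     * stay within preferenceDepthHops of the start get HIGH; a negative
     * preferenceDepthHops disables the preference; everything else is NORMAL.
     */
    static Priority schedulingFor(int hopsOfSource, boolean isRedirect,
                                  int preferenceDepthHops) {
        if (isRedirect) {
            return Priority.MEDIUM;
        }
        // The link itself adds one hop to the path of the page it was found on.
        if (preferenceDepthHops >= 0 && hopsOfSource + 1 <= preferenceDepthHops) {
            return Priority.HIGH;
        }
        return Priority.NORMAL;
    }

    public static void main(String[] args) {
        System.out.println(schedulingFor(0, true, 2));
        System.out.println(schedulingFor(1, false, 2));
        System.out.println(schedulingFor(2, false, 2));
    }
}
```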
Copyright © 2003-2011 Internet Archive. All Rights Reserved.