org.archive.crawler.postprocessor
Class LinksScoper

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.framework.Scoper
                          extended by org.archive.crawler.postprocessor.LinksScoper
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, FetchStatusCodes

public class LinksScoper
extends Scoper
implements FetchStatusCodes

Determines which extracted links are within scope. TODO: Testing scope currently requires converting each Link to a CandidateURI. Make it possible to test whether a Link is in scope without creating a CandidateURI.

Because this scoper has to create CandidateURIs anyway, there is no sense in discarding them: later in the processing chain, CandidateURIs rather than Links are what is needed to schedule extracted links with the Frontier (Frontier#schedule expects a CandidateURI, not a Link). This class therefore replaces each Link in the CrawlURI with the CandidateURI that wraps it.
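
The wrap-then-keep idea described above can be sketched in isolation. The sketch below does not use the Heritrix classes: SimpleLink, SimpleCandidate, and scopeLinks are hypothetical stand-ins, and dropping out-of-scope candidates from the returned list is an assumption of the sketch, not something this page states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class LinkScopingSketch {

    /** Hypothetical stand-in for an extracted link; not the Heritrix Link class. */
    public static final class SimpleLink {
        public final String destination;

        public SimpleLink(String destination) {
            this.destination = destination;
        }
    }

    /** Hypothetical stand-in for a candidate that wraps a link. */
    public static final class SimpleCandidate {
        public final SimpleLink link;

        public SimpleCandidate(SimpleLink link) {
            this.link = link;
        }
    }

    /**
     * Wraps each link in a candidate, tests the candidate against the scope,
     * and returns the in-scope candidates so they can replace the links.
     * Dropping out-of-scope candidates is an assumption of this sketch.
     */
    public static List<SimpleCandidate> scopeLinks(
            List<SimpleLink> links, Predicate<SimpleCandidate> inScope) {
        List<SimpleCandidate> kept = new ArrayList<>();
        for (SimpleLink link : links) {
            // The wrapper is created once and kept, not thrown away after the test.
            SimpleCandidate candidate = new SimpleCandidate(link);
            if (inScope.test(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<SimpleLink> links = new ArrayList<>();
        links.add(new SimpleLink("http://example.org/a"));
        links.add(new SimpleLink("http://other.example.com/b"));

        // Hypothetical scope rule: keep only example.org destinations.
        List<SimpleCandidate> kept = scopeLinks(
                links, c -> c.link.destination.startsWith("http://example.org/"));

        for (SimpleCandidate candidate : kept) {
            System.out.println(candidate.link.destination);
        }
    }
}
```

The point of the design is that the object built for the scope test is the same object a later scheduling step would consume, so no second conversion is needed.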

Author:
gojomo, stack
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_PREFERENCE_DEPTH_HOPS
           
static java.lang.String ATTR_REJECTLOG_DECIDE_RULES
           
 
Fields inherited from class org.archive.crawler.framework.Scoper
ATTR_OVERRIDE_LOGGER_ENABLED
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
 
Constructor Summary
LinksScoper(java.lang.String name)
           
 
Method Summary
protected  DecideRule getRejectLogRules(java.lang.Object o)
           
protected  int getSchedulingFor(CrawlURI curi, Link wref, int preferenceDepthHops)
          Determine scheduling for the curi.
protected  void handlePrerequisite(CrawlURI curi)
          The CrawlURI has a prerequisite; apply scoping and update Link to CandidateURI in a manner analogous to outlink handling.
protected  void innerProcess(CrawlURI curi)
          Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.
protected  void outOfScope(CandidateURI caUri)
          Called when a CandidateURI is ruled out of scope.
 
Methods inherited from class org.archive.crawler.framework.Scoper
finalTasks, initialTasks, isInScope, isOverrideLogger
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, getController, getDecideRule, getDefaultNextProcessor, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

ATTR_REJECTLOG_DECIDE_RULES

public static final java.lang.String ATTR_REJECTLOG_DECIDE_RULES
See Also:
Constant Field Values

ATTR_PREFERENCE_DEPTH_HOPS

public static final java.lang.String ATTR_PREFERENCE_DEPTH_HOPS
See Also:
Constant Field Values
Constructor Detail

LinksScoper

public LinksScoper(java.lang.String name)
Parameters:
name - Name of this filter.
Method Detail

innerProcess

protected void innerProcess(CrawlURI curi)
Description copied from class: Processor
Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.

Overrides:
innerProcess in class Processor
Parameters:
curi - The CrawlURI being processed.

handlePrerequisite

protected void handlePrerequisite(CrawlURI curi)
The CrawlURI has a prerequisite; apply scoping and update Link to CandidateURI in a manner analogous to outlink handling.

Parameters:
curi - CrawlURI with prereq to consider

outOfScope

protected void outOfScope(CandidateURI caUri)
Description copied from class: Scoper
Called when a CandidateURI is ruled out of scope. Override this method if you do not want the log entries to appear as coming from this class.

Overrides:
outOfScope in class Scoper
Parameters:
caUri - CandidateURI that is out of scope.

getRejectLogRules

protected DecideRule getRejectLogRules(java.lang.Object o)

getSchedulingFor

protected int getSchedulingFor(CrawlURI curi,
                               Link wref,
                               int preferenceDepthHops)
Determine scheduling for the curi. Like the LinksScoper in general, this method handles only extracted links. Seeds do not pass through here; they are given MEDIUM priority. Imports into the frontier likewise do not pass through here; they are given NORMAL priority.
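
The three routes described above can be illustrated with a small standalone sketch. It does not reproduce the logic of getSchedulingFor, which this page does not spell out: the Route and Priority enums, the priorityFor method, and the linkPolicy callback are all hypothetical names introduced for illustration. Only the MEDIUM-for-seeds and NORMAL-for-imports mapping comes from the description.

```java
import java.util.function.Supplier;

public class SchedulingRouteSketch {

    /** Hypothetical labels for how a URI reaches the frontier. */
    public enum Route { SEED, FRONTIER_IMPORT, EXTRACTED_LINK }

    /** Hypothetical priority labels; not the Heritrix scheduling constants. */
    public enum Priority { HIGH, MEDIUM, NORMAL }

    /**
     * Seeds and imports get fixed priorities and never consult the link policy.
     * Only extracted links are handed to the policy, which stands in for
     * whatever the scoper computes for them.
     */
    public static Priority priorityFor(Route route, Supplier<Priority> linkPolicy) {
        switch (route) {
            case SEED:
                return Priority.MEDIUM;
            case FRONTIER_IMPORT:
                return Priority.NORMAL;
            default:
                return linkPolicy.get();
        }
    }

    public static void main(String[] args) {
        // Hypothetical policy that always prefers extracted links.
        Supplier<Priority> policy = () -> Priority.HIGH;
        for (Route route : Route.values()) {
            System.out.println(route + " -> " + priorityFor(route, policy));
        }
    }
}
```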



Copyright © 2003-2011 Internet Archive. All Rights Reserved.