org.archive.crawler.postprocessor
Class Postselector

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.postprocessor.Postselector
All Implemented Interfaces:
CoreAttributeConstants, javax.management.DynamicMBean, FetchStatusCodes, java.io.Serializable

public class Postselector
extends Processor
implements CoreAttributeConstants, FetchStatusCodes

Determines which extracted links (and similar discovered URIs) are fed back into the Frontier. In the future, this class could also control whether the current URI is retried.

Author:
gojomo
See Also:
Serialized Form

Nested Class Summary
 
Nested classes inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_LOG_REJECT_FILTERS
           
static java.lang.String ATTR_LOG_REJECTS_ENABLED
           
static java.lang.String ATTR_SCHEDULE_EMBEDDED_LINKS
           
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_ENABLED, ATTR_FILTERS
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_TYPE, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_HTML_BASE, A_HTTP_TRANSACTION, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION
 
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
 
Constructor Summary
Postselector(java.lang.String name)
           
 
Method Summary
protected  CandidateURI createCandidateURI(CrawlURI curi, Link link)
           
protected  void handlePrerequisites(CrawlURI curi)
           
protected  void initialTasks()
          Classes subclassing this one should override this method to perform processor specific actions.
protected  void innerProcess(CrawlURI curi)
          Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.
 boolean isOverrideEnabled(java.lang.Object context)
           
protected  boolean schedule(CandidateURI caUri)
          Schedule the given CandidateURI with the Frontier.
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, filtersAccept, filtersAccept, finalTasks, getController, getDefaultNextProcessor, innerRejectProcess, isContentToProcess, isExpectedMimeType, isHttpTransactionContentToProcess, process, report, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ATTR_LOG_REJECTS_ENABLED

public static final java.lang.String ATTR_LOG_REJECTS_ENABLED
See Also:
Constant Field Values

ATTR_LOG_REJECT_FILTERS

public static final java.lang.String ATTR_LOG_REJECT_FILTERS
See Also:
Constant Field Values

ATTR_SCHEDULE_EMBEDDED_LINKS

public static final java.lang.String ATTR_SCHEDULE_EMBEDDED_LINKS
See Also:
Constant Field Values
Constructor Detail

Postselector

public Postselector(java.lang.String name)
Parameters:
name - Name of this processor.
Method Detail

initialTasks

protected void initialTasks()
Description copied from class: Processor
Classes subclassing this one should override this method to perform processor specific actions.

This method is guaranteed to be called after the crawl is set up, but before any URI processing has occurred.

Overrides:
initialTasks in class Processor
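
The ordering guarantee described for initialTasks can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in: SketchLifecycleProcessor, RecordingProcessor, and SketchCrawlDriver are invented names and are not Heritrix classes, and a plain String takes the place of CrawlURI. The sketch only models the documented contract (setup hook first, per-URI work afterwards); it makes no claim about how the real crawl controller invokes processors.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a processor base class; this is NOT
// org.archive.crawler.framework.Processor.
abstract class SketchLifecycleProcessor {
    // One-time setup hook; the default does nothing.
    protected void initialTasks() { }

    // Per-URI work; a plain String stands in for CrawlURI here.
    protected abstract void innerProcess(String uri);
}

// Example subclass that records the order in which its hooks run.
class RecordingProcessor extends SketchLifecycleProcessor {
    final List<String> events = new ArrayList<>();

    @Override
    protected void initialTasks() {
        events.add("initialTasks");
    }

    @Override
    protected void innerProcess(String uri) {
        events.add("innerProcess:" + uri);
    }
}

// Hypothetical driver modelling the documented ordering:
// setup runs once, then URIs are processed.
final class SketchCrawlDriver {
    static void run(SketchLifecycleProcessor processor, List<String> uris) {
        processor.initialTasks();
        for (String uri : uris) {
            processor.innerProcess(uri);
        }
    }
}
```

A subclass that needs one-time state (for example, a compiled pattern or an open log) would typically build it in the setup hook so that the per-URI hook can rely on it being present.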

innerProcess

protected void innerProcess(CrawlURI curi)
Description copied from class: Processor
Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.

Overrides:
innerProcess in class Processor
Parameters:
curi - The CrawlURI being processed.
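
As a rough illustration of the kind of custom action a link-feeding processor might perform in innerProcess, the sketch below walks the links extracted from a fetched URI, resolves each one against the source URI, and collects the valid results. The types SketchCrawlUri and SketchLinkFeeder are hypothetical and are not the Heritrix CrawlURI or Postselector. The sketch also uses java.net.URI as a substitute for the commons-httpclient URI handling that createCandidateURI declares, so malformed links surface here as IllegalArgumentException rather than URIException.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a fetched URI plus the links extracted from it;
// this is NOT org.archive.crawler.datamodel.CrawlURI.
final class SketchCrawlUri {
    final URI uri;
    final List<String> outLinks = new ArrayList<>();

    SketchCrawlUri(String uri) {
        this.uri = URI.create(uri);
    }
}

// Hypothetical post-processing step; NOT the Heritrix Postselector implementation.
class SketchLinkFeeder {
    // URIs that would be offered for scheduling.
    final List<URI> resolved = new ArrayList<>();

    // Resolve each extracted link against the source URI and keep the valid ones.
    protected void innerProcess(SketchCrawlUri curi) {
        for (String link : curi.outLinks) {
            try {
                resolved.add(curi.uri.resolve(link));
            } catch (IllegalArgumentException e) {
                // Malformed link: skip it rather than abort processing of this URI.
            }
        }
    }
}
```

Skipping a malformed link instead of failing the whole URI is a design choice of this sketch, not a documented behaviour of the class.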

handlePrerequisites

protected void handlePrerequisites(CrawlURI curi)

schedule

protected boolean schedule(CandidateURI caUri)
Schedule the given CandidateURI with the Frontier.

Parameters:
caUri - The CandidateURI to be scheduled.
Returns:
true if the CandidateURI was accepted by the crawl scope, false otherwise.
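
The return contract of schedule (true only when the crawl scope accepts the candidate) can be sketched as a scope check that gates the hand-off to the frontier. The names SketchCandidate, SketchFrontier, and ScopeGatedScheduler are hypothetical, and modelling the crawl scope as a java.util.function.Predicate is an assumption made for illustration; the real class works with CandidateURI, the Frontier, and the configured scope.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Predicate;

// Hypothetical stand-in for CandidateURI.
final class SketchCandidate {
    final String uri;

    SketchCandidate(String uri) {
        this.uri = uri;
    }
}

// Hypothetical stand-in for the Frontier: a simple FIFO of accepted candidates.
final class SketchFrontier {
    final Queue<SketchCandidate> queue = new ArrayDeque<>();

    void schedule(SketchCandidate candidate) {
        queue.add(candidate);
    }
}

// Hypothetical scheduler; the crawl scope is modelled as a Predicate (an assumption).
class ScopeGatedScheduler {
    private final Predicate<SketchCandidate> scope;
    private final SketchFrontier frontier;

    ScopeGatedScheduler(Predicate<SketchCandidate> scope, SketchFrontier frontier) {
        this.scope = scope;
        this.frontier = frontier;
    }

    // Returns true if the candidate was accepted by the scope and queued, false otherwise.
    protected boolean schedule(SketchCandidate caUri) {
        if (!scope.test(caUri)) {
            return false;
        }
        frontier.schedule(caUri);
        return true;
    }
}
```

Returning the scope decision lets a caller count or log rejected candidates without querying the frontier afterwards.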

isOverrideEnabled

public boolean isOverrideEnabled(java.lang.Object context)

createCandidateURI

protected CandidateURI createCandidateURI(CrawlURI curi,
                                          Link link)
                                   throws org.apache.commons.httpclient.URIException
Throws:
org.apache.commons.httpclient.URIException


Copyright © 2003-2005 Internet Archive. All Rights Reserved.