org.archive.crawler.postprocessor
Class Postselector
java.lang.Object
javax.management.Attribute
org.archive.crawler.settings.Type
org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ModuleType
org.archive.crawler.framework.Processor
org.archive.crawler.postprocessor.Postselector
- All Implemented Interfaces:
- CoreAttributeConstants, javax.management.DynamicMBean, FetchStatusCodes, java.io.Serializable
- public class Postselector
- extends Processor
- implements CoreAttributeConstants, FetchStatusCodes
Determines which extracted links and other discovered URIs are fed back into the Frontier.
In the future, it could also control whether the current URI is retried.
- Author:
- gojomo
- See Also:
- Serialized Form
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_TYPE, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_HTML_BASE, A_HTTP_TRANSACTION, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, filtersAccept, filtersAccept, finalTasks, getController, getDefaultNextProcessor, innerRejectProcess, isContentToProcess, isExpectedMimeType, isHttpTransactionContentToProcess, process, report, setDefaultNextProcessor, spawn
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, unsetAttribute
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
Methods inherited from class javax.management.Attribute
getName
Methods inherited from class java.lang.Object
clone, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
ATTR_LOG_REJECTS_ENABLED
public static final java.lang.String ATTR_LOG_REJECTS_ENABLED
- See Also:
- Constant Field Values
ATTR_LOG_REJECT_FILTERS
public static final java.lang.String ATTR_LOG_REJECT_FILTERS
- See Also:
- Constant Field Values
ATTR_SCHEDULE_EMBEDDED_LINKS
public static final java.lang.String ATTR_SCHEDULE_EMBEDDED_LINKS
- See Also:
- Constant Field Values
Postselector
public Postselector(java.lang.String name)
- Parameters:
name - Name of this processor.
initialTasks
protected void initialTasks()
- Description copied from class:
Processor
- Classes subclassing this one should override this method to perform
processor-specific actions.
This method is guaranteed to be called after the crawl is set up, but
before any URI processing has occurred.
- Overrides:
initialTasks in class Processor
innerProcess
protected void innerProcess(CrawlURI curi)
- Description copied from class:
Processor
- Classes subclassing this one should override this method to perform
their custom actions on the CrawlURI.
- Overrides:
innerProcess in class Processor
- Parameters:
curi - The CrawlURI being processed.
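The ordering of the two hooks documented above (initialTasks once after crawl setup, then innerProcess for each URI) can be illustrated with a minimal sketch. This is not Heritrix code: ProcessorSketch, LoggingProcessorSketch, and CrawlDriverSketch are hypothetical stand-ins, a plain String takes the place of CrawlURI, and the driver loop is only an assumption about how a framework might honour the documented guarantee.

```java
// Hypothetical stand-in for a processor base class.
// NOT org.archive.crawler.framework.Processor.
abstract class ProcessorSketch {
    // Hook: one-time, processor-specific setup.
    protected void initialTasks() { }

    // Hook: per-URI work. A plain String stands in for CrawlURI.
    protected void innerProcess(String curi) { }
}

// Example subclass that overrides both hooks and records the call order.
final class LoggingProcessorSketch extends ProcessorSketch {
    final java.util.List<String> events = new java.util.ArrayList<>();

    @Override
    protected void initialTasks() {
        events.add("initialTasks");
    }

    @Override
    protected void innerProcess(String curi) {
        events.add("innerProcess:" + curi);
    }
}

// Assumed driver: runs setup once, then hands each URI to the processor.
final class CrawlDriverSketch {
    static void run(ProcessorSketch processor, String... uris) {
        processor.initialTasks();          // before any URI processing
        for (String uri : uris) {
            processor.innerProcess(uri);   // custom action per URI
        }
    }
}
```

Calling CrawlDriverSketch.run with a LoggingProcessorSketch and two URIs records one initialTasks event followed by one innerProcess event per URI, in that order.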
handlePrerequisites
protected void handlePrerequisites(CrawlURI curi)
schedule
protected boolean schedule(CandidateURI caUri)
- Schedule the given CandidateURI with the Frontier.
- Parameters:
caUri - The CandidateURI to be scheduled.
- Returns:
- true if the CandidateURI was accepted by the crawl scope, false
otherwise.
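The documented contract of schedule (hand the candidate to the Frontier only when the crawl scope accepts it, and report the outcome) can be sketched as below. The types here are hypothetical illustrations, not the Heritrix API: ScopeSketch, FrontierSketch, and SchedulerSketch are invented names, candidates are plain strings rather than CandidateURI objects, and the real method may do additional work that this page does not describe.

```java
// Hypothetical stand-in for a crawl scope: decides whether a URI is in scope.
interface ScopeSketch {
    boolean accepts(String candidate);
}

// Hypothetical stand-in for a frontier: simply remembers what was scheduled.
final class FrontierSketch {
    final java.util.List<String> scheduled = new java.util.ArrayList<>();

    void schedule(String candidate) {
        scheduled.add(candidate);
    }
}

// Illustrates the "true if accepted by crawl scope, false otherwise" contract.
final class SchedulerSketch {
    private final ScopeSketch scope;
    private final FrontierSketch frontier;

    SchedulerSketch(ScopeSketch scope, FrontierSketch frontier) {
        this.scope = scope;
        this.frontier = frontier;
    }

    boolean schedule(String candidate) {
        if (!scope.accepts(candidate)) {
            return false;               // out of scope: nothing is scheduled
        }
        frontier.schedule(candidate);   // in scope: pass to the frontier
        return true;
    }
}
```

With a scope that accepts only URIs under one host, an in-scope candidate is added to the frontier and schedule returns true, while an out-of-scope candidate leaves the frontier unchanged and returns false.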
isOverrideEnabled
public boolean isOverrideEnabled(java.lang.Object context)
createCandidateURI
protected CandidateURI createCandidateURI(CrawlURI curi, Link link)
throws org.apache.commons.httpclient.URIException
- Throws:
org.apache.commons.httpclient.URIException
Copyright © 2003-2005 Internet Archive. All Rights Reserved.