org.archive.crawler.frontier
Class AdaptiveRevisitFrontier

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.frontier.AdaptiveRevisitFrontier
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants, FetchStatusCodes, UriUniqFilter.HasUriReceiver, CrawlStatusListener, Frontier, AdaptiveRevisitAttributeConstants, Reporter

public class AdaptiveRevisitFrontier
extends ModuleType
implements Frontier, FetchStatusCodes, CoreAttributeConstants, AdaptiveRevisitAttributeConstants, CrawlStatusListener, UriUniqFilter.HasUriReceiver

A Frontier that will repeatedly visit all encountered URIs.

Wait time between visits is configurable and varies based on observed changes of documents.

The Frontier borrows many things from HostQueuesFrontier, but implements an entirely different strategy in issuing URIs and consequently in keeping a record of discovered URIs.
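
As a rough illustration of how a wait time might vary with observed changes, the sketch below shortens the interval when a document changed and lengthens it when it did not, within fixed bounds. The class name, the factors and the bounds are all assumptions made for this example; the source does not state the actual adaptation rule.

```java
/** Hypothetical sketch of an adaptive revisit interval; not Heritrix code. */
final class AdaptiveWaitSketch {
    // Assumed tuning values, chosen only for illustration.
    static final long MIN_WAIT_MS = 60_000L;                // one minute
    static final long MAX_WAIT_MS = 7L * 24 * 3_600_000L;   // one week
    static final double UNCHANGED_FACTOR = 1.5;             // back off when nothing changed
    static final double CHANGED_FACTOR = 0.5;               // revisit sooner after a change

    /** Returns the next wait interval from the previous one and the change observation. */
    static long nextWait(long previousWaitMs, boolean contentChanged) {
        double factor = contentChanged ? CHANGED_FACTOR : UNCHANGED_FACTOR;
        long candidate = Math.round(previousWaitMs * factor);
        // Clamp the result into [MIN_WAIT_MS, MAX_WAIT_MS].
        return Math.max(MIN_WAIT_MS, Math.min(MAX_WAIT_MS, candidate));
    }

    public static void main(String[] args) {
        long wait = 3_600_000L;           // start at one hour
        wait = nextWait(wait, false);     // document unchanged: wait longer
        System.out.println(wait);
        wait = nextWait(wait, true);      // document changed: wait less
        System.out.println(wait);
    }
}
```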

Author:
Kristinn Sigurdsson
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Nested classes/interfaces inherited from interface org.archive.crawler.framework.Frontier
Frontier.FrontierGroup
 
Field Summary
protected static java.lang.String ACCEPTABLE_FORCE_QUEUE
          Acceptable characters in forced queue names.
static java.lang.String ATTR_DELAY_FACTOR
          How many multiples of last fetch elapsed time to wait before recontacting same server
static java.lang.String ATTR_FORCE_QUEUE
          Queue assignment to force on CrawlURIs.
static java.lang.String ATTR_HOST_VALENCE
          Maximum simultaneous requests in process to a host (queue)
static java.lang.String ATTR_MAX_DELAY
          Never wait more than this long, regardless of multiple
static java.lang.String ATTR_MAX_RETRIES
          Maximum times to emit a CrawlURI without final disposition
static java.lang.String ATTR_MIN_DELAY
          Always wait this long after one completion before recontacting same server, regardless of multiple
static java.lang.String ATTR_PREFERENCE_EMBED_HOPS
          Number of hops of embeds (ERX) to bump to front of host queue
static java.lang.String ATTR_QUEUE_ASSIGNMENT_POLICY
          The Class to use for QueueAssignmentPolicy
static java.lang.String ATTR_QUEUE_IGNORE_WWW
          Should the queue assignment ignore www in hostnames, effectively stripping them away.
static java.lang.String ATTR_RETRY_DELAY
          For retryable problems, seconds to wait before a retry
static java.lang.String ATTR_USE_URI_UNIQ_FILTER
          Should the Frontier use a separate 'already included' data structure or rely on the queues'.
protected static java.lang.String DEFAULT_FORCE_QUEUE
           
protected static java.lang.String DEFAULT_QUEUE_ASSIGNMENT_POLICY
           
protected static java.lang.Boolean DEFAULT_QUEUE_IGNORE_WWW
           
protected static java.lang.Boolean DEFAULT_USE_URI_UNIQ_FILTER
           
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.framework.Frontier
ATTR_NAME
 
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
 
Fields inherited from interface org.archive.crawler.frontier.AdaptiveRevisitAttributeConstants
A_CONTENT_STATE_KEY, A_DISCARD_REVISIT, A_FETCH_OVERDUE, A_LAST_CONTENT_DIGEST, A_LAST_DATESTAMP, A_LAST_ETAG, A_NUMBER_OF_VERSIONS, A_NUMBER_OF_VISITS, A_TIME_OF_NEXT_PROCESSING, A_WAIT_INTERVAL, A_WAIT_REEVALUATED, CONTENT_CHANGED, CONTENT_UNCHANGED, CONTENT_UNKNOWN
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Constructor Summary
AdaptiveRevisitFrontier(java.lang.String name)
           
AdaptiveRevisitFrontier(java.lang.String name, java.lang.String description)
           
 
Method Summary
 long averageDepth()
           
protected  void batchFlush()
           
protected  void batchSchedule(CandidateURI caUri)
           
protected  long calculateSnoozeTime(CrawlURI curi)
          Calculates how long a host queue needs to be snoozed following the crawling of a URI.
protected  java.lang.String canonicalize(CandidateURI cauri)
          Canonicalize passed CandidateURI.
protected  java.lang.String canonicalize(UURI uuri)
          Canonicalize passed uuri.
 float congestionRatio()
           
 void considerIncluded(UURI u)
          Notify Frontier that it should consider the given UURI as if already scheduled.
 void crawlCheckpoint(java.io.File checkpointDir)
          Called by CrawlController when checkpointing.
 void crawlEnded(java.lang.String sExitMessage)
          Called when a CrawlController has ended a crawl and is about to exit.
 void crawlEnding(java.lang.String sExitMessage)
          Called when a CrawlController is ending a crawl (for any reason)
 void crawlPaused(java.lang.String statusMessage)
          Called when a CrawlController is actually paused (all threads are idle).
 void crawlPausing(java.lang.String statusMessage)
          Called when a CrawlController is going to be paused.
 void crawlResuming(java.lang.String statusMessage)
          Called when a CrawlController is resuming a crawl that had been paused.
 void crawlStarted(java.lang.String message)
          Called on crawl start.
protected  UriUniqFilter createAlreadyIncluded()
          Create a UriUniqFilter that will serve as record of already seen URIs.
 long deepestUri()
           
 void deleted(CrawlURI curi)
          Notify Frontier that a CrawlURI has been deleted outside of the normal next()/finished() lifecycle.
 long deleteURIs(java.lang.String match)
          Delete any URI that matches the given regular expression from the list of discovered and pending URIs.
 long deleteURIs(java.lang.String uriMatch, java.lang.String queueMatch)
          Delete any URI that matches the given regular expression from the list of discovered and pending URIs, if it is in a queue with a name matching the second regular expression.
 long discoveredUriCount()
          Number of discovered URIs.
protected  void disregardDisposition(CrawlURI curi)
           
 long disregardedUriCount()
          Number of URIs that were scheduled at one point but have been disregarded.
 long failedFetchCount()
          Number of URIs that failed to process.
protected  void failureDisposition(CrawlURI curi)
          The CrawlURI has encountered a problem, and will not be retried.
 void finalTasks()
          Perform any final tasks *before* notification crawl has reached 'FINISHED' status.
 void finished(CrawlURI curi)
          Report a URI being processed as having finished processing.
 long finishedUriCount()
          Number of URIs that have finished processing.
 java.lang.String getClassKey(CandidateURI cauri)
           
 FrontierJournal getFrontierJournal()
           
 Frontier.FrontierGroup getGroup(CrawlURI curi)
          Get the 'frontier group' (usually queue) for the given CrawlURI.
protected  AdaptiveRevisitHostQueue getHQ(CrawlURI curi)
          Get the AdaptiveRevisitHostQueue for the given CrawlURI, creating it if necessary.
 FrontierMarker getInitialMarker(java.lang.String regexpr, boolean inCacheOnly)
          Get a URIFrontierMarker initialized with the given regular expression at the 'start' of the Frontier.
 java.lang.String[] getReports()
          Get an array of report names offered by this Reporter.
protected  CrawlServer getServer(CrawlURI curi)
           
 java.util.ArrayList<java.lang.String> getURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose)
          Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached.
 void importRecoverLog(java.lang.String pathToLog)
          Method is not supported by this Frontier implementation.
 void importRecoverLog(java.lang.String pathToLog, boolean retainFailures)
          This method is not supported by this Frontier implementation
 void initialize(CrawlController c)
          Initialize the Frontier.
protected  void innerFinished(CrawlURI curi)
           
protected  void innerSchedule(CandidateURI caUri)
           
protected  boolean isDisregarded(CrawlURI curi)
           
 boolean isEmpty()
          Returns true if the frontier contains no more URIs to crawl.
 void kickUpdate()
          Notify Frontier that it should consider updating configuration info that may have changed in external files.
 void loadSeeds()
          Loads the seeds
protected  boolean needsPromptRetry(CrawlURI curi)
          Checks if a recently completed CrawlURI that did not finish successfully needs to be retried immediately (processed again as soon as politeness allows).
protected  boolean needsRetrying(CrawlURI curi)
          Checks if a recently completed CrawlURI that did not finish successfully needs to be retried (processed again after some time elapses)
 CrawlURI next()
          Get the next URI that should be processed.
 void pause()
          Notify Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise.
 long queuedUriCount()
          Number of URIs queued up and waiting for processing.
 void receive(CandidateURI item)
           
 void reportTo(java.io.PrintWriter writer)
          Make a default report to the passed-in Writer.
 void reportTo(java.lang.String name, java.io.PrintWriter writer)
          Make a report of the given name to the passed-in Writer. If the name is null, give the default report.
protected  void reschedule(CrawlURI curi, boolean errorWait)
          Put near top of relevant hostQueue (but behind anything recently scheduled 'high').
 void schedule(CandidateURI caURI)
          Schedules a CandidateURI.
protected  boolean shouldBeForgotten(CrawlURI curi)
          Some URIs, if they recur, deserve another chance at consideration: they might not be too many hops away via another path, or the scope may have been updated to allow them passage.
 java.lang.String singleLineLegend()
          Return a legend for the single-line summary report as a String.
 java.lang.String singleLineReport()
          Return a short single-line summary report as a String.
 void singleLineReportTo(java.io.PrintWriter w)
          Make a single-line summary report to the passed-in writer
 void start()
          Request that Frontier allow crawling to begin.
 long succeededFetchCount()
          Number of successfully processed URIs.
protected  void successDisposition(CrawlURI curi)
          The CrawlURI has been successfully crawled.
 void terminate()
          Notify Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException.
 long totalBytesWritten()
          Total number of bytes contained in all URIs that have been processed.
 void unpause()
          Resumes the release of URIs to crawl, allowing worker ToeThreads to proceed.
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

ATTR_DELAY_FACTOR

public static final java.lang.String ATTR_DELAY_FACTOR
How many multiples of last fetch elapsed time to wait before recontacting same server

See Also:
Constant Field Values

ATTR_MIN_DELAY

public static final java.lang.String ATTR_MIN_DELAY
Always wait this long after one completion before recontacting same server, regardless of multiple

See Also:
Constant Field Values

ATTR_MAX_DELAY

public static final java.lang.String ATTR_MAX_DELAY
Never wait more than this long, regardless of multiple

See Also:
Constant Field Values

ATTR_MAX_RETRIES

public static final java.lang.String ATTR_MAX_RETRIES
Maximum times to emit a CrawlURI without final disposition

See Also:
Constant Field Values

ATTR_RETRY_DELAY

public static final java.lang.String ATTR_RETRY_DELAY
For retryable problems, seconds to wait before a retry

See Also:
Constant Field Values

ATTR_HOST_VALENCE

public static final java.lang.String ATTR_HOST_VALENCE
Maximum simultaneous requests in process to a host (queue)

See Also:
Constant Field Values

ATTR_PREFERENCE_EMBED_HOPS

public static final java.lang.String ATTR_PREFERENCE_EMBED_HOPS
Number of hops of embeds (ERX) to bump to front of host queue

See Also:
Constant Field Values

ATTR_FORCE_QUEUE

public static final java.lang.String ATTR_FORCE_QUEUE
Queue assignment to force on CrawlURIs. Intended to be used via overrides

See Also:
Constant Field Values

DEFAULT_FORCE_QUEUE

protected static final java.lang.String DEFAULT_FORCE_QUEUE
See Also:
Constant Field Values

ACCEPTABLE_FORCE_QUEUE

protected static final java.lang.String ACCEPTABLE_FORCE_QUEUE
Acceptable characters in forced queue names. Word chars, dash, period, comma, colon

See Also:
Constant Field Values
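
A forced queue name could be checked against such a character set as in the sketch below. The pattern is an assumption reconstructed from the description above (word characters, dash, period, comma, colon), not the constant's confirmed value, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical validator for forced queue names; not Heritrix code. */
final class ForceQueueNameSketch {
    // Assumed pattern: word characters, dash, period, comma, colon.
    static final Pattern ACCEPTABLE = Pattern.compile("[-\\w.,:]*");

    /** Returns true if every character of the name is in the accepted set. */
    static boolean isAcceptable(String queueName) {
        return ACCEPTABLE.matcher(queueName).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("example.org:8080")); // accepted characters only
        System.out.println(isAcceptable("bad name"));         // contains a space
    }
}
```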

ATTR_QUEUE_IGNORE_WWW

public static final java.lang.String ATTR_QUEUE_IGNORE_WWW
Should the queue assignment ignore www in hostnames, effectively stripping them away.

See Also:
Constant Field Values
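
The effect of ignoring www can be pictured with the minimal sketch below, which derives a queue key from a hostname. It assumes that only a leading "www." label is stripped; the class and method names are hypothetical and the real key derivation may differ.

```java
/** Hypothetical queue-key derivation; not Heritrix code. */
final class QueueKeySketch {
    /** Returns the hostname, minus a leading "www." when ignoreWww is set. */
    static String queueKey(String host, boolean ignoreWww) {
        if (ignoreWww && host.startsWith("www.")) {
            return host.substring("www.".length());
        }
        return host;
    }

    public static void main(String[] args) {
        // With the setting enabled, both hosts share one queue key.
        System.out.println(queueKey("www.archive.org", true));
        System.out.println(queueKey("archive.org", true));
    }
}
```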

DEFAULT_QUEUE_IGNORE_WWW

protected static final java.lang.Boolean DEFAULT_QUEUE_IGNORE_WWW

ATTR_USE_URI_UNIQ_FILTER

public static final java.lang.String ATTR_USE_URI_UNIQ_FILTER
Should the Frontier use a separate 'already included' data structure or rely on the queues'.

See Also:
Constant Field Values

DEFAULT_USE_URI_UNIQ_FILTER

protected static final java.lang.Boolean DEFAULT_USE_URI_UNIQ_FILTER

ATTR_QUEUE_ASSIGNMENT_POLICY

public static final java.lang.String ATTR_QUEUE_ASSIGNMENT_POLICY
The Class to use for QueueAssignmentPolicy

See Also:
Constant Field Values

DEFAULT_QUEUE_ASSIGNMENT_POLICY

protected static final java.lang.String DEFAULT_QUEUE_ASSIGNMENT_POLICY

Constructor Detail

AdaptiveRevisitFrontier

public AdaptiveRevisitFrontier(java.lang.String name)

AdaptiveRevisitFrontier

public AdaptiveRevisitFrontier(java.lang.String name,
                               java.lang.String description)

Method Detail

initialize

public void initialize(CrawlController c)
                throws FatalConfigurationException,
                       java.io.IOException
Description copied from interface: Frontier
Initialize the Frontier.

This method is invoked by the CrawlController once it has created the Frontier. The constructor of the Frontier should only contain code for setting up its settings framework. This method should contain all other 'startup' code.

Specified by:
initialize in interface Frontier
Parameters:
c - The CrawlController that created the Frontier.
Throws:
FatalConfigurationException - If provided settings are illegal or otherwise unusable.
java.io.IOException - If there is a problem reading settings or seeds file from disk.

createAlreadyIncluded

protected UriUniqFilter createAlreadyIncluded()
                                       throws java.io.IOException
Create a UriUniqFilter that will serve as record of already seen URIs.

Returns:
A UriUniqFilter that will serve as a record of already seen URIs
Throws:
java.io.IOException

loadSeeds

public void loadSeeds()
Loads the seeds

This method is called by initialize() and kickUpdate()

Specified by:
loadSeeds in interface Frontier

getClassKey

public java.lang.String getClassKey(CandidateURI cauri)
Specified by:
getClassKey in interface Frontier
Parameters:
cauri - CandidateURI for which we're to calculate and set class key.
Returns:
Classkey for cauri.

canonicalize

protected java.lang.String canonicalize(UURI uuri)
Canonicalize passed uuri. It would be sweeter if this canonicalize function was encapsulated by that which it canonicalizes, but because settings change with context -- i.e. there may be overrides in operation for a particular URI -- it's not so easy; each CandidateURI would need a reference to the settings system. That's awkward to pass in.

Parameters:
uuri - Candidate URI to canonicalize.
Returns:
Canonicalized version of passed uuri.

canonicalize

protected java.lang.String canonicalize(CandidateURI cauri)
Canonicalize passed CandidateURI. This method differs from canonicalize(UURI) in that it takes a look at the CandidateURI context, possibly overriding any canonicalization effect if it could make us miss content. If canonicalization produces a URL that was 'alreadyseen', but the entry in the 'alreadyseen' database did nothing but redirect to the current URL, we won't get the current URL; we'll think we've already seen it. Examples would be archive.org redirecting to www.archive.org or the inverse, www.netarkivet.net redirecting to netarkivet.net (assuming stripWWW rule enabled).

Note, this method under some circumstances sets the forceFetch flag.

Parameters:
cauri - CandidateURI to examine.
Returns:
Canonicalized cauri.
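
The redirect case described for canonicalize(CandidateURI) can be sketched as follows. The sketch uses plain strings and a toy strip-www rule in place of the real canonicalization rules. The Result type, the viaWasRedirect parameter and the exact condition are assumptions for illustration, not the confirmed implementation.

```java
/** Hypothetical sketch of context-aware canonicalization; not Heritrix code. */
final class CanonicalizeSketch {
    /** Canonical form plus whether a fetch should be forced. */
    static final class Result {
        final String canonical;
        final boolean forceFetch;
        Result(String canonical, boolean forceFetch) {
            this.canonical = canonical;
            this.forceFetch = forceFetch;
        }
    }

    /** Toy rule standing in for the real rules: strip a leading "www." from the host. */
    static String canonicalize(String uri) {
        return uri.replaceFirst("^(https?://)www\\.", "$1");
    }

    /**
     * Canonicalizes uri. If uri was reached by a redirect from via, and both
     * canonicalize to the same string although they differ literally, the
     * 'already seen' check would wrongly skip uri, so forceFetch is set.
     */
    static Result canonicalize(String uri, String via, boolean viaWasRedirect) {
        String canonical = canonicalize(uri);
        boolean force = viaWasRedirect
                && via != null
                && !uri.equals(via)
                && canonical.equals(canonicalize(via));
        return new Result(canonical, force);
    }

    public static void main(String[] args) {
        // archive.org redirected to www.archive.org
        Result r = canonicalize("http://www.archive.org/", "http://archive.org/", true);
        System.out.println(r.canonical + " forceFetch=" + r.forceFetch);
    }
}
```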

innerSchedule

protected void innerSchedule(CandidateURI caUri)
Parameters:
caUri - The URI to schedule.

getHQ

protected AdaptiveRevisitHostQueue getHQ(CrawlURI curi)
                                  throws java.io.IOException
Get the AdaptiveRevisitHostQueue for the given CrawlURI, creating it if necessary.

Parameters:
curi - CrawlURI for which to get a queue
Returns:
AdaptiveRevisitHostQueue for given CrawlURI
Throws:
java.io.IOException

batchSchedule

protected void batchSchedule(CandidateURI caUri)

batchFlush

protected void batchFlush()

getServer

protected CrawlServer getServer(CrawlURI curi)
Parameters:
curi -
Returns:
the CrawlServer to be associated with this CrawlURI

next

public CrawlURI next()
              throws java.lang.InterruptedException,
                     EndedException
Description copied from interface: Frontier
Get the next URI that should be processed. If no URI becomes available during the time specified, null will be returned.

Specified by:
next in interface Frontier
Returns:
the next URI that should be processed.
Throws:
java.lang.InterruptedException
EndedException

isEmpty

public boolean isEmpty()
Description copied from interface: Frontier
Returns true if the frontier contains no more URIs to crawl.

That is to say that there are no more URIs either currently available (ready to be emitted), URIs belonging to deferred hosts or pending URIs in the Frontier. Thus this method may return false even if there is no currently available URI.

Specified by:
isEmpty in interface Frontier
Returns:
true if the frontier contains no more URIs to crawl.

schedule

public void schedule(CandidateURI caURI)
Description copied from interface: Frontier
Schedules a CandidateURI.

This method accepts one URI and schedules it immediately. This has nothing to do with the priority of the URI being scheduled, only that it will be placed in its respective queue at once. For priority scheduling see CandidateURI.setSchedulingDirective(int)

This method should be synchronized in all implementing classes.

Specified by:
schedule in interface Frontier
Parameters:
caURI - The URI to schedule.
See Also:
CandidateURI.setSchedulingDirective(int)

finished

public void finished(CrawlURI curi)
Description copied from interface: Frontier
Report a URI being processed as having finished processing.

ToeThreads will invoke this method once they have completed work on their assigned URI.

This method is synchronized.

Specified by:
finished in interface Frontier
Parameters:
curi - The URI that has finished processing.

innerFinished

protected void innerFinished(CrawlURI curi)

successDisposition

protected void successDisposition(CrawlURI curi)
The CrawlURI has been successfully crawled.

Parameters:
curi - The CrawlURI

reschedule

protected void reschedule(CrawlURI curi,
                          boolean errorWait)
                   throws javax.management.AttributeNotFoundException
Put near top of relevant hostQueue (but behind anything recently scheduled 'high').

Parameters:
curi - CrawlURI to reschedule. Its time of next processing is not modified.
errorWait - signals if there should be a wait before retrying.
Throws:
javax.management.AttributeNotFoundException

failureDisposition

protected void failureDisposition(CrawlURI curi)
The CrawlURI has encountered a problem, and will not be retried.

Parameters:
curi - The CrawlURI

disregardDisposition

protected void disregardDisposition(CrawlURI curi)

shouldBeForgotten

protected boolean shouldBeForgotten(CrawlURI curi)
Some URIs, if they recur, deserve another chance at consideration: they might not be too many hops away via another path, or the scope may have been updated to allow them passage.

Parameters:
curi -
Returns:
True if curi should be forgotten.

needsPromptRetry

protected boolean needsPromptRetry(CrawlURI curi)
                            throws javax.management.AttributeNotFoundException
Checks if a recently completed CrawlURI that did not finish successfully needs to be retried immediately (processed again as soon as politeness allows).

Parameters:
curi - The CrawlURI to check
Returns:
True if we need to retry promptly.
Throws:
javax.management.AttributeNotFoundException - If problems occur trying to read the maximum number of retries from the settings framework.

needsRetrying

protected boolean needsRetrying(CrawlURI curi)
                         throws javax.management.AttributeNotFoundException
Checks if a recently completed CrawlURI that did not finish successfully needs to be retried (processed again after some time elapses)

Parameters:
curi - The CrawlURI to check
Returns:
True if we need to retry.
Throws:
javax.management.AttributeNotFoundException - If problems occur trying to read the maximum number of retries from the settings framework.
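
A retry decision of this kind might combine the outcome of the last attempt with the configured maximum number of retries and the retry delay, as in the sketch below. The Outcome enum, the attempt counter and both method names are hypothetical; the actual checks performed by needsRetrying are not described in this document.

```java
/** Hypothetical retry decision; not Heritrix code. */
final class RetrySketch {
    /** Assumed coarse classification of how the last attempt ended. */
    enum Outcome { SUCCESS, TRANSIENT_FAILURE, PERMANENT_FAILURE }

    /** Retry only transient failures, and only while attempts remain. */
    static boolean needsRetrying(Outcome outcome, int fetchAttempts, int maxRetries) {
        return outcome == Outcome.TRANSIENT_FAILURE && fetchAttempts < maxRetries;
    }

    /** Earliest time of the next attempt, given a retry delay in seconds. */
    static long retryAtMillis(long nowMillis, long retryDelaySeconds) {
        return nowMillis + retryDelaySeconds * 1000L;
    }

    public static void main(String[] args) {
        System.out.println(needsRetrying(Outcome.TRANSIENT_FAILURE, 2, 30));
        System.out.println(retryAtMillis(System.currentTimeMillis(), 900L));
    }
}
```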

isDisregarded

protected boolean isDisregarded(CrawlURI curi)

calculateSnoozeTime

protected long calculateSnoozeTime(CrawlURI curi)
Calculates how long a host queue needs to be snoozed following the crawling of a URI.

Parameters:
curi - The CrawlURI
Returns:
How long to snooze.
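
Read together with the delay-factor, min-delay and max-delay settings described under Field Detail, a snooze calculation could look like the sketch below: the last fetch duration is multiplied by the delay factor, raised to the minimum delay, and then capped at the maximum delay. The method signature and the order of the two bounds are assumptions for illustration.

```java
/** Hypothetical politeness delay calculation; not Heritrix code. */
final class SnoozeSketch {
    /**
     * Scales the last fetch duration by delayFactor, then applies the minimum
     * and finally the maximum, so the cap wins if the bounds ever conflict.
     */
    static long snoozeMillis(long lastFetchDurationMs, double delayFactor,
                             long minDelayMs, long maxDelayMs) {
        long scaled = Math.round(lastFetchDurationMs * delayFactor);
        return Math.min(maxDelayMs, Math.max(minDelayMs, scaled));
    }

    public static void main(String[] args) {
        // A 400 ms fetch with factor 5 and bounds of 500 ms and 5000 ms.
        System.out.println(snoozeMillis(400L, 5.0, 500L, 5_000L));
    }
}
```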

discoveredUriCount

public long discoveredUriCount()
Description copied from interface: Frontier
Number of discovered URIs.

That is any URI that has been confirmed to be within 'scope' (i.e. the Frontier decides that it should be processed). This includes those that have been processed, are being processed and have finished processing. Does not include URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely due to operator changing scope definition).

Note: This only counts discovered URIs. Since the same URI can (at least in most frontiers) be fetched multiple times, this number may be somewhat lower than the queued, in-process and finished items combined, because duplicate URIs can be queued and processed. This variance is likely to be especially high in Frontiers implementing 'revisit' strategies.

Specified by:
discoveredUriCount in interface Frontier
Returns:
Number of discovered URIs.

queuedUriCount

public long queuedUriCount()
Description copied from interface: Frontier
Number of URIs queued up and waiting for processing.

This includes any URIs that failed but will be retried. Basically this is any discovered URI that has not either been processed or is being processed. The same discovered URI can be queued multiple times.

Specified by:
queuedUriCount in interface Frontier
Returns:
Number of queued URIs.

finishedUriCount

public long finishedUriCount()
Description copied from interface: Frontier
Number of URIs that have finished processing.

Includes both those that were processed successfully and failed to be processed (excluding those that failed but will be retried). Does not include those URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely due to operator changing scope definition).

Specified by:
finishedUriCount in interface Frontier
Returns:
Number of finished URIs.

succeededFetchCount

public long succeededFetchCount()
Description copied from interface: Frontier
Number of successfully processed URIs.

Any URI that was processed successfully. This includes URIs that returned 404s and other error codes that do not originate within the crawler.

Specified by:
succeededFetchCount in interface Frontier
Returns:
Number of successfully processed URIs.

failedFetchCount

public long failedFetchCount()
Description copied from interface: Frontier
Number of URIs that failed to process.

URIs that could not be processed because of some error or failure in the processing chain. Can include failure to acquire prerequisites, to establish a connection with the host and any number of other problems. Does not count those that will be retried, only those that have permanently failed.

Specified by:
failedFetchCount in interface Frontier
Returns:
Number of URIs that failed to process.

disregardedUriCount

public long disregardedUriCount()
Description copied from interface: Frontier
Number of URIs that were scheduled at one point but have been disregarded.

Counts any URI that is scheduled only to be disregarded because it is determined to lie outside the scope of the crawl. Most commonly this will be due to robots.txt exclusions.

Specified by:
disregardedUriCount in interface Frontier
Returns:
The number of URIs that have been disregarded.

totalBytesWritten

public long totalBytesWritten()
Description copied from interface: Frontier
Total number of bytes contained in all URIs that have been processed.

Specified by:
totalBytesWritten in interface Frontier
Returns:
The total amounts of bytes in all processed URIs.

importRecoverLog

public void importRecoverLog(java.lang.String pathToLog)
                      throws java.io.IOException
Method is not supported by this Frontier implementation.

Parameters:
pathToLog -
Throws:
java.io.IOException

getInitialMarker

public FrontierMarker getInitialMarker(java.lang.String regexpr,
                                       boolean inCacheOnly)
Description copied from interface: Frontier
Get a URIFrontierMarker initialized with the given regular expression at the 'start' of the Frontier.

Specified by:
getInitialMarker in interface Frontier
Parameters:
regexpr - The regular expression that URIs within the frontier must match to be considered within the scope of this marker
inCacheOnly - If set to true, only those URIs within the frontier that are stored in cache (usually this means in memory rather than on disk, but that is an implementation detail) will be considered. Others will be entirely ignored, as if they don't exist. This is useful for quick peeks at the top of the URI list.
Returns:
A URIFrontierMarker that is set for the 'start' of the frontier's URI list.

getURIsList

public java.util.ArrayList<java.lang.String> getURIsList(FrontierMarker marker,
                                                         int numberOfMatches,
                                                         boolean verbose)
                                                  throws InvalidFrontierMarkerException
Description copied from interface: Frontier
Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached.

Any encountered URI that has not been successfully crawled, terminally failed, disregarded or is currently being processed is included. As there may be duplicates in the frontier, there may also be duplicates in the report. Thus this includes both discovered and pending URIs.

The list is a set of strings containing the URI strings. If verbose is true the string will include some additional information (path to URI and parent).

The URIFrontierMarker will be advanced to the position at which its maximum number of matches found is reached. Reusing it for subsequent calls will thus effectively get the 'next' batch. Making any changes to the frontier can invalidate the marker.

While the order returned is consistent, it does not have any explicit relation to the likely order in which they may be processed.

Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.

Specified by:
getURIsList in interface Frontier
Parameters:
marker - A marker specifying from what position in the Frontier the list should begin.
numberOfMatches - how many URIs to add at most to the list before returning it
verbose - if set to true the strings returned will contain additional information about each URI beyond their names.
Returns:
a list of all pending URIs falling within the specification of the marker
Throws:
InvalidFrontierMarkerException - when the URIFrontierMarker does not match the internal state of the frontier. Tolerance for this can vary considerably from one URIFrontier implementation to the next.
See Also:
FrontierMarker, Frontier.getInitialMarker(String, boolean)
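
The batching behaviour described for getURIsList, where reusing the marker yields the 'next' batch, can be illustrated with the self-contained sketch below. It pages through an in-memory list. The Marker class here is a hypothetical stand-in for FrontierMarker, and whole-string regex matching is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of marker-based paging; not Heritrix code. */
final class MarkerPagingSketch {
    /** Stand-in for a frontier marker: a regular expression plus a position. */
    static final class Marker {
        final Pattern pattern;
        int nextIndex = 0;
        Marker(String regex) {
            this.pattern = Pattern.compile(regex);
        }
    }

    /** Collects up to numberOfMatches matching URIs and advances the marker. */
    static List<String> getUrisList(List<String> pending, Marker marker, int numberOfMatches) {
        List<String> out = new ArrayList<>();
        while (marker.nextIndex < pending.size() && out.size() < numberOfMatches) {
            String uri = pending.get(marker.nextIndex++);
            if (marker.pattern.matcher(uri).matches()) {
                out.add(uri);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> pending = Arrays.asList(
                "http://a.org/1", "http://b.org/1", "http://a.org/2", "http://a.org/3");
        Marker marker = new Marker(".*a\\.org.*");
        // Each call returns the next batch until the list is exhausted.
        List<String> batch;
        while (!(batch = getUrisList(pending, marker, 2)).isEmpty()) {
            System.out.println(batch);
        }
    }
}
```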

deleteURIs

public long deleteURIs(java.lang.String match)
Description copied from interface: Frontier
Delete any URI that matches the given regular expression from the list of discovered and pending URIs. This does not prevent them from being rediscovered.

Any encountered URI that has not been successfully crawled, terminally failed, disregarded or is currently being processed is considered to be a pending URI.

Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.

Specified by:
deleteURIs in interface Frontier
Parameters:
match - A regular expression; any URI that matches it will be deleted.
Returns:
The number of URIs deleted

deleteURIs

public long deleteURIs(java.lang.String uriMatch,
                       java.lang.String queueMatch)
Description copied from interface: Frontier
Delete any URI that matches the given regular expression from the list of discovered and pending URIs, if it is in a queue with a name matching the second regular expression. This does not prevent them from being rediscovered.

Any encountered URI that has not been successfully crawled, terminally failed, disregarded or is currently being processed is considered to be a pending URI.

Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.

Specified by:
deleteURIs in interface Frontier
Parameters:
uriMatch - A regular expression; any URI that matches it will be deleted from the affected queues.
queueMatch - A regular expression; any queue whose name matches it will have its URIs checked. A null value means all queues.
Returns:
The number of URIs deleted
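
The two-argument form can be sketched in a few lines of plain Java. This is an illustration only, not the Heritrix implementation: the queues are modelled as a hypothetical map from queue name to a mutable list of URI strings, and full-string matching is an assumption.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class DeleteUrisSketch {
    /**
     * Removes every URI matching uriMatch from each queue whose name matches
     * queueMatch. A null queueMatch selects all queues.
     * Returns the number of URIs removed.
     */
    public static long deleteURIs(Map<String, List<String>> queues,
                                  String uriMatch, String queueMatch) {
        Pattern uriPattern = Pattern.compile(uriMatch);
        Pattern queuePattern = (queueMatch == null) ? null : Pattern.compile(queueMatch);
        long deleted = 0;
        for (Map.Entry<String, List<String>> entry : queues.entrySet()) {
            // Skip queues whose name does not match (assumption: full match).
            if (queuePattern != null && !queuePattern.matcher(entry.getKey()).matches()) {
                continue;
            }
            Iterator<String> it = entry.getValue().iterator();
            while (it.hasNext()) {
                if (uriPattern.matcher(it.next()).matches()) {
                    it.remove(); // safe removal while iterating
                    deleted++;
                }
            }
        }
        return deleted;
    }
}
```

Nothing in the sketch records the removed URIs, so a later discovery of the same URI would add it again, consistent with the note that deletion does not prevent rediscovery.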

deleted

public void deleted(CrawlURI curi)
Description copied from interface: Frontier
Notify Frontier that a CrawlURI has been deleted outside of the normal next()/finished() lifecycle.

Specified by:
deleted in interface Frontier
Parameters:
curi - Deleted CrawlURI.

considerIncluded

public void considerIncluded(UURI u)
Description copied from interface: Frontier
Notify Frontier that it should consider the given UURI as if already scheduled.

Specified by:
considerIncluded in interface Frontier
Parameters:
u - UURI instance to add to the Already Included set.

kickUpdate

public void kickUpdate()
Description copied from interface: Frontier
Notify Frontier that it should consider updating configuration info that may have changed in external files.

Specified by:
kickUpdate in interface Frontier

start

public void start()
Description copied from interface: Frontier
Request that Frontier allow crawling to begin. Usually just unpauses Frontier, if paused.

Specified by:
start in interface Frontier

pause

public void pause()
Description copied from interface: Frontier
Notify Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise.

Specified by:
pause in interface Frontier

unpause

public void unpause()
Description copied from interface: Frontier
Resumes the release of URIs to crawl, allowing worker ToeThreads to proceed.

Specified by:
unpause in interface Frontier
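
One common way to hold worker threads until instructed otherwise is a monitor-based gate. The sketch below is a minimal illustration under that assumption; PauseGateSketch, awaitRelease and demo are hypothetical names, and nothing here is taken from the Heritrix source.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class PauseGateSketch {
    private boolean paused = false;

    /** Stop releasing work: threads calling awaitRelease() will block. */
    public synchronized void pause() {
        paused = true;
    }

    /** Resume releasing work and wake every blocked thread. */
    public synchronized void unpause() {
        paused = false;
        notifyAll();
    }

    /** Called by a worker before taking the next item; blocks while paused. */
    public synchronized void awaitRelease() throws InterruptedException {
        while (paused) {
            wait();
        }
    }

    /** Returns { worker was held while paused, worker proceeded after unpause }. */
    public static boolean[] demo() {
        PauseGateSketch gate = new PauseGateSketch();
        AtomicBoolean released = new AtomicBoolean(false);
        gate.pause();
        Thread worker = new Thread(() -> {
            try {
                gate.awaitRelease();
                released.set(true);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        try {
            Thread.sleep(100); // give the worker time to reach the gate
            boolean heldWhilePaused = !released.get();
            gate.unpause();
            worker.join(2000);
            return new boolean[] { heldWhilePaused, released.get() };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

The wait() call sits inside a while loop so that a spurious wakeup does not release a worker while the gate is still paused.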

terminate

public void terminate()
Description copied from interface: Frontier
Notify Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException.

Specified by:
terminate in interface Frontier

getFrontierJournal

public FrontierJournal getFrontierJournal()
Specified by:
getFrontierJournal in interface Frontier
Returns:
Return the instance of FrontierJournal that this Frontier is using. May be null if no journaling.

importRecoverLog

public void importRecoverLog(java.lang.String pathToLog,
                             boolean retainFailures)
                      throws java.io.IOException
This method is not supported by this Frontier implementation.

Specified by:
importRecoverLog in interface Frontier
Parameters:
pathToLog -
retainFailures -
Throws:
java.io.IOException

getReports

public java.lang.String[] getReports()
Description copied from interface: Reporter
Get an array of report names offered by this Reporter. A name in brackets indicates that a free-form String, in accordance with the informal description inside the brackets, may yield a useful report.

Specified by:
getReports in interface Reporter
Returns:
String array of report names, empty if there is only one report type

singleLineReport

public java.lang.String singleLineReport()
Description copied from interface: Reporter
Return a short single-line summary report as a String.

Specified by:
singleLineReport in interface Reporter
Returns:
String single-line summary report

reportTo

public void reportTo(java.io.PrintWriter writer)
              throws java.io.IOException
Description copied from interface: Reporter
Make a default report to the passed-in Writer. Should be equivalent to reportTo(null, writer).

Specified by:
reportTo in interface Reporter
Parameters:
writer - to receive report
Throws:
java.io.IOException

singleLineReportTo

public void singleLineReportTo(java.io.PrintWriter w)
                        throws java.io.IOException
Description copied from interface: Reporter
Make a single-line summary report to the passed-in writer

Specified by:
singleLineReportTo in interface Reporter
Parameters:
w - to receive report
Throws:
java.io.IOException

singleLineLegend

public java.lang.String singleLineLegend()
Description copied from interface: Reporter
Return a legend for the single-line summary report as a String.

Specified by:
singleLineLegend in interface Reporter
Returns:
String single-line summary legend

reportTo

public void reportTo(java.lang.String name,
                     java.io.PrintWriter writer)
Description copied from interface: Reporter
Make a report of the given name to the passed-in Writer. If the name is null, give the default report.

Specified by:
reportTo in interface Reporter
Parameters:
writer - to receive report
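
The relationship between the two reportTo overloads can be shown with a small stand-alone sketch. ReporterSketch and its report texts are hypothetical; the only point illustrated is that the one-argument form delegates to the named form with a null name.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

class ReporterSketch {
    /** Default report: delegates to the named form with a null name. */
    public void reportTo(PrintWriter writer) {
        reportTo(null, writer);
    }

    /** Named report; a null name selects the default report. */
    public void reportTo(String name, PrintWriter writer) {
        if (name == null) {
            writer.print("default report");   // placeholder report body
        } else {
            writer.print("report: " + name);  // placeholder report body
        }
        writer.flush();
    }

    /** Helper that renders a report into a String for inspection. */
    public static String render(String name, boolean useDefaultOverload) {
        StringWriter buffer = new StringWriter();
        PrintWriter writer = new PrintWriter(buffer);
        ReporterSketch reporter = new ReporterSketch();
        if (useDefaultOverload) {
            reporter.reportTo(writer);
        } else {
            reporter.reportTo(name, writer);
        }
        return buffer.toString();
    }
}
```

Routing the default overload through the named one keeps a single place where report selection is decided.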

finalTasks

public void finalTasks()
Description copied from interface: Frontier
Perform any final tasks *before* notification that the crawl has reached 'FINISHED' status. (For example, anything that needs to dump final data to disk/logs.)

Specified by:
finalTasks in interface Frontier

crawlStarted

public void crawlStarted(java.lang.String message)
Description copied from interface: CrawlStatusListener
Called on crawl start.

Specified by:
crawlStarted in interface CrawlStatusListener
Parameters:
message - Start message.

crawlEnding

public void crawlEnding(java.lang.String sExitMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is ending a crawl (for any reason)

Specified by:
crawlEnding in interface CrawlStatusListener
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlJob

crawlEnded

public void crawlEnded(java.lang.String sExitMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController has ended a crawl and is about to exit.

Specified by:
crawlEnded in interface CrawlStatusListener
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlJob

crawlPausing

public void crawlPausing(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is going to be paused.

Specified by:
crawlPausing in interface CrawlStatusListener
Parameters:
statusMessage - Should be STATUS_WAITING_FOR_PAUSE. Passed for convenience

crawlPaused

public void crawlPaused(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is actually paused (all threads are idle).

Specified by:
crawlPaused in interface CrawlStatusListener
Parameters:
statusMessage - Should be CrawlJob.STATUS_PAUSED. Passed for convenience

crawlResuming

public void crawlResuming(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is resuming a crawl that had been paused.

Specified by:
crawlResuming in interface CrawlStatusListener
Parameters:
statusMessage - Should be CrawlJob.STATUS_RUNNING. Passed for convenience

crawlCheckpoint

public void crawlCheckpoint(java.io.File checkpointDir)
                     throws java.lang.Exception
Description copied from interface: CrawlStatusListener
Called by CrawlController when checkpointing.

Specified by:
crawlCheckpoint in interface CrawlStatusListener
Parameters:
checkpointDir - Checkpoint dir. Write checkpoint state here.
Throws:
java.lang.Exception - A fatal exception. Any exceptions that are let out of this checkpoint are assumed fatal and terminate further checkpoint processing.

receive

public void receive(CandidateURI item)
Specified by:
receive in interface UriUniqFilter.HasUriReceiver
Parameters:
item - Candidate URI item that is 'visiting'.

getGroup

public Frontier.FrontierGroup getGroup(CrawlURI curi)
Description copied from interface: Frontier
Get the 'frontier group' (usually queue) for the given CrawlURI.

Specified by:
getGroup in interface Frontier
Parameters:
curi - CrawlURI to find matching group
Returns:
FrontierGroup for the CrawlURI

averageDepth

public long averageDepth()
Specified by:
averageDepth in interface Frontier

congestionRatio

public float congestionRatio()
Specified by:
congestionRatio in interface Frontier

deepestUri

public long deepestUri()
Specified by:
deepestUri in interface Frontier


Copyright © 2003-2011 Internet Archive. All Rights Reserved.