org.archive.crawler.framework
Interface Frontier

All Superinterfaces:
Reporter
All Known Implementing Classes:
AbstractFrontier, AdaptiveRevisitFrontier, BdbFrontier, DomainSensitiveFrontier, WorkQueueFrontier

public interface Frontier
extends Reporter

An interface for URI Frontiers.

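A URI Frontier is a pluggable module in Heritrix that maintains the internal state of the crawl. This includes (but is not limited to) which URIs have been discovered, which are being processed (fetched), which have been processed, and in what order unprocessed URIs will be processed.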

The Frontier is also responsible for enforcing any politeness restrictions that may have been applied to the crawl, such as limiting simultaneous connections to the same host, server or IP number to 1 (or any other fixed amount), or imposing delays between connections.

A URIFrontier is created by the CrawlController, which is in turn responsible for providing access to it. Most significant among the modules interested in the Frontier are the ToeThreads, which perform the actual work of processing a URI.

The methods defined in this interface are those required to get URIs for processing, to report the results of processing back (both used by ToeThreads), and to get access to various statistical data along the way. The statistical data is of interest to Statistics Tracking modules. A couple of additional methods are provided for inspecting and manipulating the Frontier at runtime.

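The statistical data exposed by this interface is: discovered URIs (discoveredUriCount()), queued URIs (queuedUriCount()), finished URIs (finishedUriCount()), successfully processed URIs (succeededFetchCount()), URIs that failed to process (failedFetchCount()), disregarded URIs (disregardedUriCount()) and total bytes written (totalBytesWritten(), deprecated).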

In addition, the frontier may optionally implement an interface that exposes information about hosts (see FrontierHostStatistics).

Furthermore, any implementation of the URI Frontier should trigger CrawlURIDispositionEvents by invoking the proper methods on the CrawlController. Doing this allows a custom-built Statistics Tracking module to gather any additional data it might be interested in by examining the completed URIs.

All URI Frontiers inherit from ModuleType, so creating their settings follows the usual pattern of pluggable modules in Heritrix.

Author:
Gordon Mohr, Kristinn Sigurdsson
See Also:
CrawlController, CrawlController.fireCrawledURIDisregardEvent(CrawlURI), CrawlController.fireCrawledURIFailureEvent(CrawlURI), CrawlController.fireCrawledURINeedRetryEvent(CrawlURI), CrawlController.fireCrawledURISuccessfulEvent(CrawlURI), StatisticsTracking, ToeThread, FrontierHostStatistics, ModuleType

Nested Class Summary
static interface Frontier.FrontierGroup
          Generic interface representing the internal groupings of a Frontier's URIs -- usually queues.
 
Field Summary
static java.lang.String ATTR_NAME
          All URI Frontiers should have the same 'name' attribute.
 
Method Summary
 long averageDepth()
           
 float congestionRatio()
           
 void considerIncluded(UURI u)
          Notify Frontier that it should consider the given UURI as if already scheduled.
 long deepestUri()
           
 void deleted(CrawlURI curi)
          Notify Frontier that a CrawlURI has been deleted outside of the normal next()/finished() lifecycle.
 long deleteURIs(java.lang.String match)
          Delete any URI that matches the given regular expression from the list of discovered and pending URIs.
 long deleteURIs(java.lang.String uriMatch, java.lang.String queueMatch)
          Delete any URI that matches the given regular expression from the list of discovered and pending URIs, if it is in a queue with a name matching the second regular expression.
 long discoveredUriCount()
          Number of discovered URIs.
 long disregardedUriCount()
          Number of URIs that were scheduled at one point but have been disregarded.
 long failedFetchCount()
          Number of URIs that failed to process.
 void finalTasks()
          Perform any final tasks *before* notification that the crawl has reached 'FINISHED' status.
 void finished(CrawlURI cURI)
          Report a URI being processed as having finished processing.
 long finishedUriCount()
          Number of URIs that have finished processing.
 java.lang.String getClassKey(CandidateURI cauri)
           
 FrontierJournal getFrontierJournal()
           
 Frontier.FrontierGroup getGroup(CrawlURI curi)
          Get the 'frontier group' (usually queue) for the given CrawlURI.
 FrontierMarker getInitialMarker(java.lang.String regexpr, boolean inCacheOnly)
          Get a URIFrontierMarker initialized with the given regular expression at the 'start' of the Frontier.
 java.util.ArrayList<java.lang.String> getURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose)
          Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached.
 void importRecoverLog(java.lang.String pathToLog, boolean retainFailures)
          Recover earlier state by reading a recovery log.
 void initialize(CrawlController c)
          Initialize the Frontier.
 boolean isEmpty()
          Returns true if the frontier contains no more URIs to crawl.
 void kickUpdate()
          Notify Frontier that it should consider updating configuration info that may have changed in external files.
 void loadSeeds()
          Request that the Frontier load (or reload) crawl seeds, typically by contacting the Scope.
 CrawlURI next()
          Get the next URI that should be processed.
 void pause()
          Notify Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise.
 long queuedUriCount()
          Number of URIs queued up and waiting for processing.
 void schedule(CandidateURI caURI)
          Schedules a CandidateURI.
 void start()
          Request that Frontier allow crawling to begin.
 long succeededFetchCount()
          Number of successfully processed URIs.
 void terminate()
          Notify Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException.
 long totalBytesWritten()
          Deprecated. misnomer; consult StatisticsTracker instead
 void unpause()
          Resumes the release of URIs to crawl, allowing worker ToeThreads to proceed.
 
Methods inherited from interface org.archive.util.Reporter
getReports, reportTo, reportTo, singleLineLegend, singleLineReport, singleLineReportTo
 

Field Detail

ATTR_NAME

static final java.lang.String ATTR_NAME
All URI Frontiers should have the same 'name' attribute. This constant defines that name. This is a name used to reference the Frontier being used in a given crawl order and since there can only be one Frontier per crawl order a fixed, unique name for Frontiers is optimal.

See Also:
ModuleType.ModuleType(String), Constant Field Values
Method Detail

initialize

void initialize(CrawlController c)
                throws FatalConfigurationException,
                       java.io.IOException
Initialize the Frontier.

This method is invoked by the CrawlController once it has created the Frontier. The constructor of the Frontier should only contain code for setting up its settings framework. This method should contain all other 'startup' code.

Parameters:
c - The CrawlController that created the Frontier.
Throws:
FatalConfigurationException - If provided settings are illegal or otherwise unusable.
java.io.IOException - If there is a problem reading settings or seeds file from disk.

next

CrawlURI next()
              throws java.lang.InterruptedException,
                     EndedException
Get the next URI that should be processed. If no URI becomes available during the time specified, null will be returned.

Returns:
the next URI that should be processed.
Throws:
java.lang.InterruptedException
EndedException - If the crawl has been ended (see terminate()).
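
The contract above (a null return when nothing becomes available, an EndedException once the crawl has ended) suggests a worker loop along the following lines. This is only a sketch: MiniFrontier, ToyFrontier and the use of plain strings for URIs are hypothetical stand-ins, not the Heritrix types, and a real ToeThread does considerably more.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-ins for the Heritrix types; only the shape of the
// next()/finished() contract is taken from the documentation above.
class WorkerLoopSketch {
    static class EndedException extends Exception {}

    interface MiniFrontier {
        String next() throws InterruptedException, EndedException; // may return null
        void finished(String uri);
    }

    // Toy frontier: hands out queued URIs, then returns null once to
    // simulate "nothing available yet", then signals the end of the crawl.
    static class ToyFrontier implements MiniFrontier {
        private final Queue<String> pending = new ArrayDeque<>();
        private boolean returnedNull = false;
        final List<String> done = new ArrayList<>();

        ToyFrontier(List<String> uris) { pending.addAll(uris); }

        public synchronized String next() throws EndedException {
            if (!pending.isEmpty()) return pending.poll();
            if (!returnedNull) { returnedNull = true; return null; }
            throw new EndedException();
        }

        public synchronized void finished(String uri) { done.add(uri); }
    }

    // The loop a worker thread might run: get a URI, process it, report back.
    // Returns the number of URIs processed before the crawl ended.
    static int runWorker(MiniFrontier frontier) {
        int processed = 0;
        try {
            while (true) {
                String uri = frontier.next();
                if (uri == null) continue;   // nothing available in time; ask again
                // ... processing of the URI would happen here ...
                frontier.finished(uri);      // report the result back
                processed++;
            }
        } catch (EndedException e) {
            return processed;                // crawl ended; the worker exits
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return processed;
        }
    }
}
```

The point of the sketch is that a null return and an EndedException mean different things: the first asks the worker to try again, the second tells it to stop.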

isEmpty

boolean isEmpty()
Returns true if the frontier contains no more URIs to crawl.

That is to say that there are no URIs currently available (ready to be emitted), no URIs belonging to deferred hosts, and no pending URIs in the Frontier. Thus this method may return false even if there is no currently available URI.

Returns:
true if the frontier contains no more URIs to crawl.
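
The difference between "nothing available right now" and "nothing left at all" can be illustrated with a toy model. The two-queue layout and the class name are assumptions made for this sketch; real frontiers organize ready and deferred URIs in their own ways.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical toy model, not a Heritrix class.
class IsEmptySketch {
    private final Queue<String> ready = new ArrayDeque<>();
    // URIs held back for now, e.g. because of a politeness delay on their host.
    private final Queue<String> deferred = new ArrayDeque<>();

    void addReady(String uri) { ready.add(uri); }
    void addDeferred(String uri) { deferred.add(uri); }

    // Returns null when no URI is ready to be emitted right now.
    String next() { return ready.poll(); }

    // Empty only when neither ready nor deferred URIs remain.
    boolean isEmpty() { return ready.isEmpty() && deferred.isEmpty(); }
}
```

With one deferred URI and no ready ones, next() yields null while isEmpty() still reports false.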

schedule

void schedule(CandidateURI caURI)
Schedules a CandidateURI.

This method accepts one URI and schedules it immediately. This has nothing to do with the priority of the URI being scheduled; it only means that the URI will be placed in its respective queue at once. For priority scheduling see CandidateURI.setSchedulingDirective(int).

This method should be synchronized in all implementing classes.

Parameters:
caURI - The URI to schedule.
See Also:
CandidateURI.setSchedulingDirective(int)
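
As an illustration of why implementations are asked to synchronize this method, the sketch below schedules URIs from several threads into a queue that is not thread-safe on its own. The class, the single-queue layout and the scheduleConcurrently helper are hypothetical; they only demonstrate the locking, not how any Heritrix frontier stores its URIs.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a frontier queue; not a Heritrix class.
class SynchronizedScheduleSketch {
    private final Deque<String> queue = new ArrayDeque<>();

    // synchronized: ArrayDeque is not thread-safe, and several worker
    // threads may schedule newly discovered links at the same time.
    synchronized void schedule(String uri) {
        queue.addLast(uri);   // placed in the queue at once
    }

    synchronized int queuedCount() { return queue.size(); }

    // Schedules perThread URIs from each of `threads` threads and
    // returns the resulting queue size.
    static int scheduleConcurrently(int threads, int perThread) {
        SynchronizedScheduleSketch frontier = new SynchronizedScheduleSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    frontier.schedule("http://example.org/" + id + "/" + i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return frontier.queuedCount();
    }
}
```

Without the synchronized keyword, concurrent addLast calls could corrupt the deque or lose entries.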

finished

void finished(CrawlURI cURI)
Report a URI being processed as having finished processing.

ToeThreads will invoke this method once they have completed work on their assigned URI.

This method is synchronized.

Parameters:
cURI - The URI that has finished processing.

discoveredUriCount

long discoveredUriCount()
Number of discovered URIs.

That is, any URI that has been confirmed to be within 'scope' (i.e. the Frontier decides that it should be processed). This includes those that have been processed, are being processed and have finished processing. Does not include URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely due to operator changing scope definition).

Note: This only counts discovered URIs. Since the same URI can (at least in most frontiers) be fetched multiple times, this number may be somewhat lower than the combined total of queued, in-process and finished items, due to duplicate URIs being queued and processed. This variance is likely to be especially high in Frontiers implementing 'revisit' strategies.

Returns:
Number of discovered URIs.

queuedUriCount

long queuedUriCount()
Number of URIs queued up and waiting for processing.

This includes any URIs that failed but will be retried. Basically this is any discovered URI that has neither been processed nor is currently being processed. The same discovered URI can be queued multiple times.

Returns:
Number of queued URIs.

deepestUri

long deepestUri()

averageDepth

long averageDepth()

congestionRatio

float congestionRatio()

finishedUriCount

long finishedUriCount()
Number of URIs that have finished processing.

Includes both those that were processed successfully and those that failed to be processed (excluding those that failed but will be retried). Does not include those URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely due to operator changing scope definition).

Returns:
Number of finished URIs.

succeededFetchCount

long succeededFetchCount()
Number of successfully processed URIs.

Any URI that was processed successfully. This includes URIs that returned 404s and other error codes that do not originate within the crawler.

Returns:
Number of successfully processed URIs.

failedFetchCount

long failedFetchCount()
Number of URIs that failed to process.

URIs that could not be processed because of some error or failure in the processing chain. Can include failure to acquire prerequisites, failure to establish a connection with the host, and any number of other problems. Does not count those that will be retried, only those that have permanently failed.

Returns:
Number of URIs that failed to process.

disregardedUriCount

long disregardedUriCount()
Number of URIs that were scheduled at one point but have been disregarded.

Counts any URI that is scheduled only to be disregarded because it is determined to lie outside the scope of the crawl. Most commonly this will be due to robots.txt exclusions.

Returns:
The number of URIs that have been disregarded.
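
A statistics module might combine the counters described above into a rough progress figure. The sketch below is one possible way to do that, not something the interface prescribes: the Counters interface is a hypothetical subset of Frontier, and fractionDone and fixed are helpers invented for the example.

```java
// Hypothetical subset of the Frontier interface: just two of the counters.
class ProgressSketch {
    interface Counters {
        long queuedUriCount();
        long finishedUriCount();
    }

    // Fraction of known work that is complete, in the range 0.0 to 1.0.
    // Returns 0.0 when nothing has been queued or finished yet.
    static double fractionDone(Counters c) {
        long finished = c.finishedUriCount();
        long total = finished + c.queuedUriCount();
        return total == 0 ? 0.0 : (double) finished / total;
    }

    // Fixed-value counters, useful for trying the calculation out.
    static Counters fixed(long queued, long finished) {
        return new Counters() {
            public long queuedUriCount() { return queued; }
            public long finishedUriCount() { return finished; }
        };
    }
}
```

URIs that are currently being processed appear in neither count, and the queue keeps growing as new links are discovered, so the figure is only an estimate.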

totalBytesWritten

long totalBytesWritten()
Deprecated. misnomer; consult StatisticsTracker instead

Total number of bytes contained in all URIs that have been processed.

Returns:
The total number of bytes in all processed URIs.

importRecoverLog

void importRecoverLog(java.lang.String pathToLog,
                      boolean retainFailures)
                      throws java.io.IOException
Recover earlier state by reading a recovery log.

Some Frontiers are able to write detailed logs that can be loaded after a system crash to recover the state of the Frontier prior to the crash. This method is the one used to achieve this.

Parameters:
pathToLog - The name (with full path) of the recover log.
retainFailures - If true, failures in log should count as having been included. (If false, failures will be ignored, meaning the corresponding URIs will be retried in the recovered crawl.)
Throws:
java.io.IOException - If problems occur reading the recover log.
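
The effect of retainFailures can be sketched with a toy replay routine. Everything concrete here is an assumption made for illustration: the log format ("S <uri>" for a success, "F <uri>" for a failure) is invented and is not the format Heritrix writes, and the routine takes the lines directly rather than a path so that the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical replay of a recovery log; not the Heritrix implementation.
class RecoverLogSketch {
    final Set<String> alreadyIncluded = new LinkedHashSet<>();
    final List<String> toRetry = new ArrayList<>();

    // Invented log format for this sketch: one entry per line,
    // "S <uri>" for a success and "F <uri>" for a failure.
    void replay(List<String> logLines, boolean retainFailures) {
        for (String line : logLines) {
            if (line.length() < 3 || line.charAt(1) != ' ') continue; // skip malformed lines
            char tag = line.charAt(0);
            String uri = line.substring(2);
            if (tag == 'S') {
                alreadyIncluded.add(uri);      // never fetched again
            } else if (tag == 'F') {
                if (retainFailures) {
                    alreadyIncluded.add(uri);  // failure counts as included
                } else {
                    toRetry.add(uri);          // failure ignored, URI retried
                }
            }
        }
    }
}
```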

getInitialMarker

FrontierMarker getInitialMarker(java.lang.String regexpr,
                                boolean inCacheOnly)
Get a URIFrontierMarker initialized with the given regular expression at the 'start' of the Frontier.

Parameters:
regexpr - The regular expression that URIs within the frontier must match to be considered within the scope of this marker
inCacheOnly - If set to true, only those URIs within the frontier that are stored in cache (usually this means in memory rather than on disk, but that is an implementation detail) will be considered. Others will be entirely ignored, as if they do not exist. This is useful for quick peeks at the top of the URI list.
Returns:
A URIFrontierMarker that is set for the 'start' of the frontier's URI list.

getURIsList

java.util.ArrayList<java.lang.String> getURIsList(FrontierMarker marker,
                                                  int numberOfMatches,
                                                  boolean verbose)
                                                  throws InvalidFrontierMarkerException
Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached.

Any encountered URI that has not been successfully crawled, terminally failed, disregarded or is currently being processed is included. As there may be duplicates in the frontier, there may also be duplicates in the report. Thus this includes both discovered and pending URIs.

The list contains the URIs as strings. If verbose is true, each string will include some additional information (path to URI and parent).

The URIFrontierMarker will be advanced to the position at which its maximum number of matches is reached. Reusing it for subsequent calls will thus effectively get the 'next' batch. Making any changes to the frontier can invalidate the marker.

While the order returned is consistent, it does not have any explicit relation to the likely order in which they may be processed.

Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.

Parameters:
marker - A marker specifying from what position in the Frontier the list should begin.
numberOfMatches - how many URIs to add at most to the list before returning it
verbose - if set to true the strings returned will contain additional information about each URI beyond their names.
Returns:
a list of all pending URIs falling within the specification of the marker
Throws:
InvalidFrontierMarkerException - when the URIFrontierMarker does not match the internal state of the frontier. Tolerance for this can vary considerably from one URIFrontier implementation to the next.
See Also:
FrontierMarker, getInitialMarker(String, boolean)
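
The batch-by-batch use of a marker described above can be sketched as follows. The Marker class, the flat list of pending URIs and the allBatches helper are hypothetical; the verbose and inCacheOnly options and InvalidFrontierMarkerException are left out, and whether a real frontier requires a full match or a partial one is an assumption here (the sketch uses a full match).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model of marker-based paging; not the Heritrix classes.
class MarkerPagingSketch {
    // A marker remembers the pattern and how far the scan has advanced.
    static class Marker {
        final Pattern pattern;
        int position = 0;
        Marker(String regex) { pattern = Pattern.compile(regex); }
    }

    private final List<String> pending;

    MarkerPagingSketch(List<String> pending) { this.pending = new ArrayList<>(pending); }

    Marker getInitialMarker(String regex) { return new Marker(regex); }

    // Collects up to numberOfMatches matching URIs and advances the marker.
    ArrayList<String> getURIsList(Marker marker, int numberOfMatches) {
        ArrayList<String> out = new ArrayList<>();
        while (marker.position < pending.size() && out.size() < numberOfMatches) {
            String uri = pending.get(marker.position++);
            if (marker.pattern.matcher(uri).matches()) out.add(uri);
        }
        return out;
    }

    // Caller side: reuse the same marker until a batch comes back short.
    static List<List<String>> allBatches(MarkerPagingSketch f, String regex, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        if (batchSize <= 0) return batches;
        Marker m = f.getInitialMarker(regex);
        List<String> batch;
        do {
            batch = f.getURIsList(m, batchSize);
            if (!batch.isEmpty()) batches.add(batch);
        } while (batch.size() == batchSize);
        return batches;
    }
}
```

In a real crawl the frontier should be paused while the batches are read, since changes to it can invalidate the marker.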

deleteURIs

long deleteURIs(java.lang.String match)
Delete any URI that matches the given regular expression from the list of discovered and pending URIs. This does not prevent them from being rediscovered.

Any encountered URI that has not been successfully crawled, terminally failed, disregarded or is currently being processed is considered to be a pending URI.

Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.

Parameters:
match - A regular expression; any URI that matches it will be deleted.
Returns:
The number of URIs deleted

deleteURIs

long deleteURIs(java.lang.String uriMatch,
                java.lang.String queueMatch)
Delete any URI that matches the given regular expression from the list of discovered and pending URIs, if it is in a queue with a name matching the second regular expression. This does not prevent them from being rediscovered.

Any encountered URI that has not been successfully crawled, terminally failed, disregarded or is currently being processed is considered to be a pending URI.

Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.

Parameters:
uriMatch - A regular expression; any URI that matches it will be deleted from the affected queues.
queueMatch - A regular expression; any queue whose name matches it will have its URIs checked. A null value means all queues.
Returns:
The number of URIs deleted
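
The interplay of the two patterns, including the null case for queueMatch, might look like the sketch below. The map of named queues is a hypothetical layout chosen for the example, and full-string matching is an assumption; an actual frontier may store its queues and apply the expressions differently.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical model of regex-based deletion; not a Heritrix class.
class DeleteUrisSketch {
    // Assumed layout: queue name -> pending URIs in that queue.
    final Map<String, Deque<String>> queues = new LinkedHashMap<>();

    void add(String queueName, String uri) {
        queues.computeIfAbsent(queueName, k -> new ArrayDeque<>()).add(uri);
    }

    // Deletes matching URIs from queues whose name matches queueMatch.
    // A null queueMatch means every queue is checked.
    long deleteURIs(String uriMatch, String queueMatch) {
        Pattern uriPattern = Pattern.compile(uriMatch);
        Pattern queuePattern = queueMatch == null ? null : Pattern.compile(queueMatch);
        long deleted = 0;
        for (Map.Entry<String, Deque<String>> e : queues.entrySet()) {
            if (queuePattern != null && !queuePattern.matcher(e.getKey()).matches()) {
                continue; // queue name does not match; leave it untouched
            }
            int before = e.getValue().size();
            e.getValue().removeIf(uri -> uriPattern.matcher(uri).matches());
            deleted += before - e.getValue().size();
        }
        return deleted;
    }

    // Single-argument form: same as passing null for the queue pattern.
    long deleteURIs(String match) { return deleteURIs(match, null); }
}
```

Deleted URIs are only removed from the pending queues; nothing in the sketch stops them from being scheduled again later, which mirrors the note above about rediscovery.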

deleted

void deleted(CrawlURI curi)
Notify Frontier that a CrawlURI has been deleted outside of the normal next()/finished() lifecycle.

Parameters:
curi - Deleted CrawlURI.

considerIncluded

void considerIncluded(UURI u)
Notify Frontier that it should consider the given UURI as if already scheduled.

Parameters:
u - UURI instance to add to the Already Included set.
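
One way to picture the Already Included set is as a filter consulted when URIs are scheduled. That filtering behaviour is an assumption of this sketch, as are the class name and the use of plain strings in place of UURI; the interface itself only says that the URI should be treated as if already scheduled.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of an already-included set; not a Heritrix class.
class AlreadyIncludedSketch {
    private final Set<String> alreadyIncluded = new HashSet<>();
    private final Deque<String> pending = new ArrayDeque<>();

    // Mark the URI as seen without ever queueing it.
    synchronized void considerIncluded(String uri) {
        alreadyIncluded.add(uri);
    }

    // Assumption of this sketch: schedule() drops URIs already in the set.
    synchronized void schedule(String uri) {
        if (alreadyIncluded.add(uri)) {
            pending.addLast(uri);
        }
    }

    synchronized int queuedCount() { return pending.size(); }
}
```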

kickUpdate

void kickUpdate()
Notify Frontier that it should consider updating configuration info that may have changed in external files.


pause

void pause()
Notify Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise.


unpause

void unpause()
Resumes the release of URIs to crawl, allowing worker ToeThreads to proceed.


terminate

void terminate()
Notify Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException.
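
The way pause(), unpause() and terminate() interact with next() can be sketched with a simple monitor. This is an illustration under stated assumptions, not how any particular frontier does it: the timeout parameter on next(), the poll helper and the status strings are all invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical pause/terminate gate; not the Heritrix implementation.
class PauseGateSketch {
    static class EndedException extends Exception {}

    private final Queue<String> pending = new ArrayDeque<>();
    private boolean paused = false;
    private boolean terminated = false;

    synchronized void schedule(String uri) { pending.add(uri); notifyAll(); }
    synchronized void pause() { paused = true; }
    synchronized void unpause() { paused = false; notifyAll(); }
    synchronized void terminate() { terminated = true; notifyAll(); }

    // Waits up to timeoutMillis for a URI to be released. Returns null if
    // none is released in time; throws once the crawl has been terminated.
    synchronized String next(long timeoutMillis) throws InterruptedException, EndedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (terminated) throw new EndedException();
            if (!paused && !pending.isEmpty()) return pending.poll();
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) return null;
            wait(remaining); // woken early by schedule(), unpause() or terminate()
        }
    }

    // Convenience wrapper that reports the outcome as a string.
    static String poll(PauseGateSketch f, long timeoutMillis) {
        try {
            String uri = f.next(timeoutMillis);
            return uri == null ? "<none>" : uri;
        } catch (EndedException e) {
            return "<ended>";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "<interrupted>";
        }
    }
}
```

While paused, a queued URI is held back even though it is otherwise ready; after terminate(), every call ends with the exception regardless of what is still queued.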


getFrontierJournal

FrontierJournal getFrontierJournal()
Returns:
The instance of FrontierJournal that this Frontier is using. May be null if there is no journaling.

getClassKey

java.lang.String getClassKey(CandidateURI cauri)
Parameters:
cauri - CandidateURI for which we're to calculate and set class key.
Returns:
Classkey for cauri.
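
How the class key is calculated is left to the implementation. As a purely illustrative assumption, the sketch below derives it from the lowercased host name, which would group URIs per host; the class name, the "unknown" fallback key and the use of java.net.URI in place of CandidateURI are all choices made for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical class-key calculation; real frontiers may key differently.
class ClassKeySketch {
    // Assumption: the key is the lowercased host name, so that all URIs
    // on the same host share one key. URIs without a usable host get a
    // fallback key.
    static String classKey(String uri) {
        try {
            String host = new URI(uri).getHost();
            return host == null ? "unknown" : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return "unknown";
        }
    }
}
```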

loadSeeds

void loadSeeds()
Request that the Frontier load (or reload) crawl seeds, typically by contacting the Scope.


start

void start()
Request that Frontier allow crawling to begin. Usually just unpauses Frontier, if paused.


getGroup

Frontier.FrontierGroup getGroup(CrawlURI curi)
Get the 'frontier group' (usually queue) for the given CrawlURI.

Parameters:
curi - CrawlURI to find matching group
Returns:
FrontierGroup for the CrawlURI

finalTasks

void finalTasks()
Perform any final tasks *before* notification that the crawl has reached 'FINISHED' status. (For example, anything that needs to dump final data to disk/logs.)



Copyright © 2003-2011 Internet Archive. All Rights Reserved.