org.archive.crawler.framework
Interface URIFrontier

All Known Implementing Classes:
Frontier

public interface URIFrontier

An interface for URI Frontiers.

A URI Frontier is a pluggable module in Heritrix that maintains the internal state of the crawl. This includes (but is not limited to):

The Frontier is also responsible for enforcing any politeness restrictions that may have been applied to the crawl, such as limiting simultaneous connections to the same host, server or IP number to 1 (or any other fixed amount), delays between connections, etc.

A URIFrontier is created by the CrawlController, which is in turn responsible for providing access to it. Most significant among the modules interested in the Frontier are the ToeThreads, which perform the actual work of processing a URI.

The methods defined in this interface are those required to get URIs for processing, report the results of processing back (ToeThreads) and to get access to various statistical data along the way. The statistical data is of interest to Statistics Tracking modules. A couple of additional methods are provided to be able to inspect and manipulate the Frontier at runtime.

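The statistical data exposed by this interface is: the number of discovered, queued, finished, pending, successfully fetched, failed and disregarded URIs, and the total number of bytes written.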

In addition the frontier may optionally implement an interface that exposes information about hosts.

Furthermore, any implementation of the URI Frontier should trigger CrawlURIDispositionEvents by invoking the proper methods on the CrawlController. Doing this allows a custom-built Statistics Tracking module to gather any additional data it might be interested in by examining the completed URIs.

All URI Frontiers inherit from ModuleType and therefore creating settings follows the usual pattern of pluggable modules in Heritrix.

Author:
Gordon Mohr, Kristinn Sigurdsson
See Also:
CrawlController, CrawlController.fireCrawledURIDisregardEvent(CrawlURI), CrawlController.fireCrawledURIFailureEvent(CrawlURI), CrawlController.fireCrawledURINeedRetryEvent(CrawlURI), CrawlController.fireCrawledURISuccessfulEvent(CrawlURI), StatisticsTracking, ToeThread, URIFrontierHostStatistics, ModuleType

Field Summary
static java.lang.String ATTR_NAME
          All URI Frontiers should have the same 'name' attribute.
 
Method Summary
 void batchFlush()
          Forces all the URIs that have been batched up for scheduling by the batchSchedule() method to be actually scheduled.
 void batchSchedule(CandidateURI caURI)
          Schedules a CandidateURI.
 void considerIncluded(UURI u)
          Notify Frontier that it should consider the given UURI as if already scheduled.
 void deleted(CrawlURI curi)
          Notify Frontier that a CrawlURI has been deleted outside of the normal next()/finished() lifecycle.
 long deleteURIs(java.lang.String match)
          Delete any URI that matches the given regular expression from the list of discovered and pending URIs.
 long discoveredUriCount()
          Number of discovered URIs.
 long disregardedFetchCount()
          Number of URIs that were successfully fetched but have been disregarded.
 long failedFetchCount()
          Number of URIs that failed to process.
 void finished(CrawlURI cURI)
          Report a URI being processed as having finished processing.
 long finishedUriCount()
          Number of URIs that have finished processing.
 FrontierJournal getFrontierJournal()
           
 URIFrontierMarker getInitialMarker(java.lang.String regexpr, boolean inCacheOnly)
          Get a URIFrontierMarker initialized with the given regular expression at the 'start' of the Frontier.
 java.util.ArrayList getURIsList(URIFrontierMarker marker, int numberOfMatches, boolean verbose)
          Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached.
 void importRecoverLog(java.lang.String pathToLog)
          Recover earlier state by reading a recovery log.
 void initialize(CrawlController c)
          Initialize the Frontier.
 boolean isEmpty()
          Returns true if the frontier contains no more URIs to crawl.
 void kickUpdate()
          Notify Frontier that it should consider updating configuration info that may have changed in external files.
 CrawlURI next(int timeout)
          Get the next URI that should be processed.
 java.lang.String oneLineReport()
          Compile a one-line summary report about this frontier.
 long pendingUriCount()
          Number of URIs that are awaiting detailed processing.
 long queuedUriCount()
          Number of URIs queued up and waiting for processing.
 java.lang.String report()
          This method compiles a human readable report on the status of the frontier at the time of the call.
 void schedule(CandidateURI caURI)
          Schedules a CandidateURI.
 long successfullyFetchedCount()
          Number of successfully processed URIs.
 long totalBytesWritten()
          Total number of bytes contained in all URIs that have been processed.
 

Field Detail

ATTR_NAME

public static final java.lang.String ATTR_NAME
All URI Frontiers should have the same 'name' attribute. This constant defines that name. This is a name used to reference the Frontier being used in a given crawl order and since there can only be one Frontier per crawl order a fixed, unique name for Frontiers is optimal.

See Also:
ModuleType.ModuleType(String), Constant Field Values
Method Detail

initialize

public void initialize(CrawlController c)
                throws FatalConfigurationException,
                       java.io.IOException
Initialize the Frontier.

This method is invoked by the CrawlController once it has created the Frontier. The constructor of the Frontier should only contain code for setting up its settings framework. This method should contain all other 'startup' code.

Parameters:
c - The CrawlController that created the Frontier.
Throws:
FatalConfigurationException - If provided settings are illegal or otherwise unusable.
java.io.IOException - If there is a problem reading settings or seeds file from disk.

next

public CrawlURI next(int timeout)
              throws java.lang.InterruptedException
Get the next URI that should be processed. If no URI becomes available during the time specified, null will be returned.

Parameters:
timeout - how long the calling thread is willing to wait for the next URI to become available (milliseconds).
Returns:
the next URI that should be processed.
Throws:
java.lang.InterruptedException
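
The timeout contract described above can be illustrated with a minimal sketch. The class below is a hypothetical stand-in, not Heritrix code: it uses String in place of CrawlURI and assumes a java.util.concurrent.LinkedBlockingQueue as the backing store, which this documentation does not prescribe.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a Frontier; Strings stand in for CrawlURI objects.
class TimedNextSketch {
    // Assumption: a single queue of ready URIs (real frontiers are more elaborate).
    private final LinkedBlockingQueue<String> ready = new LinkedBlockingQueue<>();

    // Make a URI available to callers of next().
    void schedule(String uri) {
        ready.add(uri);
    }

    // Waits up to timeoutMillis for a URI; returns null if none becomes available.
    String next(int timeoutMillis) throws InterruptedException {
        return ready.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        TimedNextSketch frontier = new TimedNextSketch();
        frontier.schedule("http://example.com/");
        System.out.println(frontier.next(100)); // a URI was ready
        System.out.println(frontier.next(100)); // nothing left, so null after the timeout
    }
}
```

A caller that receives null should treat it as "nothing available yet" rather than "crawl finished"; isEmpty() is the method that answers the latter question.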

isEmpty

public boolean isEmpty()
Returns true if the frontier contains no more URIs to crawl.

That is, there are no URIs currently available (ready to be emitted), no URIs belonging to deferred hosts and no pending URIs in the Frontier. Thus this method may return false even if there is no currently available URI.

Returns:
true if the frontier contains no more URIs to crawl.

schedule

public void schedule(CandidateURI caURI)
Schedules a CandidateURI.

This method accepts one URI and schedules it immediately. This has nothing to do with the priority of the URI being scheduled, only that it will be placed in its respective queue at once. For priority scheduling see CandidateURI.

This method should be synchronized in all implementing classes.

Parameters:
caURI - The URI to schedule.
See Also:
batchSchedule(CandidateURI), CandidateURI.setSchedulingDirective(String)

batchSchedule

public void batchSchedule(CandidateURI caURI)
Schedules a CandidateURI.

This is a non-synchronized method for scheduling large numbers of URIs at a time. All URIs scheduled with this method will be 'held' in a thread specific container until batchFlush() is invoked.

Parameters:
caURI - The URI to schedule.
See Also:
schedule(CandidateURI), batchFlush()

batchFlush

public void batchFlush()
Forces all the URIs that have been batched up for scheduling by the batchSchedule() method to be actually scheduled.

This is a synchronized method.
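
The relationship between schedule(), batchSchedule() and batchFlush() might be sketched as follows. This is a hypothetical illustration under stated assumptions, not the Heritrix implementation: Strings stand in for CandidateURI, a ThreadLocal list plays the role of the 'thread specific container', and scheduledCount() is a helper invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not Heritrix code; Strings stand in for CandidateURI.
class BatchScheduleSketch {
    // URIs that have actually been scheduled; guarded by this object's monitor.
    private final List<String> scheduled = new ArrayList<>();

    // Each thread accumulates its own batch, so batchSchedule() needs no locking.
    private final ThreadLocal<List<String>> batch =
            ThreadLocal.withInitial(ArrayList::new);

    // Schedules one URI at once, taking the lock for every call.
    synchronized void schedule(String uri) {
        scheduled.add(uri);
    }

    // Holds the URI in the calling thread's batch; nothing is scheduled yet.
    void batchSchedule(String uri) {
        batch.get().add(uri);
    }

    // Moves the calling thread's batch into the scheduled list under one lock.
    synchronized void batchFlush() {
        List<String> held = batch.get();
        scheduled.addAll(held);
        held.clear();
    }

    // Hypothetical helper for the example only.
    synchronized int scheduledCount() {
        return scheduled.size();
    }

    public static void main(String[] args) {
        BatchScheduleSketch frontier = new BatchScheduleSketch();
        frontier.batchSchedule("http://example.com/a");
        frontier.batchSchedule("http://example.com/b");
        System.out.println(frontier.scheduledCount()); // still held in the batch
        frontier.batchFlush();
        System.out.println(frontier.scheduledCount()); // batch has been scheduled
    }
}
```

The design point is that a thread discovering many links pays for synchronization once per flush instead of once per URI.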


finished

public void finished(CrawlURI cURI)
Report a URI being processed as having finished processing.

ToeThreads will invoke this method once they have completed work on their assigned URI.

This method is synchronized and also schedules any URIs that have been batched up by batchSchedule().

Parameters:
cURI - The URI that has finished processing.
See Also:
batchFlush()
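
The next()/finished() lifecycle that ToeThreads follow could look roughly like the loop below. Everything here is a simplified, hypothetical sketch: TinyFrontier is not a Heritrix class, Strings stand in for CrawlURI, and next() returns immediately instead of honouring its timeout.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class ToeLoopSketch {
    // Hypothetical, simplified frontier; Strings stand in for CrawlURI.
    static class TinyFrontier {
        private final Deque<String> queued = new ArrayDeque<>();
        private final List<String> finished = new ArrayList<>();

        synchronized void schedule(String uri) { queued.add(uri); }

        // Simplification: returns at once (possibly null) rather than waiting.
        synchronized String next(int timeoutMillis) { return queued.poll(); }

        synchronized void finished(String uri) { finished.add(uri); }

        synchronized boolean isEmpty() { return queued.isEmpty(); }

        synchronized long finishedUriCount() { return finished.size(); }
    }

    // One worker's loop: take a URI, process it, report it back.
    static void runWorker(TinyFrontier frontier) {
        while (!frontier.isEmpty()) {
            String uri = frontier.next(1000);
            if (uri == null) {
                continue; // nothing ready yet; ask again
            }
            // ... fetching and processing of the URI would happen here ...
            frontier.finished(uri);
        }
    }

    public static void main(String[] args) {
        TinyFrontier frontier = new TinyFrontier();
        frontier.schedule("http://example.com/");
        frontier.schedule("http://example.com/about");
        runWorker(frontier);
        System.out.println(frontier.finishedUriCount());
    }
}
```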

discoveredUriCount

public long discoveredUriCount()
Number of discovered URIs.

That is any URI that has been confirmed to be within 'scope' (i.e. the Frontier decides that it should be processed). This includes those that have been processed, are being processed and have finished processing. Does not include URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely due to operator changing scope definition).

Note: This only counts discovered URIs. Since the same URI can (at least in most frontiers) be fetched multiple times, this number may be somewhat lower than the combined total of queued, in-process and finished items, because duplicate URIs may be queued and processed. This variance is likely to be especially high in Frontiers implementing 'revisit' strategies.

Returns:
Number of discovered URIs.

queuedUriCount

public long queuedUriCount()
Number of URIs queued up and waiting for processing.

This includes any URIs that failed but will be retried. Basically this is any discovered URI that has neither been processed nor is being processed. The same discovered URI can be queued multiple times.

Returns:
Number of queued URIs.

finishedUriCount

public long finishedUriCount()
Number of URIs that have finished processing.

Includes both those that were processed successfully and those that failed to be processed (excluding those that failed but will be retried). Does not include those URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely due to operator changing scope definition).

Returns:
Number of finished URIs.
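
How retries interact with the queued and finished counts can be shown with a small sketch. It is an assumption-laden illustration, not Heritrix code: the Outcome enum and the two-argument finished() are hypothetical (the real finished(CrawlURI) takes only the URI), and Strings stand in for CrawlURI.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in, not Heritrix code; Strings stand in for CrawlURI.
class CountersSketch {
    // Hypothetical processing outcomes used only by this example.
    enum Outcome { SUCCESS, FAILED, RETRY }

    private final Deque<String> queue = new ArrayDeque<>();
    private long succeeded;
    private long failed;

    void schedule(String uri) { queue.add(uri); }

    String next() { return queue.poll(); }

    // Records the outcome; a URI that will be retried goes back on the queue.
    void finished(String uri, Outcome outcome) {
        switch (outcome) {
            case SUCCESS:
                succeeded++;
                break;
            case FAILED:
                failed++;
                break;
            case RETRY:
                queue.add(uri); // counted as queued again, not as finished
                break;
        }
    }

    long queuedUriCount() { return queue.size(); }
    long successfullyFetchedCount() { return succeeded; }
    long failedFetchCount() { return failed; }

    // Successes plus permanent failures; retries are excluded.
    long finishedUriCount() { return succeeded + failed; }

    public static void main(String[] args) {
        CountersSketch frontier = new CountersSketch();
        frontier.schedule("http://example.com/ok");
        frontier.schedule("http://example.com/broken");
        frontier.schedule("http://example.com/flaky");
        frontier.finished(frontier.next(), Outcome.SUCCESS);
        frontier.finished(frontier.next(), Outcome.FAILED);
        frontier.finished(frontier.next(), Outcome.RETRY);
        System.out.println("queued=" + frontier.queuedUriCount()
                + " finished=" + frontier.finishedUriCount());
    }
}
```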

pendingUriCount

public long pendingUriCount()
Number of URIs that are awaiting detailed processing.

Number of discovered URIs that have not been inspected for scope or duplicates (generally referred to as pending URIs). Depending on the implementation of the URIFrontier this might always be zero. It may also be an adjusted number that tries to account for duplicates by estimation.

This does not count URIs that were scheduled with batchSchedule() and are waiting for the batch to be flushed.

Returns:
Estimated number of URIs scheduled for processing.

successfullyFetchedCount

public long successfullyFetchedCount()
Number of successfully processed URIs.

Any URI that was processed successfully. This includes URIs that returned 404s and other error codes that do not originate within the crawler.

Returns:
Number of successfully processed URIs.

failedFetchCount

public long failedFetchCount()
Number of URIs that failed to process.

URIs that could not be processed because of some error or failure in the processing chain. Can include failure to acquire prerequisites, to establish a connection with the host and any number of other problems. Does not count those that will be retried, only those that have permanently failed.

Returns:
Number of URIs that failed to process.

disregardedFetchCount

public long disregardedFetchCount()
Number of URIs that were successfully fetched but have been disregarded.

Counts any URI that is successfully fetched only to be disregarded because it is determined to lie outside the scope of the crawl. Most commonly this will be due to robots.txt exclusions.

Returns:
The number of URIs that have been disregarded.

totalBytesWritten

public long totalBytesWritten()
Total number of bytes contained in all URIs that have been processed.

Returns:
The total number of bytes in all processed URIs.

oneLineReport

public java.lang.String oneLineReport()
Compile a one-line summary report about this frontier.

Returns:
A one-line report of this frontier's status.

report

public java.lang.String report()
This method compiles a human readable report on the status of the frontier at the time of the call.

This report should give an accurate picture of the current state of the frontier.

Returns:
A report on the current status of the frontier.

importRecoverLog

public void importRecoverLog(java.lang.String pathToLog)
                      throws java.io.IOException
Recover earlier state by reading a recovery log.

Some Frontiers are able to write detailed logs that can be loaded after a system crash to recover the state of the Frontier prior to the crash. This method is the one used to achieve this.

Parameters:
pathToLog - The name (with full path) of the recover log.
Throws:
java.io.IOException - If problems occur reading the recover log.

getInitialMarker

public URIFrontierMarker getInitialMarker(java.lang.String regexpr,
                                          boolean inCacheOnly)
Get a URIFrontierMarker initialized with the given regular expression at the 'start' of the Frontier.

Parameters:
regexpr - The regular expression that URIs within the frontier must match to be considered within the scope of this marker
inCacheOnly - If set to true, only those URIs within the frontier that are stored in cache (usually this means in memory rather than on disk, but that is an implementation detail) will be considered. Others will be entirely ignored, as if they don't exist. This is useful for quick peeks at the top of the URI list.
Returns:
A URIFrontierMarker that is set for the 'start' of the frontier's URI list.

getURIsList

public java.util.ArrayList getURIsList(URIFrontierMarker marker,
                                       int numberOfMatches,
                                       boolean verbose)
                                throws InvalidURIFrontierMarkerException
Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached.

Any encountered URI is included unless it has been successfully crawled, has terminally failed, has been disregarded or is currently being processed. Thus this includes both discovered and pending URIs. As there may be duplicates in the frontier, there may also be duplicates in the report.

The list contains the URIs as strings. If verbose is true, each string will include some additional information (path to URI and parent).

The URIFrontierMarker will be advanced to the position at which its maximum number of matches found is reached. Reusing it for subsequent calls will thus effectively get the 'next' batch. Making any changes to the frontier can invalidate the marker.

While the order returned is consistent, it does not have any explicit relation to the likely order in which they may be processed.

Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.

Parameters:
marker - A marker specifying from what position in the Frontier the list should begin.
numberOfMatches - how many URIs to add at most to the list before returning it
verbose - if set to true the strings returned will contain additional information about each URI beyond their names.
Returns:
a list of all pending URIs falling within the specification of the marker
Throws:
InvalidURIFrontierMarkerException - when the URIFrontierMarker does not match the internal state of the frontier. Tolerance for this can vary considerably from one URIFrontier implementation to the next.
See Also:
URIFrontierMarker, getInitialMarker(String, boolean)
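
Paging through the frontier with a marker, as described above, might work along these lines. The sketch is hypothetical throughout: Marker is an invented stand-in for URIFrontierMarker (whose internals this page does not describe), the inCacheOnly and verbose flags are omitted, URIs are plain Strings, and whole-string regular expression matching is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in, not Heritrix code.
class MarkerPagingSketch {
    // Invented marker: a compiled pattern plus the index of the next URI to inspect.
    static class Marker {
        final Pattern pattern;
        int position = 0;

        Marker(String regexpr) {
            pattern = Pattern.compile(regexpr);
        }
    }

    private final List<String> pending = new ArrayList<>();

    void schedule(String uri) { pending.add(uri); }

    // A marker positioned at the 'start' of the URI list.
    Marker getInitialMarker(String regexpr) {
        return new Marker(regexpr);
    }

    // Collects up to numberOfMatches matching URIs and advances the marker.
    List<String> getURIsList(Marker marker, int numberOfMatches) {
        List<String> matches = new ArrayList<>();
        while (marker.position < pending.size() && matches.size() < numberOfMatches) {
            String uri = pending.get(marker.position++);
            if (marker.pattern.matcher(uri).matches()) {
                matches.add(uri);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        MarkerPagingSketch frontier = new MarkerPagingSketch();
        frontier.schedule("http://example.com/a.html");
        frontier.schedule("http://example.com/b.png");
        frontier.schedule("http://example.com/c.html");
        Marker marker = frontier.getInitialMarker(".*\\.html");
        // Reusing the same marker yields the 'next' batch on each call.
        List<String> page;
        while (!(page = frontier.getURIsList(marker, 1)).isEmpty()) {
            System.out.println(page);
        }
    }
}
```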

deleteURIs

public long deleteURIs(java.lang.String match)
Delete any URI that matches the given regular expression from the list of discovered and pending URIs. This does not prevent them from being rediscovered.

Any encountered URI is considered to be a pending URI unless it has been successfully crawled, has terminally failed, has been disregarded or is currently being processed.

Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.

Parameters:
match - A regular expression; any URI that matches it will be deleted.
Returns:
The number of URIs deleted
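
A regular-expression delete with a returned count could be sketched as below. This is a hypothetical illustration rather than the Heritrix implementation: URIs are plain Strings in a single list, and it assumes the expression must match the whole URI string, which this page does not specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in, not Heritrix code.
class DeleteUrisSketch {
    private final List<String> pending = new ArrayList<>();

    void schedule(String uri) { pending.add(uri); }

    // Removes every pending URI matching the expression; returns how many were removed.
    long deleteURIs(String match) {
        Pattern pattern = Pattern.compile(match);
        int before = pending.size();
        pending.removeIf(uri -> pattern.matcher(uri).matches()); // assumption: full match
        return before - pending.size();
    }

    long queuedUriCount() { return pending.size(); }

    public static void main(String[] args) {
        DeleteUrisSketch frontier = new DeleteUrisSketch();
        frontier.schedule("http://example.com/page");
        frontier.schedule("http://ads.example.com/banner");
        System.out.println(frontier.deleteURIs("http://ads\\..*"));
        // Nothing prevents the same URI from being scheduled again later.
        frontier.schedule("http://ads.example.com/banner");
        System.out.println(frontier.queuedUriCount());
    }
}
```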

deleted

public void deleted(CrawlURI curi)
Notify Frontier that a CrawlURI has been deleted outside of the normal next()/finished() lifecycle.

Parameters:
curi - Deleted CrawlURI.

considerIncluded

public void considerIncluded(UURI u)
Notify Frontier that it should consider the given UURI as if already scheduled.

Parameters:
u - UURI instance to add to the Already Included set.
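
The effect of adding a UURI to the Already Included set can be illustrated with a short sketch. It is hypothetical and simplified: Strings stand in for UURI and CandidateURI, and it assumes that schedule() silently skips anything already in the set, which is one plausible reading of "as if already scheduled".

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in, not Heritrix code.
class AlreadyIncludedSketch {
    private final Set<String> alreadyIncluded = new HashSet<>();
    private final Deque<String> queue = new ArrayDeque<>();

    // Marks the URI as seen without queueing it.
    void considerIncluded(String uri) {
        alreadyIncluded.add(uri);
    }

    // Queues the URI only if it has not been seen before (assumed behaviour).
    void schedule(String uri) {
        if (alreadyIncluded.add(uri)) {
            queue.add(uri);
        }
    }

    long queuedUriCount() { return queue.size(); }

    public static void main(String[] args) {
        AlreadyIncludedSketch frontier = new AlreadyIncludedSketch();
        frontier.considerIncluded("http://example.com/old");
        frontier.schedule("http://example.com/old"); // skipped
        frontier.schedule("http://example.com/new"); // queued
        System.out.println(frontier.queuedUriCount());
    }
}
```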

kickUpdate

public void kickUpdate()
Notify Frontier that it should consider updating configuration info that may have changed in external files.


getFrontierJournal

public FrontierJournal getFrontierJournal()
Returns:
Return the instance of FrontierJournal that this Frontier is using. May be null if no journaling.


Copyright © 2003-2004 Internet Archive. All Rights Reserved.