An interface for URI Frontiers.
A URI Frontier is a pluggable module in Heritrix that maintains the internal state of the crawl. This includes (but is not limited to):
The Frontier is also responsible for enforcing any politeness restrictions that may have been applied to the crawl, such as limiting simultaneous connections to the same host, server or IP number to 1 (or any other fixed amount), or imposing delays between connections.
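The politeness rules described above could be tracked with a small per-host bookkeeping structure. The following is only a sketch under stated assumptions, not how any Heritrix frontier does it: the class `PolitenessSketch`, its methods, and the explicit `nowMs` clock parameter are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch (not part of the Heritrix API): at most one connection
// per host at a time, plus a fixed delay between connections to that host.
class PolitenessSketch {
    private final long minDelayMs;
    // Hosts that currently have a connection in flight.
    private final Set<String> busy = new HashSet<>();
    // Earliest time (in ms) at which each host may be contacted again.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessSketch(long minDelayMs) {
        this.minDelayMs = minDelayMs;
    }

    // Returns true, and marks the host busy, if a fetch may start at nowMs.
    synchronized boolean tryAcquire(String host, long nowMs) {
        if (busy.contains(host)) {
            return false; // a connection to this host is already open
        }
        if (nowMs < nextAllowed.getOrDefault(host, 0L)) {
            return false; // still inside the delay window
        }
        busy.add(host);
        return true;
    }

    // Called when the fetch ends; starts the delay window for the host.
    synchronized void release(String host, long nowMs) {
        busy.remove(host);
        nextAllowed.put(host, nowMs + minDelayMs);
    }
}
```

Passing the clock value in as a parameter keeps the sketch deterministic; a real module would more likely read the system clock itself.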
A URIFrontier is created by the CrawlController, which is in turn responsible for providing access to it. Most significant among those modules interested in the Frontier are the ToeThreads, which perform the actual work of processing a URI.
The methods defined in this interface are those required to get URIs for processing and to report the results of processing back (both used by ToeThreads), and to get access to various statistical data along the way. The statistical data is of interest to Statistics Tracking modules. A couple of additional methods are provided for inspecting and manipulating the Frontier at runtime.
The statistical data exposed by this interface is:
Discovered URIs
Queued URIs
Finished URIs
Pending URIs
Successfully processed URIs
Failed to process URIs
Disregarded URIs
Total bytes written
In addition the frontier may optionally implement an interface that exposes information about hosts.
Furthermore, any implementation of the URI Frontier should trigger CrawlURIDispositionEvents by invoking the proper methods on the CrawlController. Doing this allows a custom-built Statistics Tracking module to gather any additional data it might be interested in by examining the completed URIs.
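The event mechanism described above can be pictured as a plain listener dispatch. This is a self-contained sketch with hypothetical names (`DispositionSketch`, `Outcome`, `Listener`); the four outcomes merely mirror the four `fireCrawledURI...Event` methods this page lists for CrawlController, and URIs are plain strings rather than CrawlURI objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the frontier/controller event path.
class DispositionSketch {
    // One value per disposition named in the CrawlController fire* methods.
    enum Outcome { SUCCESSFUL, NEED_RETRY, DISREGARD, FAILURE }

    // What a statistics module would implement to observe completed URIs.
    interface Listener {
        void onDisposition(String uri, Outcome outcome);
    }

    private final List<Listener> listeners = new ArrayList<>();

    void addListener(Listener listener) {
        listeners.add(listener);
    }

    // Called when a URI finishes; notifies every registered listener.
    void finished(String uri, Outcome outcome) {
        for (Listener listener : listeners) {
            listener.onDisposition(uri, outcome);
        }
    }
}
```

A custom statistics module would register a listener and keep whatever tallies it needs, for example a count of failures per host.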
All URI Frontiers inherit from ModuleType and therefore creating settings follows the usual pattern of pluggable modules in Heritrix.
See Also:
CrawlController, CrawlController.fireCrawledURIDisregardEvent(CrawlURI), CrawlController.fireCrawledURIFailureEvent(CrawlURI), CrawlController.fireCrawledURINeedRetryEvent(CrawlURI), CrawlController.fireCrawledURISuccessfulEvent(CrawlURI), StatisticsTracking, ToeThread, URIFrontierHostStatistics, ModuleType
Field Summary | |
static java.lang.String | ATTR_NAME | All URI Frontiers should have the same 'name' attribute. |
Method Summary | |
void | batchFlush() | Forces all the URIs that have been batched up for scheduling by the batchSchedule() method to be actually scheduled. |
void | batchSchedule(CandidateURI caURI) | Schedules a CandidateURI. |
void | considerIncluded(UURI u) | Notify Frontier that it should consider the given UURI as if already scheduled. |
void | deleted(CrawlURI curi) | Notify Frontier that a CrawlURI has been deleted outside of the normal next()/finished() lifecycle. |
long | deleteURIs(java.lang.String match) | Delete any URI that matches the given regular expression from the list of discovered and pending URIs. |
long | discoveredUriCount() | Number of discovered URIs. |
long | disregardedFetchCount() | Number of URIs that were successfully fetched but have been disregarded. |
long | failedFetchCount() | Number of URIs that failed to process. |
void | finished(CrawlURI cURI) | Report a URI being processed as having finished processing. |
long | finishedUriCount() | Number of URIs that have finished processing. |
FrontierJournal | getFrontierJournal() | |
URIFrontierMarker | getInitialMarker(java.lang.String regexpr, boolean inCacheOnly) | Get a URIFrontierMarker initialized with the given regular expression at the 'start' of the Frontier. |
java.util.ArrayList | getURIsList(URIFrontierMarker marker, int numberOfMatches, boolean verbose) | Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached. |
void | importRecoverLog(java.lang.String pathToLog) | Recover earlier state by reading a recovery log. |
void | initialize(CrawlController c) | Initialize the Frontier. |
boolean | isEmpty() | Returns true if the frontier contains no more URIs to crawl. |
void | kickUpdate() | Notify Frontier that it should consider updating configuration info that may have changed in external files. |
CrawlURI | next(int timeout) | Get the next URI that should be processed. |
java.lang.String | oneLineReport() | Compile a one-line summary report about this frontier. |
long | pendingUriCount() | Number of URIs that are awaiting detailed processing. |
long | queuedUriCount() | Number of URIs queued up and waiting for processing. |
java.lang.String | report() | This method compiles a human readable report on the status of the frontier at the time of the call. |
void | schedule(CandidateURI caURI) | Schedules a CandidateURI. |
long | successfullyFetchedCount() | Number of successfully processed URIs. |
long | totalBytesWritten() | Total number of bytes contained in all URIs that have been processed. |
Field Detail |
public static final java.lang.String ATTR_NAME
See Also:
ModuleType.ModuleType(String), Constant Field Values
Method Detail |
public void initialize(CrawlController c) throws FatalConfigurationException, java.io.IOException
Initialize the Frontier.
This method is invoked by the CrawlController once it has created the Frontier. The constructor of the Frontier should only contain code for setting up its settings framework. This method should contain all other 'startup' code.
Parameters:
c - The CrawlController that created the Frontier.
Throws:
FatalConfigurationException - If provided settings are illegal or otherwise unusable.
java.io.IOException - If there is a problem reading settings or seeds file from disk.
public CrawlURI next(int timeout) throws java.lang.InterruptedException
Get the next URI that should be processed.
Parameters:
timeout - how long the calling thread is willing to wait for the next URI to become available (milliseconds).
Throws:
java.lang.InterruptedException
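The way a worker thread would typically combine next(int) and finished() can be sketched with a blocking queue. Everything in this block is a hypothetical stand-in (`QueueFrontierSketch`, string URIs instead of CrawlURI), and it assumes that a timed-out wait yields `null`, which this page does not state.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a frontier; URIs are plain strings here.
class QueueFrontierSketch {
    private final BlockingQueue<String> ready = new LinkedBlockingQueue<>();
    private long finished = 0;

    void schedule(String uri) {
        ready.add(uri);
    }

    // Waits up to timeoutMs for a URI. ASSUMPTION: null signals a timeout.
    String next(int timeoutMs) throws InterruptedException {
        return ready.poll(timeoutMs, TimeUnit.MILLISECONDS);
    }

    synchronized void finished(String uri) {
        finished++;
    }

    synchronized long finishedUriCount() {
        return finished;
    }

    // Worker loop: pull a URI, process it, report back. Stops at the first
    // timeout or interrupt and returns how many URIs were handled.
    static int drain(QueueFrontierSketch frontier, int timeoutMs) {
        int processed = 0;
        try {
            String uri;
            while ((uri = frontier.next(timeoutMs)) != null) {
                // ... fetching and processing of uri would happen here ...
                frontier.finished(uri);
                processed++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        return processed;
    }
}
```

The bounded wait keeps a worker from blocking forever when nothing is ready, so it can periodically check whether the crawl has been paused or stopped.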
public boolean isEmpty()
Returns true if the frontier contains no more URIs to crawl.
That is to say, there are no URIs currently available (ready to be emitted), no URIs belonging to deferred hosts, and no pending URIs in the Frontier. Thus this method may return false even if there is no currently available URI.
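The distinction between "nothing is ready right now" and "the frontier is empty" can be illustrated with two queues. This is a hypothetical sketch (`DeferredFrontierSketch` and its methods are not Heritrix names), with a non-blocking `pollReady()` standing in for next(int).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: URIs are either ready to emit or deferred
// (for example, because their host is inside a politeness delay).
class DeferredFrontierSketch {
    private final Deque<String> ready = new ArrayDeque<>();
    private final Deque<String> deferred = new ArrayDeque<>();

    void addReady(String uri) {
        ready.add(uri);
    }

    void addDeferred(String uri) {
        deferred.add(uri);
    }

    // Returns null when no URI can be emitted at this moment.
    String pollReady() {
        return ready.poll();
    }

    // Empty only when neither ready nor deferred URIs remain.
    boolean isEmpty() {
        return ready.isEmpty() && deferred.isEmpty();
    }
}
```

A caller that treated a `null` from `pollReady()` as the end of the crawl would stop too early; checking `isEmpty()` avoids that.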
public void schedule(CandidateURI caURI)
Schedules a CandidateURI.
This method accepts one URI and schedules it immediately. This has nothing to do with the priority of the URI being scheduled, only that it will be placed in its respective queue at once. For priority scheduling see CandidateURI.
This method should be synchronized in all implementing classes.
Parameters:
caURI - The URI to schedule.
See Also:
batchSchedule(CandidateURI), CandidateURI.setSchedulingDirective(String)
public void batchSchedule(CandidateURI caURI)
Schedules a CandidateURI.
This is a non-synchronized method for scheduling large numbers of URIs at a time. All URIs scheduled with this method will be 'held' in a thread-specific container until batchFlush() is invoked.
Parameters:
caURI - The URI to schedule.
See Also:
schedule(CandidateURI), batchFlush()
public void batchFlush()
Forces all the URIs that have been batched up for scheduling by the batchSchedule() method to be actually scheduled.
This is a synchronized method.
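One way to realise the batchSchedule()/batchFlush() split is a per-thread holding list that is moved into the shared queue under a single lock. The block below is a sketch of that idea only; `BatchSchedulingSketch`, the use of `ThreadLocal`, and string URIs are assumptions, not details taken from any Heritrix frontier.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of thread-local batching in front of a shared queue.
class BatchSchedulingSketch {
    // Per-thread holding area; adding to it needs no locking.
    private final ThreadLocal<List<String>> batch =
            ThreadLocal.withInitial(ArrayList::new);
    // Shared queue, guarded by this object's monitor.
    private final List<String> queue = new ArrayList<>();

    // Non-synchronized: only touches the calling thread's own list.
    void batchSchedule(String uri) {
        batch.get().add(uri);
    }

    // Synchronized: places a single URI in the shared queue at once.
    synchronized void schedule(String uri) {
        queue.add(uri);
    }

    // Synchronized: moves the calling thread's batch into the shared queue.
    synchronized void batchFlush() {
        List<String> held = batch.get();
        queue.addAll(held);
        held.clear();
    }

    synchronized int queuedCount() {
        return queue.size();
    }
}
```

The benefit of this shape is that a thread discovering many links takes the lock once per flush rather than once per URI.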
public void finished(CrawlURI cURI)
Report a URI being processed as having finished processing.
ToeThreads will invoke this method once they have completed work on their assigned URI.
This method is synchronized and also schedules any URIs that have been batched up by batchSchedule().
Parameters:
cURI - The URI that has finished processing.
See Also:
batchFlush()
public long discoveredUriCount()
Number of discovered URIs.
That is, any URI that has been confirmed to be within 'scope' (i.e. the Frontier decides that it should be processed). This includes those that have been processed, are being processed and have finished processing. Does not include URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely due to the operator changing the scope definition).
Note: This only counts discovered URIs. Since the same URI can (at least in most frontiers) be fetched multiple times, this number may be somewhat lower than the combined total of queued, in-process and finished items, due to duplicate URIs being queued and processed. This variance is likely to be especially high in Frontiers implementing 'revisit' strategies.
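The note about duplicates can be made concrete with a toy counter: discovered URIs are counted once each, while every enqueue counts. `CountingSketch` is a hypothetical illustration of that relationship, not a description of how any frontier keeps its statistics.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: unique discovered URIs versus total enqueue operations.
class CountingSketch {
    private final Set<String> discovered = new HashSet<>();
    private long queued = 0;

    // The same URI may be enqueued repeatedly (e.g. by a revisit strategy).
    void enqueue(String uri) {
        discovered.add(uri); // no effect if the URI was seen before
        queued++;
    }

    long discoveredUriCount() {
        return discovered.size();
    }

    long queuedUriCount() {
        return queued;
    }
}
```

Enqueuing the same URI twice raises the queued count twice but the discovered count only once, which is why the discovered figure can be the smaller of the two.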
public long queuedUriCount()
Number of URIs queued up and waiting for processing.
This includes any URIs that failed but will be retried. Basically this is any discovered URI that has neither been processed nor is being processed. The same discovered URI can be queued multiple times.
public long finishedUriCount()
Number of URIs that have finished processing.
Includes both those that were processed successfully and those that failed to be processed (excluding those that failed but will be retried). Does not include those URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely due to the operator changing the scope definition).
public long pendingUriCount()
Number of URIs that are awaiting detailed processing.
Number of discovered URIs that have not been inspected for scope or duplicates (generally referred to as pending URIs). Depending on the implementation of the URIFrontier this might always be zero. It may also be an adjusted number that tries to account for duplicates by estimation.
This does not count URIs that were scheduled with batchSchedule() and are waiting for the batch to be flushed.
public long successfullyFetchedCount()
Number of successfully processed URIs.
Any URI that was processed successfully. This includes URIs that returned 404s and other error codes that do not originate within the crawler.
public long failedFetchCount()
Number of URIs that failed to process.
URIs that could not be processed because of some error or failure in the processing chain. Can include failure to acquire prerequisites, to establish a connection with the host and any number of other problems. Does not count those that will be retried, only those that have permanently failed.
public long disregardedFetchCount()
Number of URIs that were successfully fetched but have been disregarded.
Counts any URI that is successfully fetched only to be disregarded because it is determined to lie outside the scope of the crawl. Most commonly this will be due to robots.txt exclusions.
public long totalBytesWritten()
Total number of bytes contained in all URIs that have been processed.
public java.lang.String oneLineReport()
Compile a one-line summary report about this frontier.
public java.lang.String report()
This method compiles a human readable report on the status of the frontier at the time of the call.
This report should give an accurate picture of the current state of the frontier.
public void importRecoverLog(java.lang.String pathToLog) throws java.io.IOException
Recover earlier state by reading a recovery log.
Some Frontiers are able to write detailed logs that can be loaded after a system crash to recover the state of the Frontier prior to the crash. This method is the one used to achieve this.
Parameters:
pathToLog - The name (with full path) of the recover log.
Throws:
java.io.IOException - If problems occur reading the recover log.
public URIFrontierMarker getInitialMarker(java.lang.String regexpr, boolean inCacheOnly)
Get a URIFrontierMarker initialized with the given regular expression at the 'start' of the Frontier.
Parameters:
regexpr - The regular expression that URIs within the frontier must match to be considered within the scope of this marker.
inCacheOnly - If set to true, only those URIs within the frontier that are stored in cache (usually this means in memory rather than on disk, but that is an implementation detail) will be considered. Others will be entirely ignored, as if they don't exist. This is useful for quick peeks at the top of the URI list.
public java.util.ArrayList getURIsList(URIFrontierMarker marker, int numberOfMatches, boolean verbose) throws InvalidURIFrontierMarkerException
Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached.
Any encountered URI that has not been successfully crawled, terminally failed, disregarded or is currently being processed is included. As there may be duplicates in the frontier, there may also be duplicates in the report. Thus this includes both discovered and pending URIs.
Each element of the list is a string containing a URI. If verbose is true the string will include some additional information (path to URI and parent).
The URIFrontierMarker will be advanced to the position at which its maximum number of matches found is reached. Reusing it for subsequent calls will thus effectively get the 'next' batch. Making any changes to the frontier can invalidate the marker.
While the order returned is consistent, it does not have any explicit relation to the likely order in which they may be processed.
Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.
Parameters:
marker - A marker specifying from what position in the Frontier the list should begin.
numberOfMatches - how many URIs to add at most to the list before returning it.
verbose - if set to true the strings returned will contain additional information about each URI beyond their names.
Throws:
InvalidURIFrontierMarkerException - when the URIFrontierMarker does not match the internal state of the frontier. Tolerance for this can vary considerably from one URIFrontier implementation to the next.
See Also:
URIFrontierMarker, getInitialMarker(String, boolean)
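The marker-based paging described for getInitialMarker() and getURIsList() can be sketched as a regular expression plus a position that advances on every call. `MarkerSketch` and its nested `Marker` are hypothetical stand-ins; the real URIFrontierMarker, the inCacheOnly and verbose flags, and marker invalidation are not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of paging through a frontier with a marker.
class MarkerSketch {
    // A compiled pattern plus the position the next call should start from.
    static final class Marker {
        final Pattern pattern;
        int position = 0;

        Marker(String regex) {
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<String> uris = new ArrayList<>();

    void add(String uri) {
        uris.add(uri);
    }

    // Marker placed at the 'start' of the frontier.
    Marker getInitialMarker(String regex) {
        return new Marker(regex);
    }

    // Collects up to numberOfMatches matching URIs and advances the marker,
    // so a later call with the same marker returns the next batch.
    List<String> getURIsList(Marker marker, int numberOfMatches) {
        List<String> out = new ArrayList<>();
        while (marker.position < uris.size() && out.size() < numberOfMatches) {
            String uri = uris.get(marker.position++);
            if (marker.pattern.matcher(uri).matches()) {
                out.add(uri);
            }
        }
        return out;
    }
}
```

Because the marker stores only an index, inserting or deleting URIs between calls would shift what it points at, which is the kind of invalidation the warning above is about.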
public long deleteURIs(java.lang.String match)
Delete any URI that matches the given regular expression from the list of discovered and pending URIs.
Any encountered URI that has not been successfully crawled, terminally failed, disregarded or is currently being processed is considered to be a pending URI.
Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.
Parameters:
match - A regular expression; any URI that matches it will be deleted.
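Regex-based deletion of this kind can be sketched as a filtered removal over the pending URIs. `DeleteSketch` is hypothetical, it uses whole-string matching via `Matcher.matches()`, and it assumes the `long` result is the number of URIs removed, which this page does not spell out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of deleting pending URIs by regular expression.
class DeleteSketch {
    private final List<String> pending = new ArrayList<>();

    synchronized void add(String uri) {
        pending.add(uri);
    }

    // Removes every URI that fully matches the expression.
    // ASSUMPTION: the return value is the number of URIs removed.
    synchronized long deleteURIs(String match) {
        Pattern pattern = Pattern.compile(match);
        int before = pending.size();
        pending.removeIf(uri -> pattern.matcher(uri).matches());
        return before - pending.size();
    }

    synchronized int size() {
        return pending.size();
    }
}
```

Compiling the pattern once per call, rather than once per URI, keeps the cost proportional to the number of pending URIs.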
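public void deleted(CrawlURI curi)
Notify Frontier that a CrawlURI has been deleted outside of the normal next()/finished() lifecycle.
Parameters:
curi - Deleted CrawlURI.
public void considerIncluded(UURI u)
Notify Frontier that it should consider the given UURI as if already scheduled.
Parameters:
u - UURI instance to add to the Already Included set.
public void kickUpdate()
Notify Frontier that it should consider updating configuration info that may have changed in external files.
public FrontierJournal getFrontierJournal()
Returns:
The FrontierJournal that this Frontier is using. May be null if no journaling.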