public interface Frontier
An interface for URI Frontiers.
A URI Frontier is a pluggable module in Heritrix that maintains the internal state of the crawl. This includes (but is not limited to):
The Frontier is also responsible for enforcing any politeness restrictions that may have been applied to the crawl, such as limiting simultaneous connections to the same host, server or IP number to 1 (or any other fixed amount), or imposing delays between connections.
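The politeness restrictions described above can be pictured with a small sketch. It is not taken from any Heritrix Frontier; the class `PolitenessSketch`, its fields and its method names are hypothetical, and it assumes a fixed delay and a fixed per-host connection limit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not a Heritrix class: decides whether a host may be contacted.
class PolitenessSketch {
    private final long minDelayMs;            // assumed fixed delay between connections
    private final int maxConnectionsPerHost;  // assumed fixed connection limit
    private final Map<String, Long> nextAllowed = new HashMap<>();
    private final Map<String, Integer> active = new HashMap<>();

    PolitenessSketch(long minDelayMs, int maxConnectionsPerHost) {
        this.minDelayMs = minDelayMs;
        this.maxConnectionsPerHost = maxConnectionsPerHost;
    }

    // True if a new connection to host may be opened at time nowMs.
    synchronized boolean mayFetch(String host, long nowMs) {
        int inFlight = active.getOrDefault(host, 0);
        long earliest = nextAllowed.getOrDefault(host, 0L);
        return inFlight < maxConnectionsPerHost && nowMs >= earliest;
    }

    // Record that a connection to host has been opened.
    synchronized void started(String host) {
        active.merge(host, 1, Integer::sum);
    }

    // Record that a connection closed at nowMs and start the delay window.
    synchronized void finished(String host, long nowMs) {
        active.merge(host, -1, Integer::sum);
        nextAllowed.put(host, nowMs + minDelayMs);
    }

    public static void main(String[] args) {
        PolitenessSketch p = new PolitenessSketch(2000, 1);
        System.out.println(p.mayFetch("example.org", 0));    // true
        p.started("example.org");
        System.out.println(p.mayFetch("example.org", 10));   // false: one connection open
        p.finished("example.org", 100);
        System.out.println(p.mayFetch("example.org", 1000)); // false: still inside the delay
        System.out.println(p.mayFetch("example.org", 2100)); // true
    }
}
```

Time is passed in explicitly so the behaviour can be checked without sleeping; a real implementation would more likely read a clock.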
A URIFrontier is created by the CrawlController, which is in turn responsible for providing access to it. Most significant among those modules interested in the Frontier are the ToeThreads, which perform the actual work of processing a URI.
The methods defined in this interface are those required to get URIs for processing, to report the results of processing back (ToeThreads), and to get access to various statistical data along the way. The statistical data is of interest to Statistics Tracking modules. A couple of additional methods are provided to inspect and manipulate the Frontier at runtime.
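As a rough illustration of the get/report cycle described above, the following sketch drives a toy frontier the way a worker thread might. `MiniFrontier`, `MiniEndedException` and `WorkerLoopSketch` are stand-ins written for this example, not Heritrix classes, and URIs are plain strings rather than CrawlURI objects. The toy `next()` ends the crawl as soon as the queue is empty, which is a simplification.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Stand-in for an "ended" signal; declared here only to keep the sketch self-contained.
class MiniEndedException extends Exception {
    MiniEndedException(String msg) { super(msg); }
}

// Toy frontier exposing only schedule(), next() and finished(). Not a Heritrix class.
class MiniFrontier {
    private final Deque<String> queue = new ArrayDeque<>();
    private long finished = 0;

    synchronized void schedule(String uri) { queue.addLast(uri); }

    // Simplification: this toy signals the end of the crawl when nothing is left.
    synchronized String next() throws MiniEndedException {
        String uri = queue.pollFirst();
        if (uri == null) throw new MiniEndedException("no more URIs");
        return uri;
    }

    synchronized void finished(String uri) { finished++; }

    synchronized long finishedUriCount() { return finished; }
}

class WorkerLoopSketch {
    // Take a URI, process it, report back; repeat until the frontier ends the crawl.
    static long drain(MiniFrontier frontier) {
        try {
            while (true) {
                String uri = frontier.next();
                // ... fetching and processing of uri would happen here ...
                frontier.finished(uri);
            }
        } catch (MiniEndedException e) {
            return frontier.finishedUriCount();
        }
    }

    public static void main(String[] args) {
        MiniFrontier f = new MiniFrontier();
        f.schedule("http://example.org/");
        f.schedule("http://example.org/a");
        System.out.println(drain(f)); // prints 2
    }
}
```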
The statistical data exposed by this interface is:
Discovered URIs
Queued URIs
Finished URIs
Successfully processed URIs
Failed to process URIs
Disregarded URIs
Total bytes written
In addition the frontier may optionally implement an interface that exposes information about hosts.
Furthermore, any implementation of the URI Frontier should trigger CrawlURIDispositionEvents by invoking the proper methods on the CrawlController. Doing this allows a custom-built Statistics Tracking module to gather any additional data it might be interested in by examining the completed URIs.
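The event flow described above might look roughly like the sketch below. All types here are stand-ins written for illustration: the listener interface and its method names are assumptions, URIs are plain strings, and only the two `fire...` method names are borrowed from CrawlController.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in listener; the real Heritrix listener interface may differ.
interface DispositionListenerSketch {
    void crawledURISuccessful(String uri);
    void crawledURIFailure(String uri);
}

// Stand-in controller with two fire methods named after those on CrawlController.
class ControllerSketch {
    private final List<DispositionListenerSketch> listeners = new ArrayList<>();

    void addListener(DispositionListenerSketch l) { listeners.add(l); }

    void fireCrawledURISuccessfulEvent(String uri) {
        for (DispositionListenerSketch l : listeners) l.crawledURISuccessful(uri);
    }

    void fireCrawledURIFailureEvent(String uri) {
        for (DispositionListenerSketch l : listeners) l.crawledURIFailure(uri);
    }
}

// Toy frontier that reports each completed URI to the controller.
class EventingFrontierSketch {
    private final ControllerSketch controller;

    EventingFrontierSketch(ControllerSketch controller) { this.controller = controller; }

    // 'success' is a hypothetical flag; a real frontier would inspect the URI's status.
    void finished(String uri, boolean success) {
        if (success) controller.fireCrawledURISuccessfulEvent(uri);
        else controller.fireCrawledURIFailureEvent(uri);
    }
}

// Example of a custom statistics module that only counts outcomes.
class CountingTrackerSketch implements DispositionListenerSketch {
    int ok;
    int failed;
    public void crawledURISuccessful(String uri) { ok++; }
    public void crawledURIFailure(String uri) { failed++; }

    public static void main(String[] args) {
        ControllerSketch controller = new ControllerSketch();
        CountingTrackerSketch tracker = new CountingTrackerSketch();
        controller.addListener(tracker);
        EventingFrontierSketch frontier = new EventingFrontierSketch(controller);
        frontier.finished("http://example.org/", true);
        frontier.finished("http://example.org/missing", false);
        System.out.println(tracker.ok + " ok, " + tracker.failed + " failed"); // 1 ok, 1 failed
    }
}
```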
All URI Frontiers inherit from ModuleType, and therefore creating settings follows the usual pattern of pluggable modules in Heritrix.
See Also: CrawlController, CrawlController.fireCrawledURIDisregardEvent(CrawlURI), CrawlController.fireCrawledURIFailureEvent(CrawlURI), CrawlController.fireCrawledURINeedRetryEvent(CrawlURI), CrawlController.fireCrawledURISuccessfulEvent(CrawlURI), StatisticsTracking, ToeThread, FrontierHostStatistics, ModuleType
Nested Class Summary | | |
---|---|---|
static interface | Frontier.FrontierGroup | Generic interface representing the internal groupings of a Frontier's URIs, usually queues. |
Field Summary | | |
---|---|---|
static java.lang.String | ATTR_NAME | All URI Frontiers should have the same 'name' attribute. |
Method Summary | | |
---|---|---|
long | averageDepth() | |
float | congestionRatio() | |
void | considerIncluded(UURI u) | Notify Frontier that it should consider the given UURI as if already scheduled. |
long | deepestUri() | |
void | deleted(CrawlURI curi) | Notify Frontier that a CrawlURI has been deleted outside of the normal next()/finished() lifecycle. |
long | deleteURIs(java.lang.String match) | Delete any URI that matches the given regular expression from the list of discovered and pending URIs. |
long | deleteURIs(java.lang.String uriMatch, java.lang.String queueMatch) | Delete any URI that matches the given regular expression from the list of discovered and pending URIs, if it is in a queue with a name matching the second regular expression. |
long | discoveredUriCount() | Number of discovered URIs. |
long | disregardedUriCount() | Number of URIs that were scheduled at one point but have been disregarded. |
long | failedFetchCount() | Number of URIs that failed to process. |
void | finalTasks() | Perform any final tasks *before* notification that the crawl has reached 'FINISHED' status. |
void | finished(CrawlURI cURI) | Report a URI being processed as having finished processing. |
long | finishedUriCount() | Number of URIs that have finished processing. |
java.lang.String | getClassKey(CandidateURI cauri) | |
FrontierJournal | getFrontierJournal() | |
Frontier.FrontierGroup | getGroup(CrawlURI curi) | Get the 'frontier group' (usually queue) for the given CrawlURI. |
FrontierMarker | getInitialMarker(java.lang.String regexpr, boolean inCacheOnly) | Get a URIFrontierMarker initialized with the given regular expression at the 'start' of the Frontier. |
java.util.ArrayList<java.lang.String> | getURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose) | Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached. |
void | importRecoverLog(java.lang.String pathToLog, boolean retainFailures) | Recover earlier state by reading a recovery log. |
void | initialize(CrawlController c) | Initialize the Frontier. |
boolean | isEmpty() | Returns true if the frontier contains no more URIs to crawl. |
void | kickUpdate() | Notify Frontier that it should consider updating configuration info that may have changed in external files. |
void | loadSeeds() | Request that the Frontier load (or reload) crawl seeds, typically by contacting the Scope. |
CrawlURI | next() | Get the next URI that should be processed. |
void | pause() | Notify Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise. |
long | queuedUriCount() | Number of URIs queued up and waiting for processing. |
void | schedule(CandidateURI caURI) | Schedules a CandidateURI. |
void | start() | Request that Frontier allow crawling to begin. |
long | succeededFetchCount() | Number of successfully processed URIs. |
void | terminate() | Notify Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException. |
long | totalBytesWritten() | Deprecated. misnomer; consult StatisticsTracker instead |
void | unpause() | Resumes the release of URIs to crawl, allowing worker ToeThreads to proceed. |
Methods inherited from interface org.archive.util.Reporter |
---|
getReports, reportTo, reportTo, singleLineLegend, singleLineReport, singleLineReportTo |
Field Detail |
---|
static final java.lang.String ATTR_NAME
All URI Frontiers should have the same 'name' attribute.
See Also: ModuleType.ModuleType(String), Constant Field Values
Method Detail |
---|
void initialize(CrawlController c) throws FatalConfigurationException, java.io.IOException
Initialize the Frontier. This method is invoked by the CrawlController once it has created the Frontier. The constructor of the Frontier should only contain code for setting up its settings framework. This method should contain all other 'startup' code.
Parameters:
c - The CrawlController that created the Frontier.
Throws:
FatalConfigurationException - If provided settings are illegal or otherwise unusable.
java.io.IOException - If there is a problem reading settings or seeds file from disk.
CrawlURI next() throws java.lang.InterruptedException, EndedException
Get the next URI that should be processed.
Throws:
java.lang.InterruptedException
EndedException
boolean isEmpty()
Returns true if the frontier contains no more URIs to crawl. That is to say, there are no URIs currently available (ready to be emitted), no URIs belonging to deferred hosts, and no pending URIs in the Frontier. Thus this method may return false even if there is no currently available URI.
void schedule(CandidateURI caURI)
Schedules a CandidateURI. This method accepts one URI and schedules it immediately. This has nothing to do with the priority of the URI being scheduled, only that it will be placed in its respective queue at once. For priority scheduling see CandidateURI.setSchedulingDirective(int).
This method should be synchronized in all implementing classes.
Parameters:
caURI - The URI to schedule.
See Also: CandidateURI.setSchedulingDirective(int)
void finished(CrawlURI cURI)
Report a URI being processed as having finished processing. ToeThreads will invoke this method once they have completed work on their assigned URI.
This method is synchronized.
Parameters:
cURI - The URI that has finished processing.
long discoveredUriCount()
Number of discovered URIs. That is, any URI that has been confirmed to be within 'scope' (i.e. the Frontier decides that it should be processed). This includes those that have been processed, are being processed and have finished processing. Does not include URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely due to the operator changing the scope definition).
Note: This only counts discovered URIs. Since the same URI can (at least in most frontiers) be fetched multiple times, this number may be somewhat lower than the combined total of queued, in-process and finished items, due to duplicate URIs being queued and processed. This variance is likely to be especially high in Frontiers implementing 'revisit' strategies.
long queuedUriCount()
Number of URIs queued up and waiting for processing. This includes any URIs that failed but will be retried. Basically this is any discovered URI that has neither been processed nor is being processed. The same discovered URI can be queued multiple times.
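The difference between the discovered and queued counts can be sketched with a toy model. `CountingFrontierSketch` and its `forceRevisit` flag are hypothetical and exist only to show how one discovered URI can be queued more than once; no actual Frontier is claimed to work this way.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Toy model, not a Heritrix class: discovered URIs are unique, queued entries need not be.
class CountingFrontierSketch {
    private final Set<String> discovered = new HashSet<>();
    private final Deque<String> pending = new ArrayDeque<>();

    // 'forceRevisit' is a hypothetical flag modelling a frontier that re-queues known URIs.
    synchronized void schedule(String uri, boolean forceRevisit) {
        boolean isNew = discovered.add(uri);
        if (isNew || forceRevisit) {
            pending.addLast(uri);
        }
    }

    synchronized long discoveredUriCount() { return discovered.size(); }

    synchronized long queuedUriCount() { return pending.size(); }

    public static void main(String[] args) {
        CountingFrontierSketch f = new CountingFrontierSketch();
        f.schedule("http://example.org/a", false);
        f.schedule("http://example.org/a", true);  // same URI queued again for a revisit
        f.schedule("http://example.org/b", false);
        System.out.println(f.discoveredUriCount()); // prints 2
        System.out.println(f.queuedUriCount());     // prints 3
    }
}
```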
long deepestUri()
long averageDepth()
float congestionRatio()
long finishedUriCount()
Number of URIs that have finished processing. Includes both those that were processed successfully and those that failed to be processed (excluding those that failed but will be retried). Does not include those URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely due to the operator changing the scope definition).
long succeededFetchCount()
Number of successfully processed URIs. This includes URIs that returned 404s and other error codes that do not originate within the crawler.
long failedFetchCount()
Number of URIs that failed to process. These are URIs that could not be processed because of some error or failure in the processing chain. This can include failure to acquire prerequisites, failure to establish a connection with the host, and any number of other problems. Does not count those that will be retried, only those that have permanently failed.
long disregardedUriCount()
Number of URIs that were scheduled at one point but have been disregarded. Counts any URI that is scheduled only to be disregarded because it is determined to lie outside the scope of the crawl. Most commonly this will be due to robots.txt exclusions.
long totalBytesWritten()
void importRecoverLog(java.lang.String pathToLog, boolean retainFailures) throws java.io.IOException
Recover earlier state by reading a recovery log. Some Frontiers are able to write detailed logs that can be loaded after a system crash to recover the state of the Frontier prior to the crash. This method is the one used to achieve this.
Parameters:
pathToLog - The name (with full path) of the recover log.
retainFailures - If true, failures in the log should count as having been included. (If false, failures will be ignored, meaning the corresponding URIs will be retried in the recovered crawl.)
Throws:
java.io.IOException - If problems occur reading the recover log.
FrontierMarker getInitialMarker(java.lang.String regexpr, boolean inCacheOnly)
Get a URIFrontierMarker initialized with the given regular expression at the 'start' of the Frontier.
Parameters:
regexpr - The regular expression that URIs within the frontier must match to be considered within the scope of this marker.
inCacheOnly - If set to true, only those URIs within the frontier that are stored in cache (usually this means in memory rather than on disk, but that is an implementation detail) will be considered. Others will be entirely ignored, as if they don't exist. This is useful for quick peeks at the top of the URI list.
java.util.ArrayList<java.lang.String> getURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose) throws InvalidFrontierMarkerException
Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached.
Any encountered URI that has not been successfully crawled, terminally failed, disregarded or is currently being processed is included. As there may be duplicates in the frontier, there may also be duplicates in the report. Thus this includes both discovered and pending URIs.
The list is a set of strings containing the URI strings. If verbose is true the string will include some additional information (path to URI and parent).
The URIFrontierMarker will be advanced to the position at which its maximum number of matches found is reached. Reusing it for subsequent calls will thus effectively get the 'next' batch. Making any changes to the frontier can invalidate the marker.
While the order returned is consistent, it does not have any explicit relation to the likely order in which they may be processed.
Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.
Parameters:
marker - A marker specifying from what position in the Frontier the list should begin.
numberOfMatches - How many URIs to add at most to the list before returning it.
verbose - If set to true, the strings returned will contain additional information about each URI beyond their names.
Throws:
InvalidFrontierMarkerException - When the URIFrontierMarker does not match the internal state of the frontier. Tolerance for this can vary considerably from one URIFrontier implementation to the next.
See Also: FrontierMarker, getInitialMarker(String, boolean)
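The marker-based batching described for getURIsList can be sketched as follows. `MarkerSketch` and `PagingFrontierSketch` are stand-ins, not Heritrix classes; the sketch omits the `inCacheOnly` and `verbose` flags, uses plain strings for URIs, and assumes the regular expression must match the whole URI.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Stand-in marker: remembers the pattern and the next position to scan from.
class MarkerSketch {
    final Pattern pattern;
    int position = 0;

    MarkerSketch(String regexpr) { this.pattern = Pattern.compile(regexpr); }
}

// Toy frontier holding pending URIs in a single list.
class PagingFrontierSketch {
    private final List<String> pending = new ArrayList<>();

    void schedule(String uri) { pending.add(uri); }

    MarkerSketch getInitialMarker(String regexpr) { return new MarkerSketch(regexpr); }

    // Returns up to numberOfMatches matching URIs and advances the marker past them.
    ArrayList<String> getURIsList(MarkerSketch marker, int numberOfMatches) {
        ArrayList<String> out = new ArrayList<>();
        while (marker.position < pending.size() && out.size() < numberOfMatches) {
            String uri = pending.get(marker.position++);
            if (marker.pattern.matcher(uri).matches()) {
                out.add(uri);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        PagingFrontierSketch f = new PagingFrontierSketch();
        f.schedule("http://example.org/a.html");
        f.schedule("http://example.org/b.png");
        f.schedule("http://example.org/c.html");
        f.schedule("http://example.org/d.html");
        MarkerSketch marker = f.getInitialMarker(".*\\.html");
        // Reusing the same marker yields the next batch each time.
        List<String> batch = f.getURIsList(marker, 2);
        while (!batch.isEmpty()) {
            System.out.println(batch);
            batch = f.getURIsList(marker, 2);
        }
    }
}
```

In this toy, an empty batch signals that the marker has reached the end of the list. Nothing is modified between calls, in line with the warning about changing the frontier while listing.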
long deleteURIs(java.lang.String match)
Delete any URI that matches the given regular expression from the list of discovered and pending URIs.
Any encountered URI that has not been successfully crawled, terminally failed, disregarded or is currently being processed is considered to be a pending URI.
Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.
Parameters:
match - A regular expression; any URI that matches it will be deleted.
long deleteURIs(java.lang.String uriMatch, java.lang.String queueMatch)
Delete any URI that matches the given regular expression from the list of discovered and pending URIs, if it is in a queue with a name matching the second regular expression.
Any encountered URI that has not been successfully crawled, terminally failed, disregarded or is currently being processed is considered to be a pending URI.
Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.
Parameters:
uriMatch - A regular expression; any URI that matches will be deleted from the affected queues.
queueMatch - A regular expression; any queues matching will have their URIs checked. A null value means all queues.
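The two-pattern deletion can be sketched as below. `DeleteSketch` is a stand-in with a simple map of named queues. The sketch makes two assumptions that the text above does not state: that both regular expressions must match in full, and that the returned long is the number of URIs removed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Stand-in for a Frontier's internal queues: queue name -> pending URIs.
class DeleteSketch {
    private final Map<String, List<String>> queues = new LinkedHashMap<>();

    void schedule(String queueName, String uri) {
        queues.computeIfAbsent(queueName, k -> new ArrayList<>()).add(uri);
    }

    // Removes pending URIs matching uriMatch from queues whose name matches queueMatch
    // (null means all queues). Assumed here to return the number of URIs removed.
    long deleteURIs(String uriMatch, String queueMatch) {
        Pattern uriPattern = Pattern.compile(uriMatch);
        Pattern queuePattern = (queueMatch == null) ? null : Pattern.compile(queueMatch);
        long deleted = 0;
        for (Map.Entry<String, List<String>> e : queues.entrySet()) {
            if (queuePattern != null && !queuePattern.matcher(e.getKey()).matches()) {
                continue; // queue name does not match, leave its URIs alone
            }
            int before = e.getValue().size();
            e.getValue().removeIf(u -> uriPattern.matcher(u).matches());
            deleted += before - e.getValue().size();
        }
        return deleted;
    }

    public static void main(String[] args) {
        DeleteSketch f = new DeleteSketch();
        f.schedule("example.org", "http://example.org/a.pdf");
        f.schedule("example.org", "http://example.org/b.html");
        f.schedule("example.com", "http://example.com/c.pdf");
        System.out.println(f.deleteURIs(".*\\.pdf", "example\\.org")); // prints 1
        System.out.println(f.deleteURIs(".*\\.pdf", null));            // prints 1
    }
}
```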
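void deleted(CrawlURI curi)
Notify Frontier that a CrawlURI has been deleted outside of the normal next()/finished() lifecycle.
Parameters:
curi - Deleted CrawlURI.
void considerIncluded(UURI u)
Notify Frontier that it should consider the given UURI as if already scheduled.
Parameters:
u - UURI instance to add to the Already Included set.
void kickUpdate()
Notify Frontier that it should consider updating configuration info that may have changed in external files.
void pause()
Notify Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise.
void unpause()
Resumes the release of URIs to crawl, allowing worker ToeThreads to proceed.
void terminate()
Notify Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException.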
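FrontierJournal getFrontierJournal()
Returns:
The FrontierJournal that this Frontier is using. May be null if no journaling is in use.
java.lang.String getClassKey(CandidateURI cauri)
Parameters:
cauri - CandidateURI for which we're to calculate and set class key.
Returns:
The class key for cauri.
void loadSeeds()
Request that the Frontier load (or reload) crawl seeds, typically by contacting the Scope.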
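void start()
Request that Frontier allow crawling to begin.
Frontier.FrontierGroup getGroup(CrawlURI curi)
Get the 'frontier group' (usually queue) for the given CrawlURI.
Parameters:
curi - CrawlURI to find matching group
void finalTasks()
Perform any final tasks *before* notification that the crawl has reached 'FINISHED' status.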