org.archive.crawler.framework
Interface StatisticsTracking

All Superinterfaces:
java.lang.Runnable
All Known Implementing Classes:
AbstractTracker, StatisticsTracker

public interface StatisticsTracking
extends java.lang.Runnable

An interface for objects that want to collect statistics on running crawls. An implementation of this is referenced in the crawl order and loaded when the crawl begins.

It will be given a reference to the relevant CrawlController. The CrawlController will contain any additional configuration information needed.

Any class that implements this interface can be specified as a statistics tracker in a crawl order. The CrawlController will then create and initialize a copy of it and call its start() method.

This interface also specifies several methods for accessing data that the CrawlController or the URIFrontier may be interested in at run time but do not want to keep track of themselves. AbstractTracker implements these. If more than one StatisticsTracking class is defined in the crawl order, only the first one will be used to access this data.

It is recommended that an implementation register for CrawlStatus events and CrawlURIDisposition events so that it can properly monitor a crawl. Both kinds of listener are registered with the CrawlController.
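
The lifecycle described above (initialize with the controller, register for events, then run) can be sketched roughly as follows. This is only an illustration and does not use the real Heritrix classes: SketchController, SketchStatusListener, and SketchTracker are hypothetical stand-ins, and every method name on them is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a crawl status listener; not the Heritrix interface.
interface SketchStatusListener {
    void crawlEnded(String message);
}

// Hypothetical stand-in for the controller that owns the crawl.
class SketchController {
    private final List<SketchStatusListener> listeners = new ArrayList<>();

    void addStatusListener(SketchStatusListener listener) {
        listeners.add(listener);
    }

    // Notifies every registered listener that the crawl is over.
    void endCrawl(String message) {
        for (SketchStatusListener listener : listeners) {
            listener.crawlEnded(message);
        }
    }
}

// A tracker that is initialized first, registers itself for events, and is then run.
class SketchTracker implements Runnable, SketchStatusListener {
    private SketchController controller;
    private int snapshots;
    private String endMessage;

    void initialize(SketchController c) {
        if (c == null) {
            throw new IllegalArgumentException("controller is required");
        }
        this.controller = c;
        c.addStatusListener(this); // register before the crawl starts
    }

    @Override
    public void run() {
        // A real tracker would take periodic snapshots here; this sketch takes one.
        snapshots++;
    }

    @Override
    public void crawlEnded(String message) {
        endMessage = message;
    }

    int snapshotCount() { return snapshots; }

    String endMessage() { return endMessage; }
}
```

In this sketch the controller would call initialize first and only then hand the tracker to a thread, which mirrors the ordering stated in the text.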

Author:
Kristinn Sigurdsson
See Also:
AbstractTracker, CrawlStatusListener, CrawlURIDispositionListener, CrawlController

Field Summary
static java.lang.String SEED_DISPOSITION_DISREGARD
          Seed was disregarded
static java.lang.String SEED_DISPOSITION_FAILURE
          Failed to crawl seed
static java.lang.String SEED_DISPOSITION_NOT_PROCESSED
          Seed has not been processed
static java.lang.String SEED_DISPOSITION_RETRY
          Failed to crawl seed, will retry
static java.lang.String SEED_DISPOSITION_SUCCESS
          Seed successfully crawled
 
Method Summary
 int activeThreadCount()
          Get the number of active (non-paused) threads.
 long averageDepth()
           
 float congestionRatio()
           
 long crawlDuration()
          Returns how long the current crawl has been running (excluding any time spent paused/suspended/stopped) since it began.
 double currentProcessedDocsPerSec()
          Returns an estimate of recent document download rates based on a queue of recently seen CrawlURIs (as of last snapshot).
 int currentProcessedKBPerSec()
          Calculates an estimate of the rate, in KB, at which documents are currently being processed by the crawler.
 long deepestUri()
           
 long getCrawlerTotalElapsedTime()
          Total amount of time spent actively crawling so far.
 java.util.Map getProgressStatistics()
           
 java.lang.String getProgressStatisticsLine()
           
 java.util.Iterator getSeedRecordsSortedByStatusCode()
          Get a SeedRecord iterator for the job being monitored.
 void initialize(CrawlController c)
          Do initialization.
 void noteStart()
          Start the tracker's crawl timing.
 double processedDocsPerSec()
          Returns the number of documents that have been processed per second over the life of the crawl (as of last snapshot)
 long processedKBPerSec()
          Calculates the rate at which data, in KB, has been processed over the life of the crawl (as of last snapshot).
 java.lang.String progressStatisticsLegend()
           
 long successfullyFetchedCount()
          Number of successfully processed URIs.
 long totalBytesCrawled()
          Returns the total number of uncompressed bytes crawled.
 long totalBytesWritten()
          Deprecated. misnomer; use totalBytesCrawled instead
 long totalCount()
           
 
Methods inherited from interface java.lang.Runnable
run
 

Field Detail

SEED_DISPOSITION_SUCCESS

static final java.lang.String SEED_DISPOSITION_SUCCESS
Seed successfully crawled

See Also:
Constant Field Values

SEED_DISPOSITION_FAILURE

static final java.lang.String SEED_DISPOSITION_FAILURE
Failed to crawl seed

See Also:
Constant Field Values

SEED_DISPOSITION_RETRY

static final java.lang.String SEED_DISPOSITION_RETRY
Failed to crawl seed, will retry

See Also:
Constant Field Values

SEED_DISPOSITION_DISREGARD

static final java.lang.String SEED_DISPOSITION_DISREGARD
Seed was disregarded

See Also:
Constant Field Values

SEED_DISPOSITION_NOT_PROCESSED

static final java.lang.String SEED_DISPOSITION_NOT_PROCESSED
Seed has not been processed

See Also:
Constant Field Values
Method Detail

initialize

void initialize(CrawlController c)
                throws FatalConfigurationException
Do initialization. The CrawlController will call this method before calling the start() method.

Parameters:
c - The CrawlController running the crawl that this class is to gather statistics on.
Throws:
FatalConfigurationException

crawlDuration

long crawlDuration()
Returns how long the current crawl has been running (excluding any time spent paused/suspended/stopped) since it began.

Returns:
The length of time - in msec - that this crawl has been running.

noteStart

void noteStart()
Start the tracker's crawl timing.


totalBytesWritten

long totalBytesWritten()
Deprecated. misnomer; use totalBytesCrawled instead

Returns the total number of uncompressed bytes processed. Stored data may be much smaller due to compression or duplicate-reduction policies.

Returns:
The total number of uncompressed bytes written to disk

totalBytesCrawled

long totalBytesCrawled()
Returns the total number of uncompressed bytes crawled. Stored data may be much smaller due to compression or duplicate-reduction policies.

Returns:
The total number of uncompressed bytes crawled

getCrawlerTotalElapsedTime

long getCrawlerTotalElapsedTime()
Total amount of time spent actively crawling so far.

Returns the total amount of time (in milliseconds) that has elapsed from the start of the crawl until the current time or, if the crawl has ended, until the end of the crawl, minus any time spent paused.

Returns:
Total amount of time (in msec.) spent crawling so far.
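
One way to read the description above: the result is the span from crawl start to now (or to crawl end), minus the time spent paused. The following is a minimal sketch of that arithmetic, assuming timestamps in milliseconds are passed in explicitly. ActiveTimeSketch and its method names are hypothetical and are not part of Heritrix.

```java
// Hypothetical helper; all timestamps are milliseconds supplied by the caller.
final class ActiveTimeSketch {
    private long startTime = -1;    // when the crawl began, or -1 if not started
    private long endTime = -1;      // when the crawl ended, or -1 if still going
    private long pauseStarted = -1; // start of the current pause, or -1 if not paused
    private long totalPaused = 0;   // accumulated length of completed pauses

    void start(long now) {
        startTime = now;
    }

    void pause(long now) {
        if (pauseStarted < 0) {
            pauseStarted = now;
        }
    }

    void resume(long now) {
        if (pauseStarted >= 0) {
            totalPaused += now - pauseStarted;
            pauseStarted = -1;
        }
    }

    void end(long now) {
        resume(now); // close any pause that is still open
        endTime = now;
    }

    // Elapsed time from start to now (or to the end of the crawl), minus paused time.
    long activeMillis(long now) {
        if (startTime < 0) {
            return 0;
        }
        long upTo = (endTime >= 0) ? endTime : now;
        long paused = totalPaused;
        if (pauseStarted >= 0) {
            paused += upTo - pauseStarted; // count the pause in progress
        }
        return upTo - startTime - paused;
    }
}
```

Passing the clock value in, rather than reading it inside the helper, keeps the calculation easy to check by hand.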

currentProcessedDocsPerSec

double currentProcessedDocsPerSec()
Returns an estimate of recent document download rates based on a queue of recently seen CrawlURIs (as of last snapshot).

Returns:
The rate per second of documents gathered during the last snapshot
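
A "current" rate of this kind can be estimated from the difference between two consecutive snapshots. The sketch below assumes that approach; the text above only says the estimate is based on recently seen CrawlURIs, so the calculation in any given implementation may differ. SnapshotRateSketch and its parameters are hypothetical.

```java
// Hypothetical helper: rate between two snapshots of a running document count.
final class SnapshotRateSketch {
    private SnapshotRateSketch() {}

    // Documents processed between the snapshots, divided by the seconds between them.
    static double docsPerSec(long prevCount, long prevTimeMs,
                             long currCount, long currTimeMs) {
        long elapsedMs = currTimeMs - prevTimeMs;
        if (elapsedMs <= 0) {
            return 0.0; // no time has passed, so no rate can be computed
        }
        return (currCount - prevCount) * 1000.0 / elapsedMs;
    }
}
```

For example, 60 new documents over a 20 second interval gives 3.0 documents per second. The lifetime rate differs only in that it divides the total count by the total time spent crawling.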

processedDocsPerSec

double processedDocsPerSec()
Returns the number of documents that have been processed per second over the life of the crawl (as of last snapshot)

Returns:
The rate per second of documents gathered so far

processedKBPerSec

long processedKBPerSec()
Calculates the rate at which data, in KB, has been processed over the life of the crawl (as of last snapshot).

Returns:
The rate per second of KB gathered so far

currentProcessedKBPerSec

int currentProcessedKBPerSec()
Calculates an estimate of the rate, in KB, at which documents are currently being processed by the crawler (as of last snapshot). For more accurate estimates, set a larger queue size, or get and average multiple values.

Returns:
The rate per second of KB gathered during the last snapshot
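
The advice to get and average multiple values can be followed on the caller's side by keeping a small window of recent readings. This is a minimal sketch, assuming the caller samples the rate periodically. RateAverager is a hypothetical helper and is not part of Heritrix.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: moving average over the most recent rate samples.
final class RateAverager {
    private final int capacity;
    private final Deque<Integer> samples = new ArrayDeque<>();
    private long sum;

    RateAverager(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
    }

    // Records one reading and drops the oldest once the window is full.
    void add(int kbPerSec) {
        samples.addLast(kbPerSec);
        sum += kbPerSec;
        if (samples.size() > capacity) {
            sum -= samples.removeFirst();
        }
    }

    // Mean of the readings currently in the window, or 0.0 if there are none.
    double average() {
        return samples.isEmpty() ? 0.0 : (double) sum / samples.size();
    }
}
```

A caller would pass each reading of currentProcessedKBPerSec() to add and report average() instead of the raw value.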

activeThreadCount

int activeThreadCount()
Get the number of active (non-paused) threads.

Returns:
The number of active (non-paused) threads

successfullyFetchedCount

long successfullyFetchedCount()
Number of successfully processed URIs.

If the crawl is not running (paused or stopped), this will return the value of the last snapshot.

Returns:
The number of successfully fetched URIs
See Also:
Frontier.succeededFetchCount()
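
The fallback to the last snapshot can be pictured as a choice between a live source and a stored value. This is a minimal sketch under that assumption. SnapshotFallbackSketch is hypothetical, and the LongSupplier merely stands in for whatever supplies the live count.

```java
import java.util.function.LongSupplier;

// Hypothetical helper: live value while running, last snapshot otherwise.
final class SnapshotFallbackSketch {
    private final LongSupplier liveCount;
    private long lastSnapshot;
    private boolean running = true;

    SnapshotFallbackSketch(LongSupplier liveCount) {
        this.liveCount = liveCount;
    }

    // Stores the current live value for use while the crawl is not running.
    void takeSnapshot() {
        lastSnapshot = liveCount.getAsLong();
    }

    void setRunning(boolean running) {
        this.running = running;
    }

    long successfullyFetchedCount() {
        return running ? liveCount.getAsLong() : lastSnapshot;
    }
}
```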

totalCount

long totalCount()
Returns:
Total number of URIs (processed + queued + currently being processed)

congestionRatio

float congestionRatio()

deepestUri

long deepestUri()

averageDepth

long averageDepth()

getSeedRecordsSortedByStatusCode

java.util.Iterator getSeedRecordsSortedByStatusCode()
Get a SeedRecord iterator for the job being monitored. If the job is no longer running, stored values will be returned. If the job is running, the current seed iterator will be fetched and the stored values will be updated.

Sort order is:
No status code (not processed)
Status codes smaller than 0 (largest to smallest)
Status codes larger than 0 (largest to smallest)

Note: This iterator will iterate over a list of SeedRecords.

Returns:
the seed iterator
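
The sort order listed above can be expressed as a two-level comparator: first by group, then descending within the group. The sketch below works on bare integer status codes and assumes that a code of 0 means "no status code"; that assumption is not stated in the text. SeedStatusOrderSketch is hypothetical and does not use the real SeedRecord class.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: orders status codes as described in the sort order above.
final class SeedStatusOrderSketch {
    private SeedStatusOrderSketch() {}

    // 0 = no status code (assumed to be code 0), 1 = negative codes, 2 = positive codes.
    private static int group(int statusCode) {
        if (statusCode == 0) {
            return 0;
        }
        return statusCode < 0 ? 1 : 2;
    }

    // Group first, then largest to smallest inside each group.
    static final Comparator<Integer> ORDER =
            Comparator.<Integer>comparingInt(SeedStatusOrderSketch::group)
                    .thenComparing(Comparator.<Integer>reverseOrder());

    // Returns a sorted copy and leaves the input list untouched.
    static List<Integer> sorted(List<Integer> statusCodes) {
        List<Integer> copy = new ArrayList<>(statusCodes);
        copy.sort(ORDER);
        return copy;
    }
}
```

With the codes 200, -1, 0, 404, -5 as input, this ordering yields 0, -1, -5, 404, 200.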

progressStatisticsLegend

java.lang.String progressStatisticsLegend()
Returns:
legend of progress-statistics

getProgressStatisticsLine

java.lang.String getProgressStatisticsLine()
Returns:
line of progress-statistics

getProgressStatistics

java.util.Map getProgressStatistics()
Returns:
Map of progress-statistics.


Copyright © 2003-2011 Internet Archive. All Rights Reserved.