org.archive.crawler.admin
Class StatisticsTracker

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.AbstractTracker
                      extended by org.archive.crawler.admin.StatisticsTracker
All Implemented Interfaces:
java.io.Serializable, java.lang.Runnable, javax.management.DynamicMBean, CrawlStatusListener, CrawlURIDispositionListener, StatisticsTracking

public class StatisticsTracker
extends AbstractTracker
implements CrawlURIDispositionListener, java.io.Serializable

This is an implementation of AbstractTracker. It is designed to function with the WUI as well as to perform various logging activities.

At the end of each snapshot interval a line is written to the 'progress-statistics.log' file.

The header of that file is as follows:

 [timestamp] [discovered]    [queued] [downloaded] [doc/s(avg)]  [KB/s(avg)] [dl-failures] [busy-thread] [mem-use-KB]
First there is a timestamp, accurate down to 1 second.

discovered, queued, downloaded and dl-failures are (respectively) the discovered URI count, pending URI count, successfully fetched count and failed fetch count from the frontier at the time of the snapshot.

KB/s(avg) is the bandwidth usage. We use the total bytes downloaded to calculate the average bandwidth usage (KB/sec). Since we also note the value each time a snapshot is made, we can calculate the average bandwidth usage during the last snapshot period to obtain a "current" rate. The first number is the current rate and the average is in parentheses.

doc/s(avg) works the same way as KB/s(avg), except it shows the number of documents (URIs) rather than KB downloaded.

busy-threads is the total number of ToeThreads that are not available (and thus presumably busy processing a URI). This information is extracted from the crawl controller.

Finally mem-use-KB is extracted from the run time environment (Runtime.getRuntime().totalMemory()).

In addition to the data collected for the above logs, various other data is gathered and stored by this tracker.
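The current-versus-average rate calculation described above can be sketched as follows. This is a minimal illustration of the snapshot-delta technique, not the tracker's actual internals; `RateSnapshot` and its fields are hypothetical names.

```java
/**
 * Minimal sketch of the snapshot-delta rate calculation described above.
 * RateSnapshot and its fields are illustrative names, not StatisticsTracker's
 * actual internals.
 */
public class RateSnapshot {
    private long lastBytes;          // total bytes at the previous snapshot
    private long lastTimeMs;         // time of the previous snapshot
    private final long startTimeMs;  // crawl start time

    public RateSnapshot(long startTimeMs) {
        this.startTimeMs = startTimeMs;
        this.lastTimeMs = startTimeMs;
    }

    /** Take a snapshot; returns { currentKBPerSec, averageKBPerSec }. */
    public long[] snapshot(long nowMs, long totalBytes) {
        long deltaBytes = totalBytes - lastBytes;   // bytes since last snapshot
        long deltaMs = nowMs - lastTimeMs;          // last snapshot period
        long current = deltaMs > 0 ? (deltaBytes / 1024) * 1000 / deltaMs : 0;
        long elapsedMs = nowMs - startTimeMs;       // whole-crawl elapsed time
        long average = elapsedMs > 0 ? (totalBytes / 1024) * 1000 / elapsedMs : 0;
        lastBytes = totalBytes;                     // note values for next period
        lastTimeMs = nowMs;
        return new long[] { current, average };
    }
}
```

Because the previous totals are noted at every snapshot, the "current" rate reflects only the last period, while the average covers the whole crawl.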

Author:
Parker Thompson, Kristinn Sigurdsson
See Also:
StatisticsTracking, AbstractTracker, Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
protected  long averageDepth
           
protected  int busyThreads
           
protected  float congestionRatio
           
protected  CrawledBytesHistotable crawledBytes
          Tally of sizes: novel, verified (same hash), vouched (not-modified)
protected  double currentDocsPerSecond
           
protected  int currentKBPerSec
           
protected  long deepestUri
           
protected  long discoveredUriCount
           
protected  double docsPerSecond
           
protected  long downloadDisregards
           
protected  long downloadedUriCount
           
protected  long downloadFailures
           
protected  long dupByHashUriCount
           
protected  long finishedUriCount
           
protected  ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsBytes
           
protected  ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsDistribution
          Keep track of hosts.
protected  ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsLastFinished
           
protected  long lastPagesFetchedCount
           
protected  long lastProcessedBytesCount
           
protected  java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeBytes
           
protected  java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeDistribution
          Keep track of the file types we see (mime type -> count)
protected  long notModifiedUriCount
           
protected  long novelUriCount
           
protected  ObjectIdentityCache<java.lang.String,SeedRecord> processedSeedsRecords
          Record of seeds' latest actions.
protected  long queuedUriCount
           
protected  ObjectIdentityCache<java.lang.String,java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong>> sourceHostDistribution
          Keep track of URL counts per host per seed
protected  java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> statusCodeDistribution
          Keep track of fetch status codes
protected  long totalKBPerSec
           
protected  long totalProcessedBytes
           
 
Fields inherited from class org.archive.crawler.framework.AbstractTracker
ATTR_STATS_INTERVAL, controller, crawlerEndTime, crawlerPauseStarted, crawlerStartTime, crawlerTotalPausedTime, DEFAULT_STATISTICS_REPORT_INTERVAL, lastLogPointTime, shouldrun
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.framework.StatisticsTracking
SEED_DISPOSITION_DISREGARD, SEED_DISPOSITION_FAILURE, SEED_DISPOSITION_NOT_PROCESSED, SEED_DISPOSITION_RETRY, SEED_DISPOSITION_SUCCESS
 
Constructor Summary
StatisticsTracker(java.lang.String name)
           
 
Method Summary
 int activeThreadCount()
          Get the number of active (non-paused) threads.
 long averageDepth()
          Average depth of the last URI in all eligible queues.
 float congestionRatio()
          Ratio of number of threads that would theoretically allow maximum crawl progress (if each was as productive as current threads), to current number of threads.
 void crawlCheckpoint(java.io.File cpDir)
          Called by CrawlController when checkpointing.
 java.lang.String crawledBytesSummary()
           
 void crawledURIDisregard(CrawlURI curi)
          Notification of a crawled URI that is to be disregarded.
 void crawledURIFailure(CrawlURI curi)
          Notification of a failed crawling of a URI.
 void crawledURINeedRetry(CrawlURI curi)
          Notification of a failed crawl of a URI that will be retried (failure due to possible transient problems).
 void crawledURISuccessful(CrawlURI curi)
          Notification of a successfully crawled URI
 void crawlEnded(java.lang.String message)
          Called when a CrawlController has ended a crawl and is about to exit.
 double currentProcessedDocsPerSec()
          Returns an estimate of recent document download rates based on a queue of recently seen CrawlURIs (as of last snapshot).
 int currentProcessedKBPerSec()
          Calculates an estimate of the rate, in kb, at which documents are currently being processed by the crawler.
 long deepestUri()
          Ordinal position of the 'deepest' URI eligible for crawling.
 long discoveredUriCount()
          Number of discovered URIs.
 long disregardedFetchAttempts()
          Get the total number of disregarded fetch attempts (e.g. robots.txt exclusions)
 void dumpReports()
          Run the reports.
 long failedFetchAttempts()
          Get the total number of failed fetch attempts (connection failures -> give up, etc)
protected  void finalCleanup()
          Cleanup resources used, at crawl end.
 long finishedUriCount()
          Number of URIs that have finished processing.
protected  java.lang.String fixup(java.lang.String hostName)
           
 long getBytesPerFileType(java.lang.String filetype)
          Returns the accumulated number of bytes from files of a given file type.
 long getBytesPerHost(java.lang.String host)
          Returns the accumulated number of bytes downloaded from a given host.
 java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> getFileDistribution()
          Returns a HashMap that contains information about distributions of encountered mime types.
 java.util.concurrent.atomic.AtomicLong getHostLastFinished(java.lang.String host)
          Returns the time (in millisec) when a URI belonging to a given host was last finished processing.
 java.util.Map<java.lang.String,java.lang.Number> getProgressStatistics()
           
 java.lang.String getProgressStatisticsLine()
          Return one line of current progress-statistics
 java.lang.String getProgressStatisticsLine(java.util.Date now)
          Return one line of current progress-statistics
 java.util.TreeMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedCopy(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> mapOfAtomicLongValues)
          Sort the entries of the given HashMap in descending order by their values, which must be longs wrapped with AtomicLong.
 java.util.TreeMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedCopy(ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> mapOfAtomicLongValues)
          Sort the entries of the given ObjectIdentityCache in descending order by their values, which must be longs wrapped with AtomicLong.
 java.util.SortedMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedHostCounts(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostCounts)
          Return a copy of the hosts distribution in reverse-sorted (largest first) order.
 java.util.SortedMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedHostsDistribution()
          Return a copy of the hosts distribution in reverse-sorted (largest first) order.
 java.util.Iterator<SeedRecord> getSeedRecordsSortedByStatusCode()
          Get a SeedRecord iterator for the job being monitored.
protected  java.util.Iterator<SeedRecord> getSeedRecordsSortedByStatusCode(java.util.Iterator<java.lang.String> i)
           
 java.util.Iterator<java.lang.String> getSeeds()
          Get a seed iterator for the job being monitored.
 java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> getStatusCodeDistribution()
          Return a HashMap representing the distribution of status codes for successfully fetched curis, as represented by a hashmap where key -> val represents (string)code -> (integer)count.
protected static void incrementCacheCount(ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> cache, java.lang.String key)
          Increment a counter for a key in a given cache.
protected static void incrementCacheCount(ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> cache, java.lang.String key, long increment)
          Increment a counter for a key in a given cache by an arbitrary amount.
protected static void incrementMapCount(java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> map, java.lang.String key)
          Increment a counter for a key in a given HashMap.
protected static void incrementMapCount(java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> map, java.lang.String key, long increment)
          Increment a counter for a key in a given HashMap by an arbitrary amount.
 void initialize(CrawlController c)
          Sets up the Logger (including logInterval) and registers with the CrawlController for CrawlStatus and CrawlURIDisposition events.
 int percentOfDiscoveredUrisCompleted()
          This returns the number of completed URIs as a percentage of the total number of URIs encountered (should be inverse to the discovery curve)
 double processedDocsPerSec()
          Returns the number of documents that have been processed per second over the life of the crawl (as of last snapshot)
 long processedKBPerSec()
          Calculates the rate that data, in kb, has been processed over the life of the crawl (as of last snapshot.)
protected  void progressStatisticsEvent(java.util.EventObject e)
          A method for logging current crawler state.
 long queuedUriCount()
          Number of URIs queued up and waiting for processing.
protected  void saveHostStats(java.lang.String hostname, long size)
           
protected  void saveSourceStats(java.lang.String source, java.lang.String hostname)
           
 long successfullyFetchedCount()
          Number of successfully processed URIs.
 int threadCount()
          Get the total number of ToeThreads (sleeping and active)
 long totalBytesCrawled()
          Returns the total number of uncompressed bytes crawled.
 long totalBytesWritten()
          Deprecated. use totalBytesCrawled
 long totalCount()
           
protected  void writeCrawlReportTo(java.io.PrintWriter writer)
           
protected  void writeFrontierReportTo(java.io.PrintWriter writer)
          Write the Frontier's 'nonempty' report (if available)
protected  void writeHostsReportTo(java.io.PrintWriter writer)
           
protected  void writeManifestReportTo(java.io.PrintWriter writer)
           
protected  void writeMimetypesReportTo(java.io.PrintWriter writer)
           
protected  void writeProcessorsReportTo(java.io.PrintWriter writer)
           
protected  void writeReportFile(java.lang.String reportName, java.lang.String filename)
           
protected  void writeReportLine(java.io.PrintWriter writer, java.lang.Object... fields)
           
protected  void writeResponseCodeReportTo(java.io.PrintWriter writer)
           
protected  void writeSeedsReportTo(java.io.PrintWriter writer)
           
protected  void writeSourceReportTo(java.io.PrintWriter writer)
           
 
Methods inherited from class org.archive.crawler.framework.AbstractTracker
crawlDuration, crawlEnding, crawlPaused, crawlPausing, crawlResuming, crawlStarted, getCrawlEndTime, getCrawlerTotalElapsedTime, getCrawlPauseStartedTime, getCrawlStartTime, getCrawlTotalPauseTime, getLogWriteInterval, logNote, noteStart, progressStatisticsLegend, run, tallyCurrentPause
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

lastPagesFetchedCount

protected long lastPagesFetchedCount

lastProcessedBytesCount

protected long lastProcessedBytesCount

discoveredUriCount

protected long discoveredUriCount

queuedUriCount

protected long queuedUriCount

finishedUriCount

protected long finishedUriCount

downloadedUriCount

protected long downloadedUriCount

downloadFailures

protected long downloadFailures

downloadDisregards

protected long downloadDisregards

docsPerSecond

protected double docsPerSecond

currentDocsPerSecond

protected double currentDocsPerSecond

currentKBPerSec

protected int currentKBPerSec

totalKBPerSec

protected long totalKBPerSec

busyThreads

protected int busyThreads

totalProcessedBytes

protected long totalProcessedBytes

congestionRatio

protected float congestionRatio

deepestUri

protected long deepestUri

averageDepth

protected long averageDepth

crawledBytes

protected CrawledBytesHistotable crawledBytes
Tally of sizes: novel, verified (same hash), vouched (not-modified)


notModifiedUriCount

protected long notModifiedUriCount

dupByHashUriCount

protected long dupByHashUriCount

novelUriCount

protected long novelUriCount

mimeTypeDistribution

protected java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeDistribution
Keep track of the file types we see (mime type -> count)


mimeTypeBytes

protected java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeBytes

statusCodeDistribution

protected java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> statusCodeDistribution
Keep track of fetch status codes


hostsDistribution

protected transient ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsDistribution
Keep track of hosts.

These fields are transient because they are usually big maps that get reconstituted on recovery from a checkpoint.


hostsBytes

protected transient ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsBytes

hostsLastFinished

protected transient ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsLastFinished

sourceHostDistribution

protected transient ObjectIdentityCache<java.lang.String,java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong>> sourceHostDistribution
Keep track of URL counts per host per seed


processedSeedsRecords

protected transient ObjectIdentityCache<java.lang.String,SeedRecord> processedSeedsRecords
Record of seeds' latest actions.

Constructor Detail

StatisticsTracker

public StatisticsTracker(java.lang.String name)
Method Detail

initialize

public void initialize(CrawlController c)
                throws FatalConfigurationException
Description copied from class: AbstractTracker
Sets up the Logger (including logInterval) and registers with the CrawlController for CrawlStatus and CrawlURIDisposition events.

Specified by:
initialize in interface StatisticsTracking
Overrides:
initialize in class AbstractTracker
Parameters:
c - A crawl controller instance.
Throws:
FatalConfigurationException - Not thrown here. Declared for overrides that consult the settings system for configuration.
See Also:
CrawlStatusListener, CrawlURIDispositionListener

finalCleanup

protected void finalCleanup()
Description copied from class: AbstractTracker
Cleanup resources used, at crawl end.

Overrides:
finalCleanup in class AbstractTracker

progressStatisticsEvent

protected void progressStatisticsEvent(java.util.EventObject e)
Description copied from class: AbstractTracker
A method for logging current crawler state. This method is called by run() at intervals specified in the crawl order file. It is also invoked when pausing or stopping a crawl, to capture the state at that point. The default behavior is a call to CrawlController.logProgressStatistics(java.lang.String) so that CrawlController can act on the progress-statistics event.

Implementations of this method should carefully consider whether it needs to be synchronized, in whole or in part.

Overrides:
progressStatisticsEvent in class AbstractTracker
Parameters:
e - Progress statistics event.

getProgressStatisticsLine

public java.lang.String getProgressStatisticsLine(java.util.Date now)
Return one line of current progress-statistics

Parameters:
now -
Returns:
String of stats

getProgressStatistics

public java.util.Map<java.lang.String,java.lang.Number> getProgressStatistics()
Specified by:
getProgressStatistics in interface StatisticsTracking
Returns:
Map of progress-statistics.

getProgressStatisticsLine

public java.lang.String getProgressStatisticsLine()
Return one line of current progress-statistics

Specified by:
getProgressStatisticsLine in interface StatisticsTracking
Returns:
String of stats

processedDocsPerSec

public double processedDocsPerSec()
Description copied from interface: StatisticsTracking
Returns the number of documents that have been processed per second over the life of the crawl (as of last snapshot)

Specified by:
processedDocsPerSec in interface StatisticsTracking
Returns:
The rate per second of documents gathered so far

currentProcessedDocsPerSec

public double currentProcessedDocsPerSec()
Description copied from interface: StatisticsTracking
Returns an estimate of recent document download rates based on a queue of recently seen CrawlURIs (as of last snapshot).

Specified by:
currentProcessedDocsPerSec in interface StatisticsTracking
Returns:
The rate per second of documents gathered during the last snapshot

processedKBPerSec

public long processedKBPerSec()
Description copied from interface: StatisticsTracking
Calculates the rate that data, in kb, has been processed over the life of the crawl (as of last snapshot.)

Specified by:
processedKBPerSec in interface StatisticsTracking
Returns:
The rate per second of KB gathered so far

currentProcessedKBPerSec

public int currentProcessedKBPerSec()
Description copied from interface: StatisticsTracking
Calculates an estimate of the rate, in KB, at which documents are currently being processed by the crawler. For more accurate estimates, set a larger queue size, or fetch and average multiple values (as of last snapshot).

Specified by:
currentProcessedKBPerSec in interface StatisticsTracking
Returns:
The rate per second of KB gathered during the last snapshot

getFileDistribution

public java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> getFileDistribution()
Returns a HashMap that contains information about distributions of encountered mime types. Key/value pairs represent mime type -> count.

Note: all values are wrapped in an AtomicLong

Returns:
mimeTypeDistribution

incrementMapCount

protected static void incrementMapCount(java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> map,
                                        java.lang.String key)
Increment a counter for a key in a given HashMap. Used for various aggregate data. As this is used to change Maps which depend on StatisticsTracker for their synchronization, this method should only be invoked from a block synchronized on 'this'.

Parameters:
map - The HashMap
key - The key for the counter to be incremented; if it does not exist, it is added (set to 1). If null, the counter "unknown" is incremented.
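The counter-increment idiom this method describes can be sketched as follows. This is an assumed reconstruction, not the actual Heritrix source; `MapCounter` is a hypothetical name.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of the counter-increment idiom described above (an assumed
 * reconstruction, not the Heritrix source): missing keys start at the
 * increment amount, null keys fall back to an "unknown" bucket.
 */
public class MapCounter {
    public static void increment(
            ConcurrentMap<String, AtomicLong> map, String key, long amount) {
        if (key == null) {
            key = "unknown"; // null keys are tallied under "unknown"
        }
        AtomicLong counter = map.get(key);
        if (counter == null) {
            AtomicLong prior = map.putIfAbsent(key, new AtomicLong(amount));
            if (prior == null) {
                return; // we installed the initial value atomically
            }
            counter = prior; // another thread won the race; add to its counter
        }
        counter.addAndGet(amount);
    }
}
```

Usage mirrors the tracker's tallies, e.g. `MapCounter.increment(mimeTypeDistribution, "text/html", 1)` for a count or a byte total for `mimeTypeBytes`.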

incrementCacheCount

protected static void incrementCacheCount(ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> cache,
                                          java.lang.String key)
Increment a counter for a key in a given cache. Used for various aggregate data.

Parameters:
cache - the ObjectIdentityCache
key - The key for the counter to be incremented; if it does not exist, it is added (set to 1). If null, the counter "unknown" is incremented.

incrementCacheCount

protected static void incrementCacheCount(ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> cache,
                                          java.lang.String key,
                                          long increment)
Increment a counter for a key in a given cache by an arbitrary amount. Used for various aggregate data. The increment amount can be negative.

Parameters:
cache - The ObjectIdentityCache
key - The key for the counter to be incremented; if it does not exist, it is added (set equal to increment). If null, the counter "unknown" is incremented.
increment - The amount to increment counter related to the key.

incrementMapCount

protected static void incrementMapCount(java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> map,
                                        java.lang.String key,
                                        long increment)
Increment a counter for a key in a given HashMap by an arbitrary amount. Used for various aggregate data. The increment amount can be negative.

Parameters:
map - The Map or ConcurrentMap
key - The key for the counter to be incremented; if it does not exist, it is added (set equal to increment). If null, the counter "unknown" is incremented.
increment - The amount to increment counter related to the key.

getReverseSortedCopy

public java.util.TreeMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedCopy(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> mapOfAtomicLongValues)
Sort the entries of the given HashMap in descending order by their values, which must be longs wrapped with AtomicLong.

Elements are sorted by value from largest to smallest. Equal values are sorted in an arbitrary, but consistent manner by their keys. Only items with identical value and key are considered equal. If the passed-in map requires access to be synchronized, the caller should ensure this synchronization.

Parameters:
mapOfAtomicLongValues - Assumes values are wrapped with AtomicLong.
Returns:
a sorted set containing the same elements as the map.
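The described ordering (largest value first, consistent key tie-break so equal counts are both retained) can be sketched as follows. This is an assumed implementation, not the Heritrix source; `ReverseSort` is a hypothetical name.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of the sort described above (an assumed implementation, not the
 * Heritrix source): a TreeMap ordered largest-value-first, breaking ties
 * consistently by key so distinct keys with equal counts are both kept.
 */
public class ReverseSort {
    public static TreeMap<String, AtomicLong> reverseSortedCopy(
            final Map<String, AtomicLong> source) {
        Comparator<String> byValueDescThenKey = new Comparator<String>() {
            public int compare(String a, String b) {
                long va = source.get(a).get();
                long vb = source.get(b).get();
                if (va != vb) {
                    return va > vb ? -1 : 1; // larger values sort first
                }
                return a.compareTo(b); // consistent tie-break keeps both keys
            }
        };
        TreeMap<String, AtomicLong> sorted = new TreeMap<>(byValueDescThenKey);
        sorted.putAll(source);
        return sorted;
    }
}
```

Note the comparator reads values from the source map, so the returned copy should only be used for iteration and reporting, and (as the Javadoc above says) the caller must handle any synchronization the source map requires.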

getReverseSortedCopy

public java.util.TreeMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedCopy(ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> mapOfAtomicLongValues)
Sort the entries of the given ObjectIdentityCache in descending order by their values, which must be longs wrapped with AtomicLong.

Elements are sorted by value from largest to smallest. Equal values are sorted in an arbitrary, but consistent manner by their keys. Only items with identical value and key are considered equal. If the passed-in map requires access to be synchronized, the caller should ensure this synchronization.

Parameters:
mapOfAtomicLongValues - Assumes values are wrapped with AtomicLong.
Returns:
a sorted set containing the same elements as the map.

getStatusCodeDistribution

public java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> getStatusCodeDistribution()
Return a HashMap representing the distribution of status codes for successfully fetched curis, as a map where key -> val represents (string)code -> (integer)count. Note: all values are wrapped in an AtomicLong

Returns:
statusCodeDistribution

getHostLastFinished

public java.util.concurrent.atomic.AtomicLong getHostLastFinished(java.lang.String host)
Returns the time (in millisec) when a URI belonging to a given host was last finished processing.

Parameters:
host - The host to look up time of last completed URI.
Returns:
Returns the time (in milliseconds) when a URI belonging to the given host was last finished processing. If no URI has been completed for the host, -1 is returned.

getBytesPerHost

public long getBytesPerHost(java.lang.String host)
Returns the accumulated number of bytes downloaded from a given host.

Parameters:
host - name of the host
Returns:
the accumulated number of bytes downloaded from a given host

getBytesPerFileType

public long getBytesPerFileType(java.lang.String filetype)
Returns the accumulated number of bytes from files of a given file type.

Parameters:
filetype - Filetype to check.
Returns:
the accumulated number of bytes from files of a given mime type

threadCount

public int threadCount()
Get the total number of ToeThreads (sleeping and active)

Returns:
The total number of ToeThreads

activeThreadCount

public int activeThreadCount()
Description copied from interface: StatisticsTracking
Get the number of active (non-paused) threads.

Specified by:
activeThreadCount in interface StatisticsTracking
Returns:
Current thread count (or zero if it cannot be determined).

percentOfDiscoveredUrisCompleted

public int percentOfDiscoveredUrisCompleted()
This returns the number of completed URIs as a percentage of the total number of URIs encountered (should be inverse to the discovery curve)

Returns:
The number of completed URIs as a percentage of the total number of URIs encountered
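The percentage above is straightforward arithmetic; a minimal sketch follows. The formula (finished over discovered) is an assumption drawn from the description, and `CompletionPercent` is a hypothetical helper, not part of this class.

```java
/**
 * Illustrative arithmetic for the completion percentage described above.
 * The formula (finished / discovered) is assumed from the description,
 * not taken from the Heritrix source.
 */
public class CompletionPercent {
    public static int percentCompleted(long finishedUriCount,
            long discoveredUriCount) {
        if (discoveredUriCount == 0) {
            return 0; // avoid divide-by-zero before any URIs are discovered
        }
        return (int) (100 * finishedUriCount / discoveredUriCount);
    }
}
```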

discoveredUriCount

public long discoveredUriCount()
Number of discovered URIs.

If crawl not running (paused or stopped) this will return the value of the last snapshot.

Returns:
A count of all uris encountered
See Also:
Frontier.discoveredUriCount()

finishedUriCount

public long finishedUriCount()
Number of URIs that have finished processing.

Returns:
Number of URIs that have finished processing
See Also:
Frontier.finishedUriCount()

failedFetchAttempts

public long failedFetchAttempts()
Get the total number of failed fetch attempts (connection failures -> give up, etc)

Returns:
The total number of failed fetch attempts

disregardedFetchAttempts

public long disregardedFetchAttempts()
Get the total number of disregarded fetch attempts (e.g. robots.txt exclusions)

Returns:
The total number of disregarded fetch attempts

successfullyFetchedCount

public long successfullyFetchedCount()
Description copied from interface: StatisticsTracking
Number of successfully processed URIs.

If crawl not running (paused or stopped) this will return the value of the last snapshot.

Specified by:
successfullyFetchedCount in interface StatisticsTracking
Returns:
The number of successfully fetched URIs
See Also:
Frontier.succeededFetchCount()

totalCount

public long totalCount()
Specified by:
totalCount in interface StatisticsTracking
Returns:
Total number of URIs (processed + queued + currently being processed)

congestionRatio

public float congestionRatio()
Ratio of number of threads that would theoretically allow maximum crawl progress (if each was as productive as current threads), to current number of threads.

Specified by:
congestionRatio in interface StatisticsTracking
Returns:
float congestion ratio

deepestUri

public long deepestUri()
Ordinal position of the 'deepest' URI eligible for crawling. Essentially, the length of the longest frontier internal queue.

Specified by:
deepestUri in interface StatisticsTracking
Returns:
long URI count to deepest URI

averageDepth

public long averageDepth()
Average depth of the last URI in all eligible queues. That is, the average length of all eligible queues.

Specified by:
averageDepth in interface StatisticsTracking
Returns:
long average depth of last URIs in queues

queuedUriCount

public long queuedUriCount()
Number of URIs queued up and waiting for processing.

If crawl not running (paused or stopped) this will return the value of the last snapshot.

Returns:
Number of URIs queued up and waiting for processing.
See Also:
Frontier.queuedUriCount()

totalBytesWritten

public long totalBytesWritten()
Deprecated. use totalBytesCrawled

Description copied from interface: StatisticsTracking
Returns the total number of uncompressed bytes processed. Stored data may be much smaller due to compression or duplicate-reduction policies.

Specified by:
totalBytesWritten in interface StatisticsTracking
Returns:
The total number of uncompressed bytes written to disk

totalBytesCrawled

public long totalBytesCrawled()
Description copied from interface: StatisticsTracking
Returns the total number of uncompressed bytes crawled. Stored data may be much smaller due to compression or duplicate-reduction policies.

Specified by:
totalBytesCrawled in interface StatisticsTracking
Returns:
The total number of uncompressed bytes crawled

crawledBytesSummary

public java.lang.String crawledBytesSummary()

crawledURISuccessful

public void crawledURISuccessful(CrawlURI curi)
Description copied from interface: CrawlURIDispositionListener
Notification of a successfully crawled URI

Specified by:
crawledURISuccessful in interface CrawlURIDispositionListener
Parameters:
curi - The relevant CrawlURI

saveSourceStats

protected void saveSourceStats(java.lang.String source,
                               java.lang.String hostname)

saveHostStats

protected void saveHostStats(java.lang.String hostname,
                             long size)

crawledURINeedRetry

public void crawledURINeedRetry(CrawlURI curi)
Description copied from interface: CrawlURIDispositionListener
Notification of a failed crawl of a URI that will be retried (failure due to possible transient problems).

Specified by:
crawledURINeedRetry in interface CrawlURIDispositionListener
Parameters:
curi - The relevant CrawlURI

crawledURIDisregard

public void crawledURIDisregard(CrawlURI curi)
Description copied from interface: CrawlURIDispositionListener
Notification of a crawled URI that is to be disregarded. Usually this means that the robots.txt file for the relevant site forbids the URI from being crawled and it is therefore not kept. Other reasons may apply. In all cases this means that the URI was successfully downloaded but will not be stored.

Specified by:
crawledURIDisregard in interface CrawlURIDispositionListener
Parameters:
curi - The relevant CrawlURI

crawledURIFailure

public void crawledURIFailure(CrawlURI curi)
Description copied from interface: CrawlURIDispositionListener
Notification of a failed crawl of a URI. The failure is of a type that precludes retries (either by its very nature or because the URI has been retried too many times).

Specified by:
crawledURIFailure in interface CrawlURIDispositionListener
Parameters:
curi - The relevant CrawlURI
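The four disposition callbacks above (successful, need-retry, disregard, failure) each feed a separate tally. A minimal self-contained sketch in the spirit of a CrawlURIDispositionListener implementation (class, field, and parameter names here are hypothetical, and a plain String stands in for CrawlURI):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: each disposition notification bumps its own counter,
// mirroring how a tracker can accumulate per-disposition statistics.
public class DispositionCounters {
    final AtomicLong successes = new AtomicLong();
    final AtomicLong retries = new AtomicLong();
    final AtomicLong disregards = new AtomicLong();
    final AtomicLong failures = new AtomicLong();

    public void crawledURISuccessful(String uri) { successes.incrementAndGet(); }
    public void crawledURINeedRetry(String uri)  { retries.incrementAndGet(); }
    public void crawledURIDisregard(String uri)  { disregards.incrementAndGet(); }
    public void crawledURIFailure(String uri)    { failures.incrementAndGet(); }

    public static void main(String[] args) {
        DispositionCounters c = new DispositionCounters();
        c.crawledURISuccessful("http://example.com/");
        c.crawledURIFailure("http://example.com/missing");
        System.out.println(c.successes.get() + " " + c.failures.get()); // prints "1 1"
    }
}
```

AtomicLong is used because disposition notifications may arrive from many ToeThreads concurrently.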

getSeeds

public java.util.Iterator<java.lang.String> getSeeds()
Get a seed iterator for the job being monitored. Note: This iterator will iterate over a list of Strings, not UURIs as the Scope seed iterator does. The strings are equal to the URIs' getURIString() values.

Returns:
the seed iterator FIXME: Consider using TransformingIterator here

getSeedRecordsSortedByStatusCode

public java.util.Iterator<SeedRecord> getSeedRecordsSortedByStatusCode()
Description copied from interface: StatisticsTracking
Get a SeedRecord iterator for the job being monitored. If the job is no longer running, stored values will be returned. If the job is running, the current seed iterator will be fetched and stored values will be updated.

Sort order is:
No status code (not processed)
Status codes smaller than 0 (largest to smallest)
Status codes larger than 0 (largest to smallest)

Note: This iterator will iterate over a list of SeedRecords.

Specified by:
getSeedRecordsSortedByStatusCode in interface StatisticsTracking
Returns:
the seed iterator
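The three-tier sort order described above can be sketched as a comparator. This illustrative example (not Heritrix source) uses Integer status codes, with null standing in for "not processed", instead of SeedRecord objects:

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of the documented seed sort order:
//   1. no status code first (represented here as null),
//   2. negative codes, largest to smallest,
//   3. positive codes, largest to smallest.
public class SeedStatusOrder {
    static final Comparator<Integer> ORDER = (a, b) -> {
        if (a == null || b == null) {
            return (a == null ? 0 : 1) - (b == null ? 0 : 1); // nulls first
        }
        boolean aNeg = a < 0, bNeg = b < 0;
        if (aNeg != bNeg) {
            return aNeg ? -1 : 1; // negative codes before positive ones
        }
        return Integer.compare(b, a); // within a group, largest first
    };

    public static void main(String[] args) {
        Integer[] codes = {200, null, -404, 301, -1, null};
        Arrays.sort(codes, ORDER);
        System.out.println(Arrays.toString(codes));
        // prints [null, null, -1, -404, 301, 200]
    }
}
```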

getSeedRecordsSortedByStatusCode

protected java.util.Iterator<SeedRecord> getSeedRecordsSortedByStatusCode(java.util.Iterator<java.lang.String> i)

crawlEnded

public void crawlEnded(java.lang.String message)
Description copied from interface: CrawlStatusListener
Called when a CrawlController has ended a crawl and is about to exit.

Specified by:
crawlEnded in interface CrawlStatusListener
Overrides:
crawlEnded in class AbstractTracker
Parameters:
message - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlStatusListener.crawlEnded(java.lang.String)

writeSeedsReportTo

protected void writeSeedsReportTo(java.io.PrintWriter writer)
Parameters:
writer - Where to write.

writeSourceReportTo

protected void writeSourceReportTo(java.io.PrintWriter writer)

getReverseSortedHostCounts

public java.util.SortedMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedHostCounts(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostCounts)
Return a copy of the given hosts distribution in reverse-sorted (largest first) order.

Parameters:
hostCounts - the hosts distribution to sort
Returns:
SortedMap of hosts distribution

writeHostsReportTo

protected void writeHostsReportTo(java.io.PrintWriter writer)

fixup

protected java.lang.String fixup(java.lang.String hostName)

writeReportLine

protected void writeReportLine(java.io.PrintWriter writer,
                               java.lang.Object... fields)

getReverseSortedHostsDistribution

public java.util.SortedMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedHostsDistribution()
Return a copy of the hosts distribution in reverse-sorted (largest first) order.

Returns:
SortedMap of hosts distribution
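One common way to produce such a reverse-sorted view is a TreeMap whose key comparator consults the counts, breaking ties by host name so that distinct keys never compare as equal. A self-contained sketch (illustrative only; host names and counts are made up, and the method name is hypothetical):

```java
import java.util.Comparator;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: sort host names by their counts, largest first, tie-broken
// lexically so the comparator stays consistent for distinct keys.
public class ReverseHostSort {
    static SortedMap<String, AtomicLong> reverseSorted(Map<String, AtomicLong> counts) {
        Comparator<String> byCountDesc = (a, b) -> {
            int cmp = Long.compare(counts.get(b).get(), counts.get(a).get());
            return cmp != 0 ? cmp : a.compareTo(b);
        };
        SortedMap<String, AtomicLong> sorted = new TreeMap<>(byCountDesc);
        sorted.putAll(counts);
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, AtomicLong> counts = Map.of(
            "example.com", new AtomicLong(5),
            "archive.org", new AtomicLong(42),
            "example.net", new AtomicLong(7));
        System.out.println(reverseSorted(counts).firstKey()); // prints archive.org
    }
}
```

Note that the returned map is a copy: later changes to the original counts do not reorder it.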

writeMimetypesReportTo

protected void writeMimetypesReportTo(java.io.PrintWriter writer)

writeResponseCodeReportTo

protected void writeResponseCodeReportTo(java.io.PrintWriter writer)

writeCrawlReportTo

protected void writeCrawlReportTo(java.io.PrintWriter writer)

writeProcessorsReportTo

protected void writeProcessorsReportTo(java.io.PrintWriter writer)

writeReportFile

protected void writeReportFile(java.lang.String reportName,
                               java.lang.String filename)

writeManifestReportTo

protected void writeManifestReportTo(java.io.PrintWriter writer)
Parameters:
writer - Where to write.

writeFrontierReportTo

protected void writeFrontierReportTo(java.io.PrintWriter writer)
Write the Frontier's 'nonempty' report (if available).

Parameters:
writer - to report to

dumpReports

public void dumpReports()
Run the reports.

Overrides:
dumpReports in class AbstractTracker

crawlCheckpoint

public void crawlCheckpoint(java.io.File cpDir)
                     throws java.lang.Exception
Description copied from interface: CrawlStatusListener
Called by CrawlController when checkpointing.

Specified by:
crawlCheckpoint in interface CrawlStatusListener
Parameters:
cpDir - Checkpoint dir. Write checkpoint state here.
Throws:
java.lang.Exception - A fatal exception. Any exceptions that are let out of this checkpoint are assumed fatal and terminate further checkpoint processing.


Copyright © 2003-2011 Internet Archive. All Rights Reserved.