org.archive.crawler.admin
Class StatisticsTracker

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.AbstractTracker
                      extended by org.archive.crawler.admin.StatisticsTracker
All Implemented Interfaces:
java.io.Serializable, java.lang.Runnable, javax.management.DynamicMBean, CrawlStatusListener, CrawlURIDispositionListener, StatisticsTracking

public class StatisticsTracker
extends AbstractTracker
implements CrawlURIDispositionListener, java.io.Serializable

This is an implementation of AbstractTracker. It is designed to function with the WUI as well as to perform various logging activities.

At the end of each snapshot interval a line is written to the 'progress-statistics.log' file.

The header of that file is as follows:

 [timestamp] [discovered]    [queued] [downloaded] [doc/s(avg)]  [KB/s(avg)] [dl-failures] [busy-thread] [mem-use-KB]
First there is a timestamp, accurate down to 1 second.

discovered, queued, downloaded and dl-failures are (respectively) the discovered URI count, pending URI count, successfully fetched count and failed fetch count from the frontier at the time of the snapshot.

KB/s(avg) is the bandwidth usage. We use the total bytes downloaded to calculate the average bandwidth usage (KB/sec). Since we also note the value each time a snapshot is made, we can calculate the average bandwidth usage during the last snapshot period to obtain a "current" rate. The first number is the current rate and the average is in parentheses.

doc/s(avg) works the same way as KB/s(avg), except it shows the number of documents (URIs) rather than KB downloaded.

busy-threads is the total number of ToeThreads that are not available (and thus presumably busy processing a URI). This information is extracted from the crawl controller.

Finally mem-use-KB is extracted from the run time environment (Runtime.getRuntime().totalMemory()).

In addition to the data collected for the above logs, various other data is gathered and stored by this tracker.
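The current-versus-average rate calculation described above can be sketched as follows. This is a minimal illustration of the snapshot-delta technique, not the tracker's actual internals; `RateSnapshot` and its fields are hypothetical names.

```java
/**
 * Minimal sketch of the snapshot-delta rate calculation described above.
 * RateSnapshot and its fields are illustrative names, not StatisticsTracker's
 * actual internals.
 */
public class RateSnapshot {
    private long lastBytes;          // total bytes at the previous snapshot
    private long lastTimeMs;         // time of the previous snapshot
    private final long startTimeMs;  // crawl start time

    public RateSnapshot(long startTimeMs) {
        this.startTimeMs = startTimeMs;
        this.lastTimeMs = startTimeMs;
    }

    /** Take a snapshot; returns { currentKBPerSec, averageKBPerSec }. */
    public long[] snapshot(long nowMs, long totalBytes) {
        long deltaBytes = totalBytes - lastBytes;   // bytes since last snapshot
        long deltaMs = nowMs - lastTimeMs;          // last snapshot period
        long current = deltaMs > 0 ? (deltaBytes / 1024) * 1000 / deltaMs : 0;
        long elapsedMs = nowMs - startTimeMs;       // whole-crawl elapsed time
        long average = elapsedMs > 0 ? (totalBytes / 1024) * 1000 / elapsedMs : 0;
        lastBytes = totalBytes;                     // note values for next period
        lastTimeMs = nowMs;
        return new long[] { current, average };
    }
}
```

Because the previous totals are noted at every snapshot, the "current" rate reflects only the last period, while the average covers the whole crawl.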

Author:
Parker Thompson, Kristinn Sigurdsson
See Also:
StatisticsTracking, AbstractTracker, Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
protected  long averageDepth
           
protected  int busyThreads
           
protected  float congestionRatio
           
protected  CrawledBytesHistotable crawledBytes
          Tally of sizes: novel, verified (same hash), vouched (not-modified)
protected  double currentDocsPerSecond
           
protected  int currentKBPerSec
           
protected  long deepestUri
           
protected  long discoveredUriCount
           
protected  double docsPerSecond
           
protected  long downloadDisregards
           
protected  long downloadedUriCount
           
protected  long downloadFailures
           
protected  long dupByHashUriCount
           
protected  long finishedUriCount
           
protected  ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsBytes
           
protected  ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsDistribution
          Keep track of hosts.
protected  ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsLastFinished
           
protected  long lastPagesFetchedCount
           
protected  long lastProcessedBytesCount
           
protected  java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeBytes
           
protected  java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeDistribution
          Keep track of the file types we see (mime type -> count)
protected  long notModifiedUriCount
           
protected  long novelUriCount
           
protected  ObjectIdentityCache<java.lang.String,SeedRecord> processedSeedsRecords
          Record of seeds' latest actions.
protected  long queuedUriCount
           
protected  ObjectIdentityCache<java.lang.String,java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong>> sourceHostDistribution
          Keep track of URL counts per host per seed
protected  java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> statusCodeDistribution
          Keep track of fetch status codes
protected  long totalKBPerSec
           
protected  long totalProcessedBytes
           
 
Fields inherited from class org.archive.crawler.framework.AbstractTracker
ATTR_STATS_INTERVAL, controller, crawlerEndTime, crawlerPauseStarted, crawlerStartTime, crawlerTotalPausedTime, DEFAULT_STATISTICS_REPORT_INTERVAL, lastLogPointTime, shouldrun
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.framework.StatisticsTracking
SEED_DISPOSITION_DISREGARD, SEED_DISPOSITION_FAILURE, SEED_DISPOSITION_NOT_PROCESSED, SEED_DISPOSITION_RETRY, SEED_DISPOSITION_SUCCESS
 
Constructor Summary
StatisticsTracker(java.lang.String name)
           
 
Method Summary
 int activeThreadCount()
          Get the number of active (non-paused) threads.
 long averageDepth()
          Average depth of the last URI in all eligible queues.
 float congestionRatio()
          Ratio of number of threads that would theoretically allow maximum crawl progress (if each was as productive as current threads), to current number of threads.
 void crawlCheckpoint(java.io.File cpDir)
          Called by CrawlController when checkpointing.
 java.lang.String crawledBytesSummary()
           
 void crawledURIDisregard(CrawlURI curi)
          Notification of a crawled URI that is to be disregarded.
 void crawledURIFailure(CrawlURI curi)
          Notification of a failed crawling of a URI.
 void crawledURINeedRetry(CrawlURI curi)
          Notification of a failed crawl of a URI that will be retried (failure due to possible transient problems).
 void crawledURISuccessful(CrawlURI curi)
          Notification of a successfully crawled URI
 void crawlEnded(java.lang.String message)
          Called when a CrawlController has ended a crawl and is about to exit.
 double currentProcessedDocsPerSec()
          Returns an estimate of recent document download rates based on a queue of recently seen CrawlURIs (as of last snapshot).
 int currentProcessedKBPerSec()
          Calculates an estimate of the rate, in kb, at which documents are currently being processed by the crawler.
 long deepestUri()
          Ordinal position of the 'deepest' URI eligible for crawling.
 long discoveredUriCount()
          Number of discovered URIs.
 long disregardedFetchAttempts()
          Get the total number of disregarded fetch attempts (e.g. robots.txt exclusions)
 void dumpReports()
          Run the reports.
 long failedFetchAttempts()
          Get the total number of failed fetch attempts (connection failures -> give up, etc)
protected  void finalCleanup()
          Cleanup resources used, at crawl end.
 long finishedUriCount()
          Number of URIs that have finished processing.
protected  java.lang.String fixup(java.lang.String hostName)
           
 long getBytesPerFileType(java.lang.String filetype)
          Returns the accumulated number of bytes from files of a given file type.
 long getBytesPerHost(java.lang.String host)
          Returns the accumulated number of bytes downloaded from a given host.
 java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> getFileDistribution()
          Returns a HashMap that contains information about distributions of encountered mime types.
 java.util.concurrent.atomic.AtomicLong getHostLastFinished(java.lang.String host)
          Returns the time (in millisec) when a URI belonging to a given host was last finished processing.
 java.util.Map<java.lang.String,java.lang.Number> getProgressStatistics()
           
 java.lang.String getProgressStatisticsLine()
          Return one line of current progress-statistics
 java.lang.String getProgressStatisticsLine(java.util.Date now)
          Return one line of current progress-statistics
 java.util.TreeMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedCopy(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> mapOfAtomicLongValues)
          Sort the entries of the given HashMap in descending order by their values, which must be longs wrapped with AtomicLong.
 java.util.TreeMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedCopy(ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> mapOfAtomicLongValues)
          Sort the entries of the given ObjectIdentityCache in descending order by their values, which must be longs wrapped with AtomicLong.
 java.util.SortedMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedHostCounts(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostCounts)
          Return a copy of the hosts distribution in reverse-sorted (largest first) order.
 java.util.SortedMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedHostsDistribution()
          Return a copy of the hosts distribution in reverse-sorted (largest first) order.
 java.util.Iterator<SeedRecord> getSeedRecordsSortedByStatusCode()
          Get a SeedRecord iterator for the job being monitored.
protected  java.util.Iterator<SeedRecord> getSeedRecordsSortedByStatusCode(java.util.Iterator<java.lang.String> i)
           
 java.util.Iterator<java.lang.String> getSeeds()
          Get a seed iterator for the job being monitored.
 java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> getStatusCodeDistribution()
          Return a HashMap representing the distribution of status codes for successfully fetched curis, as represented by a hashmap where key -> val represents (string)code -> (integer)count.
protected static void incrementCacheCount(ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> cache, java.lang.String key)
          Increment a counter for a key in a given cache.
protected static void incrementCacheCount(ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> cache, java.lang.String key, long increment)
          Increment a counter for a key in a given cache by an arbitrary amount.
protected static void incrementMapCount(java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> map, java.lang.String key)
          Increment a counter for a key in a given HashMap.
protected static void incrementMapCount(java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> map, java.lang.String key, long increment)
          Increment a counter for a key in a given HashMap by an arbitrary amount.
 void initialize(CrawlController c)
          Sets up the Logger (including logInterval) and registers with the CrawlController for CrawlStatus and CrawlURIDisposition events.
 int percentOfDiscoveredUrisCompleted()
          This returns the number of completed URIs as a percentage of the total number of URIs encountered (should be inverse to the discovery curve)
 double processedDocsPerSec()
          Returns the number of documents that have been processed per second over the life of the crawl (as of last snapshot)
 long processedKBPerSec()
          Calculates the rate that data, in kb, has been processed over the life of the crawl (as of last snapshot.)
protected  void progressStatisticsEvent(java.util.EventObject e)
          A method for logging current crawler state.
 long queuedUriCount()
          Number of URIs queued up and waiting for processing.
protected  void saveHostStats(java.lang.String hostname, long size)
           
protected  void saveSourceStats(java.lang.String source, java.lang.String hostname)
           
 long successfullyFetchedCount()
          Number of successfully processed URIs.
 int threadCount()
          Get the total number of ToeThreads (sleeping and active)
 long totalBytesCrawled()
          Returns the total number of uncompressed bytes crawled.
 long totalBytesWritten()
          Deprecated. use totalBytesCrawled
 long totalCount()
           
protected  void writeCrawlReportTo(java.io.PrintWriter writer)
           
protected  void writeFrontierReportTo(java.io.PrintWriter writer)
          Write the Frontier's 'nonempty' report (if available)
protected  void writeHostsReportTo(java.io.PrintWriter writer)
           
protected  void writeManifestReportTo(java.io.PrintWriter writer)
           
protected  void writeMimetypesReportTo(java.io.PrintWriter writer)
           
protected  void writeProcessorsReportTo(java.io.PrintWriter writer)
           
protected  void writeReportFile(java.lang.String reportName, java.lang.String filename)
           
protected  void writeReportLine(java.io.PrintWriter writer, java.lang.Object... fields)
           
protected  void writeResponseCodeReportTo(java.io.PrintWriter writer)
           
protected  void writeSeedsReportTo(java.io.PrintWriter writer)
           
protected  void writeSourceReportTo(java.io.PrintWriter writer)
           
 
Methods inherited from class org.archive.crawler.framework.AbstractTracker
crawlDuration, crawlEnding, crawlPaused, crawlPausing, crawlResuming, crawlStarted, getCrawlEndTime, getCrawlerTotalElapsedTime, getCrawlPauseStartedTime, getCrawlStartTime, getCrawlTotalPauseTime, getLogWriteInterval, logNote, noteStart, progressStatisticsLegend, run, tallyCurrentPause
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

lastPagesFetchedCount

protected long lastPagesFetchedCount

lastProcessedBytesCount

protected long lastProcessedBytesCount

discoveredUriCount

protected long discoveredUriCount

queuedUriCount

protected long queuedUriCount

finishedUriCount

protected long finishedUriCount

downloadedUriCount

protected long downloadedUriCount

downloadFailures

protected long downloadFailures

downloadDisregards

protected long downloadDisregards

docsPerSecond

protected double docsPerSecond

currentDocsPerSecond

protected double currentDocsPerSecond

currentKBPerSec

protected int currentKBPerSec

totalKBPerSec

protected long totalKBPerSec

busyThreads

protected int busyThreads

totalProcessedBytes

protected long totalProcessedBytes

congestionRatio

protected float congestionRatio

deepestUri

protected long deepestUri

averageDepth

protected long averageDepth

crawledBytes

protected CrawledBytesHistotable crawledBytes
Tally of sizes: novel, verified (same hash), vouched (not-modified)


notModifiedUriCount

protected long notModifiedUriCount

dupByHashUriCount

protected long dupByHashUriCount

novelUriCount

protected long novelUriCount

mimeTypeDistribution

protected java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeDistribution
Keep track of the file types we see (mime type -> count)


mimeTypeBytes

protected java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeBytes

statusCodeDistribution

protected java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> statusCodeDistribution
Keep track of fetch status codes


hostsDistribution

protected transient ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsDistribution
Keep track of hosts.

These fields are transient because they are usually big maps that get reconstituted on recovery from a checkpoint.


hostsBytes

protected transient ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsBytes

hostsLastFinished

protected transient ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsLastFinished

sourceHostDistribution

protected transient ObjectIdentityCache<java.lang.String,java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong>> sourceHostDistribution
Keep track of URL counts per host per seed


processedSeedsRecords

protected transient ObjectIdentityCache<java.lang.String,SeedRecord> processedSeedsRecords
Record of seeds' latest actions.

Constructor Detail

StatisticsTracker

public StatisticsTracker(java.lang.String name)
Method Detail

initialize

public void initialize(CrawlController c)
                throws FatalConfigurationException
Description copied from class: AbstractTracker
Sets up the Logger (including logInterval) and registers with the CrawlController for CrawlStatus and CrawlURIDisposition events.

Specified by:
initialize in interface StatisticsTracking
Overrides:
initialize in class AbstractTracker
Parameters:
c - A crawl controller instance.
Throws:
FatalConfigurationException - Not thrown here. Declared for overrides that consult the settings system for configuration.
See Also:
CrawlStatusListener, CrawlURIDispositionListener

finalCleanup

protected void finalCleanup()
Description copied from class: AbstractTracker
Cleanup resources used, at crawl end.

Overrides:
finalCleanup in class AbstractTracker

progressStatisticsEvent

protected void progressStatisticsEvent(java.util.EventObject e)
Description copied from class: AbstractTracker
A method for logging current crawler state. This method is called by run() at intervals specified in the crawl order file. It is also invoked when pausing or stopping a crawl, to capture the state at that point. The default behavior is a call to CrawlController.logProgressStatistics(java.lang.String) so that CrawlController can act on the progress-statistics event.

Implementations of this method should carefully consider whether it needs to be synchronized, in whole or in part.

Overrides:
progressStatisticsEvent in class AbstractTracker
Parameters:
e - Progress statistics event.

getProgressStatisticsLine

public java.lang.String getProgressStatisticsLine(java.util.Date now)
Return one line of current progress-statistics

Parameters:
now -
Returns:
String of stats

getProgressStatistics

public java.util.Map<java.lang.String,java.lang.Number> getProgressStatistics()
Specified by:
getProgressStatistics in interface StatisticsTracking
Returns:
Map of progress-statistics.

getProgressStatisticsLine

public java.lang.String getProgressStatisticsLine()
Return one line of current progress-statistics

Specified by:
getProgressStatisticsLine in interface StatisticsTracking
Returns:
String of stats

processedDocsPerSec

public double processedDocsPerSec()
Description copied from interface: StatisticsTracking
Returns the number of documents that have been processed per second over the life of the crawl (as of last snapshot)

Specified by:
processedDocsPerSec in interface StatisticsTracking
Returns:
The rate per second of documents gathered so far

currentProcessedDocsPerSec

public double currentProcessedDocsPerSec()
Description copied from interface: StatisticsTracking
Returns an estimate of recent document download rates based on a queue of recently seen CrawlURIs (as of last snapshot).

Specified by:
currentProcessedDocsPerSec in interface StatisticsTracking
Returns:
The rate per second of documents gathered during the last snapshot

processedKBPerSec

public long processedKBPerSec()
Description copied from interface: StatisticsTracking
Calculates the rate that data, in kb, has been processed over the life of the crawl (as of last snapshot.)

Specified by:
processedKBPerSec in interface StatisticsTracking
Returns:
The rate per second of KB gathered so far

currentProcessedKBPerSec

public int currentProcessedKBPerSec()
Description copied from interface: StatisticsTracking
Calculates an estimate of the rate, in KB, at which documents are currently being processed by the crawler. For more accurate estimates, set a larger queue size, or fetch and average multiple values (as of last snapshot).

Specified by:
currentProcessedKBPerSec in interface StatisticsTracking
Returns:
The rate per second of KB gathered during the last snapshot

getFileDistribution

public java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> getFileDistribution()
Returns a HashMap that contains information about distributions of encountered mime types. Key/value pairs represent mime type -> count.

Note: all values are wrapped in an AtomicLong

Returns:
mimeTypeDistribution

incrementMapCount

protected static void incrementMapCount(java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> map,
                                        java.lang.String key)
Increment a counter for a key in a given HashMap. Used for various aggregate data. As this is used to change Maps which depend on StatisticsTracker for their synchronization, this method should only be invoked from a block synchronized on 'this'.

Parameters:
map - The HashMap
key - The key for the counter to be incremented; if it does not exist, it is added (set to 1). If null, the counter "unknown" is incremented.
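The counter-increment idiom this method describes can be sketched as follows. This is an assumed reconstruction, not the actual Heritrix source; `MapCounter` is a hypothetical name.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of the counter-increment idiom described above (an assumed
 * reconstruction, not the Heritrix source): missing keys start at the
 * increment amount, null keys fall back to an "unknown" bucket.
 */
public class MapCounter {
    public static void increment(
            ConcurrentMap<String, AtomicLong> map, String key, long amount) {
        if (key == null) {
            key = "unknown"; // null keys are tallied under "unknown"
        }
        AtomicLong counter = map.get(key);
        if (counter == null) {
            AtomicLong prior = map.putIfAbsent(key, new AtomicLong(amount));
            if (prior == null) {
                return; // we installed the initial value atomically
            }
            counter = prior; // another thread won the race; add to its counter
        }
        counter.addAndGet(amount);
    }
}
```

Usage mirrors the tracker's tallies, e.g. `MapCounter.increment(mimeTypeDistribution, "text/html", 1)` for a count or a byte total for `mimeTypeBytes`.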

incrementCacheCount

protected static void incrementCacheCount(ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> cache,
                                          java.lang.String key)
Increment a counter for a key in a given cache. Used for various aggregate data.

Parameters:
cache - the ObjectIdentityCache
key - The key for the counter to be incremented; if it does not exist, it is added (set to 1). If null, the counter "unknown" is incremented.

incrementCacheCount

protected static void incrementCacheCount(ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> cache,
                                          java.lang.String key,
                                          long increment)
Increment a counter for a key in a given cache by an arbitrary amount. Used for various aggregate data. The increment amount can be negative.

Parameters:
cache - The ObjectIdentityCache
key - The key for the counter to be incremented; if it does not exist, it is added (set equal to increment). If null, the counter "unknown" is incremented.
increment - The amount to increment counter related to the key.

incrementMapCount

protected static void incrementMapCount(java.util.concurrent.ConcurrentMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> map,
                                        java.lang.String key,
                                        long increment)
Increment a counter for a key in a given HashMap by an arbitrary amount. Used for various aggregate data. The increment amount can be negative.

Parameters:
map - The Map or ConcurrentMap
key - The key for the counter to be incremented; if it does not exist, it is added (set equal to increment). If null, the counter "unknown" is incremented.
increment - The amount to increment counter related to the key.

getReverseSortedCopy

public java.util.TreeMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedCopy(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> mapOfAtomicLongValues)
Sort the entries of the given HashMap in descending order by their values, which must be longs wrapped with AtomicLong.

Elements are sorted by value from largest to smallest. Equal values are sorted in an arbitrary, but consistent manner by their keys. Only items with identical value and key are considered equal. If the passed-in map requires access to be synchronized, the caller should ensure this synchronization.

Parameters:
mapOfAtomicLongValues - Assumes values are wrapped with AtomicLong.
Returns:
a sorted set containing the same elements as the map.
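The described ordering (largest value first, consistent key tie-break so equal counts are both retained) can be sketched as follows. This is an assumed implementation, not the Heritrix source; `ReverseSort` is a hypothetical name.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of the sort described above (an assumed implementation, not the
 * Heritrix source): a TreeMap ordered largest-value-first, breaking ties
 * consistently by key so distinct keys with equal counts are both kept.
 */
public class ReverseSort {
    public static TreeMap<String, AtomicLong> reverseSortedCopy(
            final Map<String, AtomicLong> source) {
        Comparator<String> byValueDescThenKey = new Comparator<String>() {
            public int compare(String a, String b) {
                long va = source.get(a).get();
                long vb = source.get(b).get();
                if (va != vb) {
                    return va > vb ? -1 : 1; // larger values sort first
                }
                return a.compareTo(b); // consistent tie-break keeps both keys
            }
        };
        TreeMap<String, AtomicLong> sorted = new TreeMap<>(byValueDescThenKey);
        sorted.putAll(source);
        return sorted;
    }
}
```

Note the comparator reads values from the source map, so the returned copy should only be used for iteration and reporting, and (as the Javadoc above says) the caller must handle any synchronization the source map requires.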

getReverseSortedCopy

public java.util.TreeMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedCopy(ObjectIdentityCache<java.lang.String,java.util.concurrent.atomic.AtomicLong> mapOfAtomicLongValues)
Sort the entries of the given ObjectIdentityCache in descending order by their values, which must be longs wrapped with AtomicLong.

Elements are sorted by value from largest to smallest. Equal values are sorted in an arbitrary, but consistent manner by their keys. Only items with identical value and key are considered equal. If the passed-in map requires access to be synchronized, the caller should ensure this synchronization.

Parameters:
mapOfAtomicLongValues - Assumes values are wrapped with AtomicLong.
Returns:
a sorted set containing the same elements as the map.

getStatusCodeDistribution

public java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> getStatusCodeDistribution()
Return a HashMap representing the distribution of status codes for successfully fetched curis, as a map where key -> val represents (string)code -> (integer)count. Note: all values are wrapped in an AtomicLong

Returns:
statusCodeDistribution

getHostLastFinished

public java.util.concurrent.atomic.AtomicLong getHostLastFinished(java.lang.String host)
Returns the time (in millisec) when a URI belonging to a given host was last finished processing.

Parameters:
host - The host to look up time of last completed URI.
Returns:
Returns the time (in milliseconds) when a URI belonging to the given host was last finished processing. If no URI has been completed for the host, -1 is returned.

getBytesPerHost

public long getBytesPerHost(java.lang.String host)
Returns the accumulated number of bytes downloaded from a given host.

Parameters:
host - name of the host
Returns:
the accumulated number of bytes downloaded from a given host

getBytesPerFileType

public long getBytesPerFileType(java.lang.String filetype)
Returns the accumulated number of bytes from files of a given file type.

Parameters:
filetype - Filetype to check.
Returns:
the accumulated number of bytes from files of a given mime type

threadCount

public int threadCount()
Get the total number of ToeThreads (sleeping and active)

Returns:
The total number of ToeThreads

activeThreadCount

public int activeThreadCount()
Description copied from interface: StatisticsTracking
Get the number of active (non-paused) threads.

Specified by:
activeThreadCount in interface StatisticsTracking
Returns:
Current thread count (or zero if it cannot be determined).

percentOfDiscoveredUrisCompleted

public int percentOfDiscoveredUrisCompleted()
This returns the number of completed URIs as a percentage of the total number of URIs encountered (should be inverse to the discovery curve)

Returns:
The number of completed URIs as a percentage of the total number of URIs encountered
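The percentage above is straightforward arithmetic; a minimal sketch follows. The formula (finished over discovered) is an assumption drawn from the description, and `CompletionPercent` is a hypothetical helper, not part of this class.

```java
/**
 * Illustrative arithmetic for the completion percentage described above.
 * The formula (finished / discovered) is assumed from the description,
 * not taken from the Heritrix source.
 */
public class CompletionPercent {
    public static int percentCompleted(long finishedUriCount,
            long discoveredUriCount) {
        if (discoveredUriCount == 0) {
            return 0; // avoid divide-by-zero before any URIs are discovered
        }
        return (int) (100 * finishedUriCount / discoveredUriCount);
    }
}
```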

discoveredUriCount

public long discoveredUriCount()
Number of discovered URIs.

If crawl not running (paused or stopped) this will return the value of the last snapshot.

Returns:
A count of all uris encountered
See Also:
Frontier.discoveredUriCount()

finishedUriCount

public long finishedUriCount()
Number of URIs that have finished processing.

Returns:
Number of URIs that have finished processing
See Also:
Frontier.finishedUriCount()

failedFetchAttempts

public long failedFetchAttempts()
Get the total number of failed fetch attempts (connection failures -> give up, etc)

Returns:
The total number of failed fetch attempts

disregardedFetchAttempts

public long disregardedFetchAttempts()
Get the total number of disregarded fetch attempts (e.g. robots.txt exclusions)

Returns:
The total number of disregarded fetch attempts

successfullyFetchedCount

public long successfullyFetchedCount()
Description copied from interface: StatisticsTracking
Number of successfully processed URIs.

If crawl not running (paused or stopped) this will return the value of the last snapshot.

Specified by:
successfullyFetchedCount in interface StatisticsTracking
Returns:
The number of successfully fetched URIs
See Also:
Frontier.succeededFetchCount()

totalCount

public long totalCount()
Specified by:
totalCount in interface StatisticsTracking
Returns:
Total number of URIs (processed + queued + currently being processed)

congestionRatio

public float congestionRatio()
Ratio of number of threads that would theoretically allow maximum crawl progress (if each was as productive as current threads), to current number of threads.

Specified by:
congestionRatio in interface StatisticsTracking
Returns:
float congestion ratio

deepestUri

public long deepestUri()
Ordinal position of the 'deepest' URI eligible for crawling. Essentially, the length of the longest frontier internal queue.

Specified by:
deepestUri in interface StatisticsTracking
Returns:
long URI count to deepest URI

averageDepth

public long averageDepth()
Average depth of the last URI in all eligible queues. That is, the average length of all eligible queues.

Specified by:
averageDepth in interface StatisticsTracking
Returns:
long average depth of last URIs in queues

queuedUriCount

public long queuedUriCount()
Number of URIs queued up and waiting for processing.

If crawl not running (paused or stopped) this will return the value of the last snapshot.

Returns:
Number of URIs queued up and waiting for processing.
See Also:
Frontier.queuedUriCount()

totalBytesWritten

public long totalBytesWritten()
Deprecated. use totalBytesCrawled

Description copied from interface: StatisticsTracking
Returns the total number of uncompressed bytes processed. Stored data may be much smaller due to compression or duplicate-reduction policies.

Specified by:
totalBytesWritten in interface StatisticsTracking
Returns:
The total number of uncompressed bytes written to disk

totalBytesCrawled

public long totalBytesCrawled()
Description copied from interface: StatisticsTracking
Returns the total number of uncompressed bytes crawled. Stored data may be much smaller due to compression or duplicate-reduction policies.

Specified by:
totalBytesCrawled in interface StatisticsTracking
Returns:
The total number of uncompressed bytes crawled

crawledBytesSummary

public java.lang.String crawledBytesSummary()

crawledURISuccessful

public void crawledURISuccessful(CrawlURI curi)
Description copied from interface: CrawlURIDispositionListener
Notification of a successfully crawled URI

Specified by:
crawledURISuccessful in interface CrawlURIDispositionListener
Parameters:
curi - The relevant CrawlURI

saveSourceStats

protected void saveSourceStats(java.lang.String source,
                               java.lang.String hostname)

saveHostStats

protected void saveHostStats(java.lang.String hostname,
                             long size)

crawledURINeedRetry

public void crawledURINeedRetry(CrawlURI curi)
Description copied from interface: CrawlURIDispositionListener
Notification of a failed crawl of a URI that will be retried (failure due to possible transient problems).

Specified by:
crawledURINeedRetry in interface CrawlURIDispositionListener
Parameters:
curi - The relevant CrawlURI

crawledURIDisregard

public void crawledURIDisregard(CrawlURI curi)
Description copied from interface: CrawlURIDispositionListener
Notification of a crawled URI that is to be disregarded. Usually this means that the robots.txt file for the relevant site forbids the URI from being crawled and it is therefore not kept. Other reasons may apply. In all cases this means that the URI was successfully downloaded but will not be stored.

Specified by:
crawledURIDisregard in interface CrawlURIDispositionListener
Parameters:
curi - The relevant CrawlURI

crawledURIFailure

public void crawledURIFailure(CrawlURI curi)
Description copied from interface: CrawlURIDispositionListener
Notification of a failed crawl of a URI. The failure is of a type that precludes retries (either by its very nature or because the URI has been retried too many times).

Specified by:
crawledURIFailure in interface CrawlURIDispositionListener
Parameters:
curi - The relevant CrawlURI
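The four disposition callbacks above (successful, need-retry, disregard, failure) each feed a separate tally. A minimal self-contained sketch in the spirit of a CrawlURIDispositionListener implementation (class, field, and parameter names here are hypothetical, and a plain String stands in for CrawlURI):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: each disposition notification bumps its own counter,
// mirroring how a tracker can accumulate per-disposition statistics.
public class DispositionCounters {
    final AtomicLong successes = new AtomicLong();
    final AtomicLong retries = new AtomicLong();
    final AtomicLong disregards = new AtomicLong();
    final AtomicLong failures = new AtomicLong();

    public void crawledURISuccessful(String uri) { successes.incrementAndGet(); }
    public void crawledURINeedRetry(String uri)  { retries.incrementAndGet(); }
    public void crawledURIDisregard(String uri)  { disregards.incrementAndGet(); }
    public void crawledURIFailure(String uri)    { failures.incrementAndGet(); }

    public static void main(String[] args) {
        DispositionCounters c = new DispositionCounters();
        c.crawledURISuccessful("http://example.com/");
        c.crawledURIFailure("http://example.com/missing");
        System.out.println(c.successes.get() + " " + c.failures.get()); // prints "1 1"
    }
}
```

AtomicLong is used because disposition notifications may arrive from many ToeThreads concurrently.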

getSeeds

public java.util.Iterator<java.lang.String> getSeeds()
Get a seed iterator for the job being monitored. Note: This iterator will iterate over a list of Strings, not UURIs as the Scope seed iterator does. The strings are equal to the URIs' getURIString() values.

Returns:
the seed iterator FIXME: Consider using TransformingIterator here

getSeedRecordsSortedByStatusCode

public java.util.Iterator<SeedRecord> getSeedRecordsSortedByStatusCode()
Description copied from interface: StatisticsTracking
Get a SeedRecord iterator for the job being monitored. If the job is no longer running, stored values will be returned. If the job is running, the current seed iterator will be fetched and stored values will be updated.

Sort order is:
No status code (not processed)
Status codes smaller than 0 (largest to smallest)
Status codes larger than 0 (largest to smallest)

Note: This iterator will iterate over a list of SeedRecords.

Specified by:
getSeedRecordsSortedByStatusCode in interface StatisticsTracking
Returns:
the seed iterator
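The three-tier sort order described above can be sketched as a comparator. This illustrative example (not Heritrix source) uses Integer status codes, with null standing in for "not processed", instead of SeedRecord objects:

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of the documented seed sort order:
//   1. no status code first (represented here as null),
//   2. negative codes, largest to smallest,
//   3. positive codes, largest to smallest.
public class SeedStatusOrder {
    static final Comparator<Integer> ORDER = (a, b) -> {
        if (a == null || b == null) {
            return (a == null ? 0 : 1) - (b == null ? 0 : 1); // nulls first
        }
        boolean aNeg = a < 0, bNeg = b < 0;
        if (aNeg != bNeg) {
            return aNeg ? -1 : 1; // negative codes before positive ones
        }
        return Integer.compare(b, a); // within a group, largest first
    };

    public static void main(String[] args) {
        Integer[] codes = {200, null, -404, 301, -1, null};
        Arrays.sort(codes, ORDER);
        System.out.println(Arrays.toString(codes));
        // prints [null, null, -1, -404, 301, 200]
    }
}
```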

getSeedRecordsSortedByStatusCode

protected java.util.Iterator<SeedRecord> getSeedRecordsSortedByStatusCode(java.util.Iterator<java.lang.String> i)

crawlEnded

public void crawlEnded(java.lang.String message)
Description copied from interface: CrawlStatusListener
Called when a CrawlController has ended a crawl and is about to exit.

Specified by:
crawlEnded in interface CrawlStatusListener
Overrides:
crawlEnded in class AbstractTracker
Parameters:
message - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlStatusListener.crawlEnded(java.lang.String)

writeSeedsReportTo

protected void writeSeedsReportTo(java.io.PrintWriter writer)
Parameters:
writer - Where to write.

writeSourceReportTo

protected void writeSourceReportTo(java.io.PrintWriter writer)

getReverseSortedHostCounts

public java.util.SortedMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedHostCounts(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostCounts)
Return a copy of the given hosts distribution in reverse-sorted (largest first) order.

Parameters:
hostCounts - the hosts distribution to sort
Returns:
SortedMap of hosts distribution

writeHostsReportTo

protected void writeHostsReportTo(java.io.PrintWriter writer)

fixup

protected java.lang.String fixup(java.lang.String hostName)

writeReportLine

protected void writeReportLine(java.io.PrintWriter writer,
                               java.lang.Object... fields)

getReverseSortedHostsDistribution

public java.util.SortedMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedHostsDistribution()
Return a copy of the hosts distribution in reverse-sorted (largest first) order.

Returns:
SortedMap of hosts distribution
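One common way to produce such a reverse-sorted view is a TreeMap whose key comparator consults the counts, breaking ties by host name so that distinct keys never compare as equal. A self-contained sketch (illustrative only; host names and counts are made up, and the method name is hypothetical):

```java
import java.util.Comparator;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: sort host names by their counts, largest first, tie-broken
// lexically so the comparator stays consistent for distinct keys.
public class ReverseHostSort {
    static SortedMap<String, AtomicLong> reverseSorted(Map<String, AtomicLong> counts) {
        Comparator<String> byCountDesc = (a, b) -> {
            int cmp = Long.compare(counts.get(b).get(), counts.get(a).get());
            return cmp != 0 ? cmp : a.compareTo(b);
        };
        SortedMap<String, AtomicLong> sorted = new TreeMap<>(byCountDesc);
        sorted.putAll(counts);
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, AtomicLong> counts = Map.of(
            "example.com", new AtomicLong(5),
            "archive.org", new AtomicLong(42),
            "example.net", new AtomicLong(7));
        System.out.println(reverseSorted(counts).firstKey()); // prints archive.org
    }
}
```

Note that the returned map is a copy: later changes to the original counts do not reorder it.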

writeMimetypesReportTo

protected void writeMimetypesReportTo(java.io.PrintWriter writer)

writeResponseCodeReportTo

protected void writeResponseCodeReportTo(java.io.PrintWriter writer)

writeCrawlReportTo

protected void writeCrawlReportTo(java.io.PrintWriter writer)

writeProcessorsReportTo

protected void writeProcessorsReportTo(java.io.PrintWriter writer)

writeReportFile

protected void writeReportFile(java.lang.String reportName,
                               java.lang.String filename)

writeManifestReportTo

protected void writeManifestReportTo(java.io.PrintWriter writer)
Parameters:
writer - Where to write.

writeFrontierReportTo

protected void writeFrontierReportTo(java.io.PrintWriter writer)
Write the Frontier's 'nonempty' report (if available).

Parameters:
writer - to report to

dumpReports

public void dumpReports()
Run the reports.

Overrides:
dumpReports in class AbstractTracker

crawlCheckpoint

public void crawlCheckpoint(java.io.File cpDir)
                     throws java.lang.Exception
Description copied from interface: CrawlStatusListener
Called by CrawlController when checkpointing.

Specified by:
crawlCheckpoint in interface CrawlStatusListener
Parameters:
cpDir - Checkpoint dir. Write checkpoint state here.
Throws:
java.lang.Exception - A fatal exception. Any exceptions that are let out of this checkpoint are assumed fatal and terminate further checkpoint processing.


Copyright © 2003-2011 Internet Archive. All Rights Reserved.