org.archive.crawler.admin
Class StatisticsSummary

java.lang.Object
  extended by org.archive.crawler.admin.StatisticsSummary

public class StatisticsSummary
extends java.lang.Object

This class provides descriptive statistics of a finished crawl job by using the crawl report files generated by StatisticsTracker. Any formatting changes to the way StatisticsTracker writes to the summary crawl reports will require changes to this class.

The following statistics are accessible from this class:

TODO: Make it so summarizing is not done all in RAM so we avoid OOME.

Author:
Frank McCown
See Also:
StatisticsTracker

Field Summary
protected  java.lang.String bandwidthKbytesPerSec
           
protected  java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> dnsStatusCodeDistribution
           
protected  java.lang.String durationTime
           
protected  java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsBytes
           
protected  java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsDistribution
          Keep track of hosts
protected  java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsDnsBytes
           
protected  java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsDnsDistribution
           
protected  java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeBytes
           
protected  java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeDistribution
          Keep track of the file types we see (mime type -> count)
protected  java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeDnsBytes
           
protected  java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeDnsDistribution
           
protected  java.lang.String processedDocsPerSec
           
protected  java.util.Map<java.lang.String,SeedRecord> processedSeedsRecords
          Keep track of processed seeds
protected  java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> statusCodeDistribution
          Keep track of status codes
protected  java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> tldBytes
           
protected  java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> tldDistribution
          Keep track of TLDs
protected  java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> tldHostDistribution
           
protected  java.lang.String totalDataWritten
           
protected  long totalDnsHostDocuments
           
protected  long totalDnsHostSize
           
protected  long totalDnsMimeSize
           
protected  long totalDnsMimeTypeDocuments
           
protected  long totalDnsStatusCodeDocuments
           
protected  long totalFileTypeDocuments
           
protected  long totalHostDocuments
           
protected  long totalHosts
           
protected  long totalHostSize
           
protected  long totalMimeSize
           
protected  long totalMimeTypeDocuments
           
protected  long totalStatusCodeDocuments
           
protected  long totalTldDocuments
           
protected  long totalTldSize
           
 
Constructor Summary
StatisticsSummary(CrawlJob cjob)
          Constructor
 
Method Summary
 java.lang.String getBandwidthKbytesPerSec()
           
 long getBytesPerHost(java.lang.String host)
          Returns the accumulated number of bytes downloaded from a given host.
 long getBytesPerMimeType(java.lang.String filetype)
          Returns the accumulated number of bytes from files of a given file type.
 long getBytesPerTld(java.lang.String tld)
          Returns the total number of bytes downloaded for a given TLD.
 java.util.Hashtable getDnsMimeDistribution()
           
 java.util.Hashtable getDnsStatusCodeDistribution()
          Returns a Hashtable giving the distribution of DNS status codes for successfully fetched CrawlURIs, where each key -> value pair maps a (String) code to its (AtomicLong) count.
 java.lang.String getDurationTime()
           
 java.util.Hashtable getHostsDnsDistribution()
           
 long getHostsPerTld(java.lang.String tld)
          Get the number of hosts with a particular TLD.
 java.util.Hashtable getMimeDistribution()
          Returns a Hashtable that describes the distribution of encountered MIME types.
 java.lang.String getProcessedDocsPerSec()
           
 java.util.TreeMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedCopy(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> mapOfAtomicLongValues)
          Sorts the entries of the given Map in descending order by their values, which must be AtomicLongs.
 java.util.SortedMap getReverseSortedHostsDistribution()
          Return a copy of the hosts distribution in reverse-sorted (largest first) order.
 java.util.Iterator<SeedRecord> getSeedRecordsSortedByStatusCode()
          Returns an Iterator over the seed records, sorted by status code.
 java.util.Hashtable getStatusCodeDistribution()
          Returns a Hashtable giving the distribution of HTTP status codes for successfully fetched CrawlURIs, where each key -> value pair maps a (String) code to its (AtomicLong) count.
 java.util.Hashtable getTldBytes()
           
 java.util.Hashtable getTldDistribution()
           
 java.util.Hashtable getTldHostDistribution()
           
 java.lang.String getTotalDataWritten()
           
 long getTotalDnsHostDocuments()
           
 long getTotalDnsHostSize()
           
 long getTotalDnsMimeSize()
           
 long getTotalDnsMimeTypeDocuments()
           
 long getTotalDnsStatusCodeDocuments()
           
 long getTotalHostDnsDocuments()
           
 long getTotalHostDocuments()
           
 long getTotalHosts()
           
 long getTotalHostSize()
           
 long getTotalMimeSize()
           
 long getTotalMimeTypeDocuments()
           
 long getTotalStatusCodeDocuments()
           
 long getTotalTldDocuments()
           
 long getTotalTldSize()
           
protected static void incrementMapCount(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> map, java.lang.String key)
          Increments the counter for a key in the given Map.
protected static void incrementMapCount(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> map, java.lang.String key, long increment)
          Increments the counter for a key in the given Map by an arbitrary amount.
 boolean isStats()
           
 boolean readCrawlReport()
          Reads duration time, processed docs/sec, bandwidth, and total size of crawl from crawl-report.txt.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

totalDnsStatusCodeDocuments

protected long totalDnsStatusCodeDocuments

totalStatusCodeDocuments

protected long totalStatusCodeDocuments

totalFileTypeDocuments

protected long totalFileTypeDocuments

totalMimeTypeDocuments

protected long totalMimeTypeDocuments

totalDnsMimeTypeDocuments

protected long totalDnsMimeTypeDocuments

totalDnsHostDocuments

protected long totalDnsHostDocuments

totalHostDocuments

protected long totalHostDocuments

totalMimeSize

protected long totalMimeSize

totalDnsMimeSize

protected long totalDnsMimeSize

totalHostSize

protected long totalHostSize

totalDnsHostSize

protected long totalDnsHostSize

totalTldDocuments

protected long totalTldDocuments

totalTldSize

protected long totalTldSize

totalHosts

protected long totalHosts

durationTime

protected java.lang.String durationTime

processedDocsPerSec

protected java.lang.String processedDocsPerSec

bandwidthKbytesPerSec

protected java.lang.String bandwidthKbytesPerSec

totalDataWritten

protected java.lang.String totalDataWritten

mimeTypeDistribution

protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeDistribution
Keep track of the file types we see (mime type -> count)


mimeTypeBytes

protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeBytes

mimeTypeDnsDistribution

protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeDnsDistribution

mimeTypeDnsBytes

protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeDnsBytes

statusCodeDistribution

protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> statusCodeDistribution
Keep track of status codes


dnsStatusCodeDistribution

protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> dnsStatusCodeDistribution

hostsDistribution

protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsDistribution
Keep track of hosts


hostsBytes

protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsBytes

hostsDnsDistribution

protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsDnsDistribution

hostsDnsBytes

protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsDnsBytes

tldDistribution

protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> tldDistribution
Keep track of TLDs


tldBytes

protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> tldBytes

tldHostDistribution

protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> tldHostDistribution

processedSeedsRecords

protected transient java.util.Map<java.lang.String,SeedRecord> processedSeedsRecords
Keep track of processed seeds

Constructor Detail

StatisticsSummary

public StatisticsSummary(CrawlJob cjob)
Constructor

Parameters:
cjob - Completed crawl job
Method Detail

incrementMapCount

protected static void incrementMapCount(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> map,
                                        java.lang.String key)
Increments the counter for a key in the given Map. Used for various aggregate data.

Parameters:
map - The Map holding the counters
key - The key of the counter to increment. If the key does not exist, it is added with a count of 1. If null, the counter "unknown" is incremented.

incrementMapCount

protected static void incrementMapCount(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> map,
                                        java.lang.String key,
                                        long increment)
Increments the counter for a key in the given Map by an arbitrary amount. Used for various aggregate data. The increment amount can be negative.

Parameters:
map - The Map holding the counters
key - The key of the counter to increment. If the key does not exist, it is added with a count equal to increment. If null, the counter "unknown" is incremented.
increment - The amount to add to the counter for the key.
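
The behaviour described for the two incrementMapCount overloads can be sketched as follows. This is a minimal illustration written from the parameter descriptions above, not the Heritrix source; the class name MapCounters is hypothetical, and the methods are package-private here rather than protected.

```java
import java.util.Hashtable;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in class; not part of org.archive.crawler.admin.
class MapCounters {
    // Fallback key for null, as stated in the parameter description.
    static final String UNKNOWN = "unknown";

    // Increments the counter for key by one.
    static void incrementMapCount(Map<String, AtomicLong> map, String key) {
        incrementMapCount(map, key, 1);
    }

    // Increments the counter for key by an arbitrary (possibly negative) amount.
    static void incrementMapCount(Map<String, AtomicLong> map, String key, long increment) {
        String k = (key == null) ? UNKNOWN : key;
        AtomicLong counter = map.get(k);
        if (counter == null) {
            // First sighting of this key: start the counter at the increment.
            map.put(k, new AtomicLong(increment));
        } else {
            counter.addAndGet(increment);
        }
    }

    public static void main(String[] args) {
        Map<String, AtomicLong> counts = new Hashtable<>();
        incrementMapCount(counts, "text/html");     // new key, set to 1
        incrementMapCount(counts, "text/html", 4);  // existing key, now 5
        incrementMapCount(counts, null);            // counted under "unknown"
        incrementMapCount(counts, "image/png", -2); // negative increments are allowed
        System.out.println(counts.get("text/html")); // prints 5
        System.out.println(counts.get("unknown"));   // prints 1
        System.out.println(counts.get("image/png")); // prints -2
    }
}
```

Mapping null to "unknown" before touching the map matters when the map is a Hashtable, which rejects null keys.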

getMimeDistribution

public java.util.Hashtable getMimeDistribution()
Returns a Hashtable that describes the distribution of encountered MIME types. Key/value pairs represent MIME type -> count.

Note: All the values are wrapped in an AtomicLong.

Returns:
mimeTypeDistribution
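
Because the distribution getters return raw Hashtables whose values are AtomicLongs, callers typically unwrap each value with get(). The sketch below shows one way to turn such a map into a percentage share. The class DistributionShare and its methods are hypothetical helpers for illustration, and the map is built by hand instead of being read from a StatisticsSummary instance.

```java
import java.util.Hashtable;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper; not part of the Heritrix API.
class DistributionShare {
    // Sums every counter in a distribution map.
    static long total(Map<String, AtomicLong> distribution) {
        long sum = 0;
        for (AtomicLong value : distribution.values()) {
            sum += value.get();
        }
        return sum;
    }

    // Percentage of the total held by one key; 0.0 if the key is null, absent, or the map is empty.
    static double sharePercent(Map<String, AtomicLong> distribution, String key) {
        if (key == null) {
            return 0.0; // Hashtable.get(null) would throw, so guard first
        }
        long total = total(distribution);
        AtomicLong count = distribution.get(key);
        if (total == 0 || count == null) {
            return 0.0;
        }
        return 100.0 * count.get() / total;
    }

    public static void main(String[] args) {
        Hashtable<String, AtomicLong> mimeTypes = new Hashtable<>();
        mimeTypes.put("text/html", new AtomicLong(3));
        mimeTypes.put("image/png", new AtomicLong(1));
        System.out.println(sharePercent(mimeTypes, "text/html")); // prints 75.0
    }
}
```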

getTotalMimeTypeDocuments

public long getTotalMimeTypeDocuments()

getTotalDnsMimeTypeDocuments

public long getTotalDnsMimeTypeDocuments()

getTotalMimeSize

public long getTotalMimeSize()

getTotalDnsMimeSize

public long getTotalDnsMimeSize()

getStatusCodeDistribution

public java.util.Hashtable getStatusCodeDistribution()
Returns a Hashtable giving the distribution of HTTP status codes for successfully fetched CrawlURIs, where each key -> value pair maps a (String) code to its count. Note: All the values are wrapped in an AtomicLong.

Returns:
statusCodeDistribution

getDnsStatusCodeDistribution

public java.util.Hashtable getDnsStatusCodeDistribution()
Returns a Hashtable giving the distribution of DNS status codes for successfully fetched CrawlURIs, where each key -> value pair maps a (String) code to its count. Note: All the values are wrapped in an AtomicLong.

Returns:
dnsStatusCodeDistribution

getDnsMimeDistribution

public java.util.Hashtable getDnsMimeDistribution()

getTotalDnsStatusCodeDocuments

public long getTotalDnsStatusCodeDocuments()

getTotalStatusCodeDocuments

public long getTotalStatusCodeDocuments()

getTotalHostDocuments

public long getTotalHostDocuments()

getTotalDnsHostDocuments

public long getTotalDnsHostDocuments()

getHostsDnsDistribution

public java.util.Hashtable getHostsDnsDistribution()

getTotalHostDnsDocuments

public long getTotalHostDnsDocuments()

getTotalHostSize

public long getTotalHostSize()

getTotalDnsHostSize

public long getTotalDnsHostSize()

getTldDistribution

public java.util.Hashtable getTldDistribution()

getTldBytes

public java.util.Hashtable getTldBytes()

getTotalTldDocuments

public long getTotalTldDocuments()

getTotalTldSize

public long getTotalTldSize()

getTldHostDistribution

public java.util.Hashtable getTldHostDistribution()

getTotalHosts

public long getTotalHosts()

getDurationTime

public java.lang.String getDurationTime()

getProcessedDocsPerSec

public java.lang.String getProcessedDocsPerSec()

getBandwidthKbytesPerSec

public java.lang.String getBandwidthKbytesPerSec()

getTotalDataWritten

public java.lang.String getTotalDataWritten()

getReverseSortedCopy

public java.util.TreeMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedCopy(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> mapOfAtomicLongValues)
Sorts the entries of the given Map in descending order by their values, which must be AtomicLongs.

Elements are sorted by value from largest to smallest. Equal values are sorted in an arbitrary but consistent manner by their keys. Only items with identical value and key are considered equal. If access to the passed-in map must be synchronized, the caller is responsible for that synchronization.

Parameters:
mapOfAtomicLongValues - Map whose values are assumed to be AtomicLongs.
Returns:
a sorted map containing the same elements as the given map.
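
The ordering described above (largest value first, ties broken consistently by key) can be sketched with a TreeMap whose comparator looks up each key's count. This is an illustrative approximation, not the Heritrix implementation: the class name ReverseSortedCopy is hypothetical, and breaking ties by natural String order is an assumption, since the description only promises an arbitrary but consistent order.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in class; not part of org.archive.crawler.admin.
class ReverseSortedCopy {
    static TreeMap<String, AtomicLong> reverseSortedCopy(Map<String, AtomicLong> source) {
        // Snapshot the counts so the ordering stays stable if the counters change later.
        final Map<String, Long> snapshot = new HashMap<>();
        for (Map.Entry<String, AtomicLong> entry : source.entrySet()) {
            snapshot.put(entry.getKey(), entry.getValue().get());
        }
        // Larger counts first; equal counts fall back to key order (an assumption).
        // Keys missing from the snapshot sort last instead of causing a NullPointerException.
        Comparator<String> byValueDescThenKey = (a, b) -> {
            int byValue = Long.compare(
                    snapshot.getOrDefault(b, Long.MIN_VALUE),
                    snapshot.getOrDefault(a, Long.MIN_VALUE));
            return byValue != 0 ? byValue : a.compareTo(b);
        };
        TreeMap<String, AtomicLong> sorted = new TreeMap<>(byValueDescThenKey);
        for (Map.Entry<String, AtomicLong> entry : source.entrySet()) {
            sorted.put(entry.getKey(), entry.getValue());
        }
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, AtomicLong> hosts = new HashMap<>();
        hosts.put("a", new AtomicLong(1));
        hosts.put("b", new AtomicLong(5));
        hosts.put("c", new AtomicLong(5));
        hosts.put("d", new AtomicLong(3));
        System.out.println(reverseSortedCopy(hosts)); // prints {b=5, c=5, d=3, a=1}
    }
}
```

Including the key in the comparison is what keeps two entries with the same count from collapsing into one TreeMap entry, which matches the statement that only items with identical value and key are considered equal.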

getHostsPerTld

public long getHostsPerTld(java.lang.String tld)
Get the number of hosts with a particular TLD.

Parameters:
tld - top-level domain name
Returns:
Total crawled hosts

getBytesPerHost

public long getBytesPerHost(java.lang.String host)
Returns the accumulated number of bytes downloaded from a given host.

Parameters:
host - name of the host
Returns:
the accumulated number of bytes downloaded from a given host

getBytesPerTld

public long getBytesPerTld(java.lang.String tld)
Returns the total number of bytes downloaded for a given TLD.

Parameters:
tld - TLD
Returns:
the total number of bytes downloaded for a given TLD

getBytesPerMimeType

public long getBytesPerMimeType(java.lang.String filetype)
Returns the accumulated number of bytes from files of a given file type.

Parameters:
filetype - Filetype to check.
Returns:
the accumulated number of bytes from files of a given mime type

readCrawlReport

public boolean readCrawlReport()
Reads duration time, processed docs/sec, bandwidth, and total size of crawl from crawl-report.txt.

Returns:
true if stats found.
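
As a rough illustration of report reading, the sketch below parses "Label: value" lines into a map and reports whether anything was found. The line format, the labels in the sample text, and the class CrawlReportSketch are all assumptions made for this example; the actual layout of crawl-report.txt is whatever StatisticsTracker writes and is not confirmed here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the Heritrix implementation of readCrawlReport().
class CrawlReportSketch {
    // Parses "Label: value" lines; lines without a label before a colon are skipped.
    static Map<String, String> parseText(String reportText) {
        Map<String, String> fields = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(reportText))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int colon = line.indexOf(':');
                if (colon <= 0) {
                    continue;
                }
                String label = line.substring(0, colon).trim();
                String value = line.substring(colon + 1).trim();
                fields.put(label, value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fields;
    }

    // Mirrors the documented contract loosely: true if any stats were found.
    static boolean hasStats(Map<String, String> fields) {
        return !fields.isEmpty();
    }

    public static void main(String[] args) {
        // Sample labels are invented for illustration only.
        String sample = "Duration Time: 1h2m\nProcessed docs/sec: 0.5\n";
        Map<String, String> fields = parseText(sample);
        System.out.println(hasStats(fields));             // prints true
        System.out.println(fields.get("Duration Time"));  // prints 1h2m
    }
}
```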

getSeedRecordsSortedByStatusCode

public java.util.Iterator<SeedRecord> getSeedRecordsSortedByStatusCode()
Returns an Iterator over the seed records, sorted by status code.

Returns:
Iterator over the seed records, sorted by status code

getReverseSortedHostsDistribution

public java.util.SortedMap getReverseSortedHostsDistribution()
Return a copy of the hosts distribution in reverse-sorted (largest first) order.

Returns:
SortedMap of hosts distribution

isStats

public boolean isStats()
Returns:
True if stats were compiled, false if there were none to compile (e.g. there are no report files on disk).


Copyright © 2003-2011 Internet Archive. All Rights Reserved.