java.lang.Object org.archive.crawler.admin.StatisticsSummary
public class StatisticsSummary
This class provides descriptive statistics of a finished crawl job by using the crawl report files generated by StatisticsTracker. Any formatting changes to the way StatisticsTracker writes to the summary crawl reports will require changes to this class.
The following statistics are accessible from this class:
TODO: Make it so summarizing is not done all in RAM so we avoid OOME.
See Also: StatisticsTracker
Field Summary | |
---|---|
protected java.lang.String | bandwidthKbytesPerSec |
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> | dnsStatusCodeDistribution |
protected java.lang.String | durationTime |
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> | hostsBytes |
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> | hostsDistribution: Keep track of hosts |
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> | hostsDnsBytes |
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> | hostsDnsDistribution |
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> | mimeTypeBytes |
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> | mimeTypeDistribution: Keep track of the file types we see (mime type -> count) |
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> | mimeTypeDnsBytes |
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> | mimeTypeDnsDistribution |
protected java.lang.String | processedDocsPerSec |
protected java.util.Map<java.lang.String,SeedRecord> | processedSeedsRecords: Keep track of processed seeds |
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> | statusCodeDistribution: Keep track of status codes |
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> | tldBytes |
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> | tldDistribution: Keep track of TLDs |
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> | tldHostDistribution |
protected java.lang.String | totalDataWritten |
protected long | totalDnsHostDocuments |
protected long | totalDnsHostSize |
protected long | totalDnsMimeSize |
protected long | totalDnsMimeTypeDocuments |
protected long | totalDnsStatusCodeDocuments |
protected long | totalFileTypeDocuments |
protected long | totalHostDocuments |
protected long | totalHosts |
protected long | totalHostSize |
protected long | totalMimeSize |
protected long | totalMimeTypeDocuments |
protected long | totalStatusCodeDocuments |
protected long | totalTldDocuments |
protected long | totalTldSize |
Constructor Summary | |
---|---|
StatisticsSummary(CrawlJob cjob)
Constructor |
Method Summary | |
---|---|
java.lang.String | getBandwidthKbytesPerSec() |
long | getBytesPerHost(java.lang.String host): Returns the accumulated number of bytes downloaded from a given host. |
long | getBytesPerMimeType(java.lang.String filetype): Returns the accumulated number of bytes from files of a given file type. |
long | getBytesPerTld(java.lang.String tld): Returns the total number of bytes downloaded for a given TLD. |
java.util.Hashtable | getDnsMimeDistribution() |
java.util.Hashtable | getDnsStatusCodeDistribution(): Returns a Hashtable representing the distribution of DNS status codes for successfully fetched CURIs, where key -> value is (String) code -> (AtomicLong) count. |
java.lang.String | getDurationTime() |
java.util.Hashtable | getHostsDnsDistribution() |
long | getHostsPerTld(java.lang.String tld): Returns the number of hosts with a particular TLD. |
java.util.Hashtable | getMimeDistribution(): Returns a Hashtable that contains the distribution of encountered MIME types. |
java.lang.String | getProcessedDocsPerSec() |
java.util.TreeMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> | getReverseSortedCopy(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> mapOfAtomicLongValues): Sorts the entries of the given map in descending order by their values, which must be AtomicLongs. |
java.util.SortedMap | getReverseSortedHostsDistribution(): Returns a copy of the hosts distribution in reverse-sorted (largest first) order. |
java.util.Iterator<SeedRecord> | getSeedRecordsSortedByStatusCode(): Returns a sorted Iterator of seed records, ordered by status code. |
java.util.Hashtable | getStatusCodeDistribution(): Returns a Hashtable representing the distribution of HTTP status codes for successfully fetched CURIs, where key -> value is (String) code -> (AtomicLong) count. |
java.util.Hashtable | getTldBytes() |
java.util.Hashtable | getTldDistribution() |
java.util.Hashtable | getTldHostDistribution() |
java.lang.String | getTotalDataWritten() |
long | getTotalDnsHostDocuments() |
long | getTotalDnsHostSize() |
long | getTotalDnsMimeSize() |
long | getTotalDnsMimeTypeDocuments() |
long | getTotalDnsStatusCodeDocuments() |
long | getTotalHostDnsDocuments() |
long | getTotalHostDocuments() |
long | getTotalHosts() |
long | getTotalHostSize() |
long | getTotalMimeSize() |
long | getTotalMimeTypeDocuments() |
long | getTotalStatusCodeDocuments() |
long | getTotalTldDocuments() |
long | getTotalTldSize() |
protected static void | incrementMapCount(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> map, java.lang.String key): Increments a counter for a key in a given map. |
protected static void | incrementMapCount(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> map, java.lang.String key, long increment): Increments a counter for a key in a given map by an arbitrary amount. |
boolean | isStats() |
boolean | readCrawlReport(): Reads duration time, processed docs/sec, bandwidth, and total size of crawl from crawl-report.txt. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected long totalDnsStatusCodeDocuments
protected long totalStatusCodeDocuments
protected long totalFileTypeDocuments
protected long totalMimeTypeDocuments
protected long totalDnsMimeTypeDocuments
protected long totalDnsHostDocuments
protected long totalHostDocuments
protected long totalMimeSize
protected long totalDnsMimeSize
protected long totalHostSize
protected long totalDnsHostSize
protected long totalTldDocuments
protected long totalTldSize
protected long totalHosts
protected java.lang.String durationTime
protected java.lang.String processedDocsPerSec
protected java.lang.String bandwidthKbytesPerSec
protected java.lang.String totalDataWritten
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeDistribution
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeBytes
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeDnsDistribution
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> mimeTypeDnsBytes
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> statusCodeDistribution
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> dnsStatusCodeDistribution
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsDistribution
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsBytes
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsDnsDistribution
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> hostsDnsBytes
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> tldDistribution
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> tldBytes
protected java.util.Hashtable<java.lang.String,java.util.concurrent.atomic.AtomicLong> tldHostDistribution
protected transient java.util.Map<java.lang.String,SeedRecord> processedSeedsRecords
Constructor Detail |
---|
public StatisticsSummary(CrawlJob cjob)
Parameters:
cjob - Completed crawl job
Method Detail |
---|
protected static void incrementMapCount(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> map, java.lang.String key)
Parameters:
map - The map holding the counters
key - The key for the counter to be incremented. If it does not exist, it will be added (set to 1). If null, the counter "unknown" will be incremented.
protected static void incrementMapCount(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> map, java.lang.String key, long increment)
Parameters:
map - The map holding the counters
key - The key for the counter to be incremented. If it does not exist, it will be added (set equal to increment). If null, the counter "unknown" will be incremented.
increment - The amount by which to increment the counter related to the key.
public java.util.Hashtable getMimeDistribution()
Note: All the values are wrapped with an AtomicLong.
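The parameter notes for the two incrementMapCount overloads describe three rules: a missing key is created with the increment as its initial value, an existing key is added to, and a null key is counted under "unknown". The following is a minimal sketch of those rules, not the Heritrix source. The class name MapCountSketch and the constant UNKNOWN_KEY are hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: mirrors the documented behaviour of incrementMapCount.
// MapCountSketch and UNKNOWN_KEY are illustrative names, not Heritrix API.
final class MapCountSketch {

    // Assumption taken from the parameter notes: null keys are counted as "unknown".
    static final String UNKNOWN_KEY = "unknown";

    // Single-step variant: increments the counter for the key by 1.
    static void incrementMapCount(Map<String, AtomicLong> map, String key) {
        incrementMapCount(map, key, 1L);
    }

    // General variant: increments the counter for the key by the given amount.
    static void incrementMapCount(Map<String, AtomicLong> map, String key, long increment) {
        String effectiveKey = (key == null) ? UNKNOWN_KEY : key;
        AtomicLong counter = map.get(effectiveKey);
        if (counter == null) {
            // First sighting of this key: the counter starts at the increment.
            map.put(effectiveKey, new AtomicLong(increment));
        } else {
            counter.addAndGet(increment);
        }
    }

    public static void main(String[] args) {
        Map<String, AtomicLong> counts = new HashMap<>();
        incrementMapCount(counts, "text/html");      // created, set to 1
        incrementMapCount(counts, "text/html", 4L);  // 1 + 4
        incrementMapCount(counts, null);             // counted under "unknown"
        System.out.println(counts.get("text/html").get()); // prints 5
        System.out.println(counts.get("unknown").get());   // prints 1
    }
}
```

The get-then-put sequence in this sketch is not atomic, so callers sharing a map across threads would need their own synchronization.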
public long getTotalMimeTypeDocuments()
public long getTotalDnsMimeTypeDocuments()
public long getTotalMimeSize()
public long getTotalDnsMimeSize()
public java.util.Hashtable getStatusCodeDistribution()
Note: All the values are wrapped with an AtomicLong.
public java.util.Hashtable getDnsStatusCodeDistribution()
Note: All the values are wrapped with an AtomicLong.
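Because the distribution getters return code -> count tables whose values are AtomicLongs, a caller usually has to unwrap each value before doing arithmetic. The sketch below shows one way to turn such a table into percentage shares. DistributionShareSketch and percentages are hypothetical names, and the sample counts are made up. The getters return a raw Hashtable, so a real caller would also need an unchecked cast to a typed map first.

```java
import java.util.Hashtable;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: DistributionShareSketch is an illustrative helper, not Heritrix API.
final class DistributionShareSketch {

    // Converts code -> count into code -> percentage of the total count.
    static Map<String, Double> percentages(Map<String, AtomicLong> distribution) {
        long total = 0L;
        for (AtomicLong count : distribution.values()) {
            total += count.get(); // unwrap the AtomicLong before summing
        }
        Map<String, Double> shares = new TreeMap<>(); // sorted by code for stable output
        for (Map.Entry<String, AtomicLong> e : distribution.entrySet()) {
            double share = (total == 0L) ? 0.0 : 100.0 * e.getValue().get() / total;
            shares.put(e.getKey(), share);
        }
        return shares;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for a status code distribution.
        Hashtable<String, AtomicLong> statusCodes = new Hashtable<>();
        statusCodes.put("200", new AtomicLong(75));
        statusCodes.put("404", new AtomicLong(25));
        System.out.println(percentages(statusCodes)); // prints {200=75.0, 404=25.0}
    }
}
```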
public java.util.Hashtable getDnsMimeDistribution()
public long getTotalDnsStatusCodeDocuments()
public long getTotalStatusCodeDocuments()
public long getTotalHostDocuments()
public long getTotalDnsHostDocuments()
public java.util.Hashtable getHostsDnsDistribution()
public long getTotalHostDnsDocuments()
public long getTotalHostSize()
public long getTotalDnsHostSize()
public java.util.Hashtable getTldDistribution()
public java.util.Hashtable getTldBytes()
public long getTotalTldDocuments()
public long getTotalTldSize()
public java.util.Hashtable getTldHostDistribution()
public long getTotalHosts()
public java.lang.String getDurationTime()
public java.lang.String getProcessedDocsPerSec()
public java.lang.String getBandwidthKbytesPerSec()
public java.lang.String getTotalDataWritten()
public java.util.TreeMap<java.lang.String,java.util.concurrent.atomic.AtomicLong> getReverseSortedCopy(java.util.Map<java.lang.String,java.util.concurrent.atomic.AtomicLong> mapOfAtomicLongValues)
Sort the entries of the given map in descending order by their values, which must be AtomicLongs.
Elements are sorted by value from largest to smallest. Equal values are sorted in an arbitrary but consistent manner by their keys. Only items with identical value and key are considered equal. If the passed-in map requires access to be synchronized, the caller should ensure this synchronization.
Parameters:
mapOfAtomicLongValues - Assumes values are AtomicLongs.
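The ordering described for getReverseSortedCopy (largest value first, ties broken consistently by key) can be sketched with a TreeMap whose comparator looks up each key's count. This is one possible way to get that ordering and is not claimed to be how Heritrix implements it. ReverseSortedCopySketch and reverseSortedCopy are hypothetical names, and the sketch assumes non-null keys.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: one way to obtain a value-descending copy; not the Heritrix source.
final class ReverseSortedCopySketch {

    static TreeMap<String, AtomicLong> reverseSortedCopy(Map<String, AtomicLong> source) {
        // Snapshot the counts so the ordering cannot shift if the source changes later.
        final Map<String, Long> snapshot = new HashMap<>();
        for (Map.Entry<String, AtomicLong> e : source.entrySet()) {
            snapshot.put(e.getKey(), e.getValue().get());
        }
        Comparator<String> byValueDescThenKey = (a, b) -> {
            long va = snapshot.getOrDefault(a, Long.MIN_VALUE);
            long vb = snapshot.getOrDefault(b, Long.MIN_VALUE);
            int byValue = Long.compare(vb, va);               // larger counts first
            return (byValue != 0) ? byValue : a.compareTo(b); // ties broken by key
        };
        TreeMap<String, AtomicLong> sorted = new TreeMap<>(byValueDescThenKey);
        for (Map.Entry<String, AtomicLong> e : source.entrySet()) {
            // Copy each counter so the result is independent of the source map.
            sorted.put(e.getKey(), new AtomicLong(e.getValue().get()));
        }
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, AtomicLong> hosts = new HashMap<>();
        hosts.put("a", new AtomicLong(1));
        hosts.put("b", new AtomicLong(3));
        hosts.put("c", new AtomicLong(3));
        System.out.println(reverseSortedCopy(hosts).keySet()); // prints [b, c, a]
    }
}
```

As the method description notes, a caller whose source map is shared between threads would still need to synchronize around the call, since the sketch iterates the source twice.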
public long getHostsPerTld(java.lang.String tld)
Parameters:
tld - top-level domain name
public long getBytesPerHost(java.lang.String host)
Parameters:
host - name of the host
public long getBytesPerTld(java.lang.String tld)
Parameters:
tld - TLD
public long getBytesPerMimeType(java.lang.String filetype)
Parameters:
filetype - Filetype to check.
public boolean readCrawlReport()
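The method summary says readCrawlReport() reads duration time, processed docs/sec, bandwidth, and total size from crawl-report.txt. The exact layout of that file is not given on this page, so the sketch below assumes a simple "Label: value" line format. The labels in the sample, the class name CrawlReportSketch, and the parse helper are all hypothetical and only illustrate the general approach.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: assumes "Label: value" lines; the real crawl-report.txt may differ.
final class CrawlReportSketch {

    // Splits each line at the first ": " and collects label -> value pairs.
    static Map<String, String> parse(String reportText) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : reportText.split("\\R")) {
            int sep = line.indexOf(": ");
            if (sep > 0) {
                fields.put(line.substring(0, sep).trim(), line.substring(sep + 2).trim());
            }
            // Lines without a "Label: value" shape are skipped.
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical report content used purely for illustration.
        String sample = "Duration Time: 1h2m3s\n"
                + "Processed docs/sec: 4.2\n"
                + "Bandwidth in Kbytes/sec: 120\n"
                + "Total Raw Data Size in Bytes: 2048 (2.0 KB)\n";
        Map<String, String> fields = parse(sample);
        System.out.println(fields.get("Duration Time")); // prints 1h2m3s
    }
}
```

Keeping the values as strings matches the String-typed fields on this class (durationTime, processedDocsPerSec, bandwidthKbytesPerSec, totalDataWritten).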
public java.util.Iterator<SeedRecord> getSeedRecordsSortedByStatusCode()
public java.util.SortedMap getReverseSortedHostsDistribution()
public boolean isStats()