org.archive.crawler.framework
Class AbstractTracker

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.AbstractTracker
All Implemented Interfaces:
java.io.Serializable, java.lang.Runnable, javax.management.DynamicMBean, CrawlStatusListener, StatisticsTracking
Direct Known Subclasses:
StatisticsTracker

public abstract class AbstractTracker
extends ModuleType
implements StatisticsTracking, CrawlStatusListener, java.io.Serializable

A partial implementation of the StatisticsTracking interface.

It covers thread handling (launching, pausing, etc.), including keeping track of the total time actually spent crawling. Several methods for accessing the start time, end time, etc. are provided.

To handle the thread work, the class implements CrawlStatusListener and uses its events to pause, resume and stop the logging of statistics. The run() method calls logActivity() at intervals specified in the crawl order.

Implementation of logActivity (the actual logging) as well as listening for CrawlURIDisposition events is not addressed.

Author:
Kristinn Sigurdsson
See Also:
StatisticsTracking, StatisticsTracker, Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_STATS_INTERVAL
          Attribute name for the logging interval setting (in seconds)
protected  CrawlController controller
          A reference to the CrawlController of the crawl that we are to track statistics for.
protected  long crawlerEndTime
           
protected  long crawlerPauseStarted
           
protected  long crawlerStartTime
           
protected  long crawlerTotalPausedTime
           
static java.lang.Integer DEFAULT_STATISTICS_REPORT_INTERVAL
          Default period between logging stat values
protected  long lastLogPointTime
          Timestamp of when this logger last wrote something to the log
protected  boolean shouldrun
           
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.framework.StatisticsTracking
SEED_DISPOSITION_DISREGARD, SEED_DISPOSITION_FAILURE, SEED_DISPOSITION_NOT_PROCESSED, SEED_DISPOSITION_RETRY, SEED_DISPOSITION_SUCCESS
 
Constructor Summary
AbstractTracker(java.lang.String name, java.lang.String description)
           
 
Method Summary
 long crawlDuration()
          Returns how long the current crawl has been running (excluding any time spent paused/suspended/stopped) since it began.
 void crawlEnded(java.lang.String sExitMessage)
          Called when a CrawlController has ended a crawl and is about to exit.
 void crawlEnding(java.lang.String sExitMessage)
          Called when a CrawlController is ending a crawl (for any reason)
 void crawlPaused(java.lang.String statusMessage)
          Called when a CrawlController is actually paused (all threads are idle).
 void crawlPausing(java.lang.String statusMessage)
          Called when a CrawlController is going to be paused.
 void crawlResuming(java.lang.String statusMessage)
          Called when a CrawlController is resuming a crawl that had been paused.
 void crawlStarted(java.lang.String message)
          Called on crawl start.
protected  void dumpReports()
          Dump reports, if any, on request or at crawl end.
protected  void finalCleanup()
          Cleanup resources used, at crawl end.
 long getCrawlEndTime()
          If crawl has ended it will return the time it ended (given by System.currentTimeMillis() at that time).
 long getCrawlerTotalElapsedTime()
          Total amount of time spent actively crawling so far.
 long getCrawlPauseStartedTime()
          Get the time when the crawl was last paused/suspended (as given by System.currentTimeMillis() at that time).
 long getCrawlStartTime()
          Get the starting time of the crawl (as given by System.currentTimeMillis() when the crawl started).
 long getCrawlTotalPauseTime()
          Returns the number of milliseconds that the crawl spent paused or otherwise in a nonactive state.
protected  int getLogWriteInterval()
          The number of seconds to wait between writing snapshot data to log file.
 void initialize(CrawlController c)
          Sets up the Logger (including logInterval) and registers with the CrawlController for CrawlStatus and CrawlURIDisposition events.
protected  void logNote(java.lang.String note)
           
 void noteStart()
          Notify tracker that crawl has begun.
protected  void progressStatisticsEvent(java.util.EventObject e)
          A method for logging current crawler state.
 java.lang.String progressStatisticsLegend()
           
 void run()
          Start thread.
protected  void tallyCurrentPause()
          For a current pause (if any), add the paused time to the total and reset the pause start time.
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface org.archive.crawler.framework.StatisticsTracking
activeThreadCount, averageDepth, congestionRatio, currentProcessedDocsPerSec, currentProcessedKBPerSec, deepestUri, getProgressStatistics, getProgressStatisticsLine, getSeedRecordsSortedByStatusCode, processedDocsPerSec, processedKBPerSec, successfullyFetchedCount, totalBytesCrawled, totalBytesWritten, totalCount
 
Methods inherited from interface org.archive.crawler.event.CrawlStatusListener
crawlCheckpoint
 

Field Detail

DEFAULT_STATISTICS_REPORT_INTERVAL

public static final java.lang.Integer DEFAULT_STATISTICS_REPORT_INTERVAL
Default period between logging stat values


ATTR_STATS_INTERVAL

public static final java.lang.String ATTR_STATS_INTERVAL
Attribute name for the logging interval setting (in seconds)

See Also:
Constant Field Values

controller

protected transient CrawlController controller
A reference to the CrawlController of the crawl that we are to track statistics for.


crawlerStartTime

protected long crawlerStartTime

crawlerEndTime

protected long crawlerEndTime

crawlerPauseStarted

protected long crawlerPauseStarted

crawlerTotalPausedTime

protected long crawlerTotalPausedTime

lastLogPointTime

protected long lastLogPointTime
Timestamp of when this logger last wrote something to the log


shouldrun

protected volatile boolean shouldrun
Constructor Detail

AbstractTracker

public AbstractTracker(java.lang.String name,
                       java.lang.String description)
Parameters:
name -
description -
Method Detail

initialize

public void initialize(CrawlController c)
                throws FatalConfigurationException
Sets up the Logger (including logInterval) and registers with the CrawlController for CrawlStatus and CrawlURIDisposition events.

Specified by:
initialize in interface StatisticsTracking
Parameters:
c - A crawl controller instance.
Throws:
FatalConfigurationException - Not thrown here. Declared for overrides that go to the settings system for configuration.
See Also:
CrawlStatusListener, CrawlURIDispositionListener

run

public void run()
Start thread. Will call logActivity() at intervals specified by logInterval

Specified by:
run in interface java.lang.Runnable
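
The loop described for run() can be sketched roughly as follows. This is a minimal illustration, not the Heritrix source: the class name PeriodicLoggerSketch, the intervalMillis constructor parameter, the snapshots counter and the stop() helper are all assumptions made for the example. Only the volatile shouldrun flag mirrors a field documented on this page, and logActivity() is a placeholder for the real logging work.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a tracker thread; not the Heritrix implementation.
class PeriodicLoggerSketch implements Runnable {
    // Mirrors the documented volatile flag: another thread clears it to stop the loop.
    protected volatile boolean shouldrun = true;
    // Assumed: interval already converted to milliseconds.
    private final long intervalMillis;
    // Counts how many snapshots were "logged" (illustration only).
    final AtomicInteger snapshots = new AtomicInteger();

    PeriodicLoggerSketch(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Placeholder for the actual logging work.
    protected void logActivity() {
        snapshots.incrementAndGet();
    }

    @Override
    public void run() {
        while (shouldrun) {
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Treat interruption as a request to stop and restore the flag.
                Thread.currentThread().interrupt();
                shouldrun = false;
                break;
            }
            // Re-check the flag: it may have been cleared while sleeping.
            if (shouldrun) {
                logActivity();
            }
        }
    }

    void stop() {
        shouldrun = false;
    }
}
```

Declaring the flag volatile means a stop request made from another thread is seen by the loop on its next check, without additional locking.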

progressStatisticsLegend

public java.lang.String progressStatisticsLegend()
Specified by:
progressStatisticsLegend in interface StatisticsTracking
Returns:
legend for progress-statistics lines/log

noteStart

public void noteStart()
Notify tracker that crawl has begun. Must be called outside tracker's own thread, to ensure it is noted before other threads start interacting with tracker.

Specified by:
noteStart in interface StatisticsTracking

progressStatisticsEvent

protected void progressStatisticsEvent(java.util.EventObject e)
A method for logging current crawler state. This method is called by run() at intervals specified in the crawl order file. It is also invoked when pausing or stopping a crawl to capture the state at that point. The default behavior is to call CrawlController.logProgressStatistics(java.lang.String) so that CrawlController can act on the progress statistics event.

Implementations of this method should carefully consider whether it needs to be synchronized in whole or in part.

Parameters:
e - Progress statistics event.
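
One way to read the advice about synchronizing "in part" is to copy shared counters while holding a lock and do the formatting outside it. The sketch below illustrates that idea only. SnapshotLoggerSketch, record() and logSnapshot() are hypothetical names, and nothing here is taken from the Heritrix implementation.

```java
// Hypothetical illustration of partial synchronization; names are not from Heritrix.
class SnapshotLoggerSketch {
    private final Object lock = new Object();
    private long docs;   // updated by worker threads
    private long bytes;  // updated by worker threads

    // Called by worker threads for each processed document.
    void record(long docBytes) {
        synchronized (lock) {
            docs++;
            bytes += docBytes;
        }
    }

    // Copy the counters under the lock, then format outside it, so that
    // slow formatting or log I/O does not block the worker threads.
    String logSnapshot() {
        final long d;
        final long b;
        synchronized (lock) {
            d = docs;
            b = bytes;
        }
        return "docs=" + d + " bytes=" + b;
    }
}
```

Synchronizing the whole method would be simpler, at the cost of holding the lock for the duration of the logging call.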

getCrawlStartTime

public long getCrawlStartTime()
Get the starting time of the crawl (as given by System.currentTimeMillis() when the crawl started).

Returns:
time of the crawl's start

getCrawlEndTime

public long getCrawlEndTime()
If crawl has ended it will return the time it ended (given by System.currentTimeMillis() at that time).
If crawl is still going on it will return the same as System.currentTimeMillis() at the time of the call.

Returns:
The time of the crawl ending or the current time if the crawl has not ended.

getCrawlTotalPauseTime

public long getCrawlTotalPauseTime()
Returns the number of milliseconds that the crawl spent paused or otherwise in a nonactive state.

Returns:
the number of msec. that the crawl was paused or otherwise suspended.

getCrawlPauseStartedTime

public long getCrawlPauseStartedTime()
Get the time when the crawl was last paused/suspended (as given by System.currentTimeMillis() at that time). Will be 0 if the crawl is not currently paused.

Returns:
time of the crawl's last pause/suspend or 0 if the crawl is not currently paused.

getCrawlerTotalElapsedTime

public long getCrawlerTotalElapsedTime()
Description copied from interface: StatisticsTracking
Total amount of time spent actively crawling so far.

Returns the total amount of time (in milliseconds) that has elapsed from the start of the crawl until the current time or, if the crawl has ended, until the end of the crawl, minus any time spent paused.

Specified by:
getCrawlerTotalElapsedTime in interface StatisticsTracking
Returns:
Total amount of time (in msec.) spent crawling so far.
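
The arithmetic described here can be illustrated with a small helper. This is a sketch under stated assumptions, not the actual method body. The parameter names echo the fields documented on this page, but the helper ElapsedTimeSketch.activeMillis, the explicit now argument, the use of 0 to mean "not started" or "not ended", and the handling of a pause that is still in progress are all assumptions.

```java
// Hypothetical helper; the logic is an assumption, not the Heritrix source.
class ElapsedTimeSketch {
    static long activeMillis(long start, long end, long pauseStarted,
                             long totalPaused, long now) {
        if (start == 0) {
            return 0; // assumed convention: crawl has not started yet
        }
        // Use the end time if the crawl has ended, otherwise the current time.
        long stop = (end > 0) ? end : now;
        long paused = totalPaused;
        if (pauseStarted > 0) {
            // Assumed: a pause still in progress also counts as paused time.
            paused += stop - pauseStarted;
        }
        return stop - start - paused;
    }
}
```

For example, with a start time of 1000, a current time of 11000 and 2000 ms of completed pauses, the helper returns 11000 - 1000 - 2000 = 8000. If a further pause began at 9000 and is still in progress, it returns 6000.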

getLogWriteInterval

protected int getLogWriteInterval()
The number of seconds to wait between writing snapshot data to log file.

Returns:
the number of seconds to wait between writing snapshot data to log file.

crawlPausing

public void crawlPausing(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is going to be paused.

Specified by:
crawlPausing in interface CrawlStatusListener
Parameters:
statusMessage - Should be STATUS_WAITING_FOR_PAUSE. Passed for convenience
See Also:
CrawlStatusListener.crawlPausing(java.lang.String)

logNote

protected void logNote(java.lang.String note)

crawlPaused

public void crawlPaused(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is actually paused (all threads are idle).

Specified by:
crawlPaused in interface CrawlStatusListener
Parameters:
statusMessage - Should be CrawlJob.STATUS_PAUSED. Passed for convenience

crawlResuming

public void crawlResuming(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is resuming a crawl that had been paused.

Specified by:
crawlResuming in interface CrawlStatusListener
Parameters:
statusMessage - Should be CrawlJob.STATUS_RUNNING. Passed for convenience

tallyCurrentPause

protected void tallyCurrentPause()
For a current pause (if any), add the paused time to the total and reset the pause start time.
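
The bookkeeping described for tallyCurrentPause() might look roughly like the sketch below. It assumes that "reset" means setting the pause start time back to 0, which matches the documented behavior of getCrawlPauseStartedTime(). The class PauseTallySketch, the pauseAt() helper and the explicit now parameter are hypothetical and are used so the example is deterministic. The documented method takes no arguments.

```java
// Hypothetical sketch of pause bookkeeping; not the Heritrix implementation.
class PauseTallySketch {
    protected long crawlerPauseStarted;     // 0 when the crawl is not paused
    protected long crawlerTotalPausedTime;  // accumulated paused time in msec

    // Record the start of a pause, unless one is already in progress.
    void pauseAt(long now) {
        if (crawlerPauseStarted == 0) {
            crawlerPauseStarted = now;
        }
    }

    // Add the current pause (if any) to the total, then clear the marker
    // so the same pause is never counted twice.
    void tallyCurrentPause(long now) {
        if (crawlerPauseStarted > 0) {
            crawlerTotalPausedTime += now - crawlerPauseStarted;
        }
        crawlerPauseStarted = 0;
    }
}
```

Because the marker is cleared after tallying, calling the method again without a new pause leaves the total unchanged.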


crawlEnding

public void crawlEnding(java.lang.String sExitMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is ending a crawl (for any reason)

Specified by:
crawlEnding in interface CrawlStatusListener
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlJob

crawlEnded

public void crawlEnded(java.lang.String sExitMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController has ended a crawl and is about to exit.

Specified by:
crawlEnded in interface CrawlStatusListener
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlStatusListener.crawlEnded(java.lang.String)

crawlStarted

public void crawlStarted(java.lang.String message)
Description copied from interface: CrawlStatusListener
Called on crawl start.

Specified by:
crawlStarted in interface CrawlStatusListener
Parameters:
message - Start message.

dumpReports

protected void dumpReports()
Dump reports, if any, on request or at crawl end.


finalCleanup

protected void finalCleanup()
Cleanup resources used, at crawl end.


crawlDuration

public long crawlDuration()
Description copied from interface: StatisticsTracking
Returns how long the current crawl has been running (excluding any time spent paused/suspended/stopped) since it began.

Specified by:
crawlDuration in interface StatisticsTracking
Returns:
The length of time - in msec - that this crawl has been running.
See Also:
StatisticsTracking.crawlDuration()


Copyright © 2003-2011 Internet Archive. All Rights Reserved.