12. Writing a Statistics Tracker

A Statistics Tracker is a module that monitors the crawl and records statistics of interest to it.

Statistics Trackers must implement the StatisticsTracking interface. The interface imposes very little on the module.

Its initialization method provides the new statistics tracker with a reference to the CrawlController, which gives the module access to any part of the crawl.

Generally, statistics trackers gather information either by querying the data exposed by the Frontier or by listening for CrawlURI disposition events and crawl status events.

The interface extends Runnable. This is based on the assumption that statistics trackers are proactive in gathering their information. The CrawlController will start each statistics tracker once the crawl begins. If this facility is not needed in a statistics tracker (that is, all information is gathered passively), simply implement the run() method as an empty method.
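As a minimal sketch of the passive case described above, the following shows a tracker whose run() method is empty and whose counters are updated only when events are pushed to it. The SketchTracking interface, the Object-typed controller argument and the uriFinished() callback are hypothetical stand-ins introduced for illustration; the real StatisticsTracking interface takes a CrawlController and declares additional methods that are not reproduced here.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified stand-in for the StatisticsTracking interface.
// The real interface takes a CrawlController and declares more methods.
interface SketchTracking extends Runnable {
    void initialize(Object controller);
}

// A passive tracker: it never polls, it only counts events pushed to it.
class PassiveCountingTracker implements SketchTracking {
    private final AtomicLong finished = new AtomicLong();
    private Object controller;

    @Override
    public void initialize(Object controller) {
        // Keep the reference so other parts of the crawl can be reached later.
        this.controller = controller;
    }

    // Hypothetical callback standing in for a CrawlURI disposition event.
    public void uriFinished(String uri) {
        finished.incrementAndGet();
    }

    public long finishedCount() {
        return finished.get();
    }

    @Override
    public void run() {
        // Intentionally empty: all information is gathered passively.
    }
}
```

Because run() returns immediately, the thread started for such a tracker simply ends, while the event callbacks keep updating the counters for the rest of the crawl.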

Note

For new statistics tracking modules to be available in the web user interface, their class name must be added to the StatisticsTracking.options file under the conf/modules directory. The class's full name (with package info) should be written on its own line, followed by a '|' and a descriptive name (containing only [a-z,A-Z]).
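For illustration, an entry following the format described in the note might look like the line below. The package and class name are hypothetical; substitute the full name of the actual module.

```
org.example.tracker.MyStatisticsTracker|MyTracker
```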

12.1. AbstractTracker

A partial implementation of a statistics tracker is provided in the framework package. The AbstractTracker implements the StatisticsTracking interface and adds the infrastructure needed for taking snapshots of the crawler status.

This is done by implementing the thread aspects of the statistics tracker. Classes extending the AbstractTracker therefore need not worry about thread handling; implementing the logActivity() method allows them to poll any information at fixed intervals.

AbstractTracker also listens for crawl status events and pauses and stops its activity based on them.
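The division of labour described above can be sketched roughly as follows. This is not the AbstractTracker source; PollingTrackerSketch, its constructor argument and the crawlEnded() method are hypothetical names, chosen only to show a base class that owns the polling loop, calls logActivity() at a fixed interval, and stops when told that the crawl has ended. Pausing is left out to keep the sketch short.

```java
// Hypothetical base class: owns the loop so that subclasses only poll.
abstract class PollingTrackerSketch implements Runnable {
    private final long intervalMillis;
    private volatile boolean shouldRun = true;

    PollingTrackerSketch(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Subclasses take their snapshot of the crawl here.
    protected abstract void logActivity();

    // Hypothetical stand-in for reacting to a crawl status event.
    public void crawlEnded() {
        shouldRun = false;
    }

    @Override
    public void run() {
        while (shouldRun) {
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and leave the loop.
                Thread.currentThread().interrupt();
                return;
            }
            if (shouldRun) {
                logActivity();
            }
        }
    }
}

// Example subclass: counts snapshots and ends the loop after three of them.
class SnapshotCounter extends PollingTrackerSketch {
    private int snapshots = 0;

    SnapshotCounter(long intervalMillis) {
        super(intervalMillis);
    }

    @Override
    protected void logActivity() {
        snapshots++;
        if (snapshots >= 3) {
            crawlEnded(); // simulate the crawl ending after three snapshots
        }
    }

    int snapshots() {
        return snapshots;
    }
}
```

The subclass contains no thread code at all, which is the point of the design: everything related to timing and shutdown lives in the base class.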

12.2. Provided StatisticsTracker

The admin package contains the only provided implementation of the statistics tracking interface. The StatisticsTracker is designed to write progress information to the progress-statistics.log and to provide the web user interface with information about ongoing and completed crawls. It also dumps various reports at the end of each crawl.