10. Writing a Scope

A CrawlScope [3] instance defines which URIs are "in" a particular crawl. It is essentially a Filter (it actually subclasses the Filter class) that looks at the totality of information available about a CandidateURI/CrawlURI instance and determines whether that URI should be scheduled for crawling. Dynamic information inherent in the discovery of the URI, such as the path by which it was discovered, may be considered. Dynamic information that requires consulting external and potentially volatile sources, such as current robots.txt requirements and the history of attempts to crawl the same URI, should NOT be considered. Those potentially high-latency decisions should be made at another step.

As with Filters, Scopes are due to be refactored. For that reason, this section describes only briefly how to create new Scopes.

All Scopes should subclass the CrawlScope class. When you create a Filter you override the innerAccepts method; CrawlScope, however, implements innerAccepts itself and offers several other methods that should be overridden instead. These methods act as different types of filters that the URI is checked against. In addition, the CrawlScope class offers a list of exclude filters that can be set on every scope. If the URI is accepted by (that is, matches the test of) any filter in the exclude list, it is considered out of scope. The implementation of the innerAccepts method in CrawlScope is as follows:

protected final boolean innerAccepts(Object o) {
    return ((isSeed(o) || focusAccepts(o)) || additionalFocusAccepts(o) ||
            transitiveAccepts(o)) && !excludeAccepts(o);
}

The result is that a URI is considered inside the crawl's scope if it is a seed or is accepted by any of focusAccepts, additionalFocusAccepts or transitiveAccepts, unless it is matched by any of the exclude filters.
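
The decision logic described above can be tried out in isolation with a minimal sketch. It does not use the real Heritrix classes: SketchScope, PrefixScope and ScopeSketchDemo are hypothetical stand-ins, candidate URIs are reduced to plain strings, and exclude filters are modelled as java.util.function.Predicate instances rather than Heritrix Filters. Only the boolean structure of innerAccepts and the names of the hook methods are taken from the listing above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for CrawlScope, reduced to the decision logic quoted
// above. It is NOT the real Heritrix class: candidates are plain strings and
// exclude filters are plain predicates.
abstract class SketchScope {
    private final List<String> seeds = new ArrayList<>();
    private final List<Predicate<String>> excludeFilters = new ArrayList<>();

    void addSeed(String uri) { seeds.add(uri); }

    void addExcludeFilter(Predicate<String> filter) { excludeFilters.add(filter); }

    // Same boolean structure as the quoted innerAccepts; final so that
    // subclasses customise the hooks below rather than this method.
    final boolean accepts(String uri) {
        return (isSeed(uri) || focusAccepts(uri) || additionalFocusAccepts(uri)
                || transitiveAccepts(uri)) && !excludeAccepts(uri);
    }

    boolean isSeed(String uri) { return seeds.contains(uri); }

    // Hooks a subclass may override; in this sketch they accept nothing by default.
    boolean focusAccepts(String uri) { return false; }

    boolean additionalFocusAccepts(String uri) { return false; }

    boolean transitiveAccepts(String uri) { return false; }

    // A URI is excluded as soon as any exclude filter matches it.
    boolean excludeAccepts(String uri) {
        for (Predicate<String> filter : excludeFilters) {
            if (filter.test(uri)) {
                return true;
            }
        }
        return false;
    }
}

// Example scope: the focus is every URI under a given prefix.
class PrefixScope extends SketchScope {
    private final String prefix;

    PrefixScope(String prefix) { this.prefix = prefix; }

    @Override
    boolean focusAccepts(String uri) { return uri.startsWith(prefix); }
}

class ScopeSketchDemo {
    public static void main(String[] args) {
        PrefixScope scope = new PrefixScope("http://example.org/docs/");
        scope.addSeed("http://example.org/");
        scope.addExcludeFilter(uri -> uri.endsWith(".pdf"));

        System.out.println(scope.accepts("http://example.org/"));            // a seed
        System.out.println(scope.accepts("http://example.org/docs/a.html")); // focus match
        System.out.println(scope.accepts("http://example.org/docs/a.pdf"));  // focus match, but excluded
        System.out.println(scope.accepts("http://other.example/"));          // matches nothing
    }
}
```

In this sketch the first two URIs are in scope and the last two are not: the PDF is ruled out by the exclude filter even though the focus accepts it, which mirrors the "unless it is matched by any of the exclude filters" rule. A real scope would subclass CrawlScope and work with CandidateURI/CrawlURI objects instead of strings.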

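When writing your own scope, the methods you might want to override are the three accept methods that appear in innerAccepts above: focusAccepts, additionalFocusAccepts and transitiveAccepts.
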

[3] Problems have been identified with how the Scopes are defined. Please see the user manual for a discussion of the problems with the current Scopes. The proposed changes to the Scopes will affect the Filters as well.