5. Settings

The settings framework is designed to be a flexible way to configure a crawl, with special treatment for subparts of the web, without adding too much performance overhead. If you want to write a module that should be configurable through the user interface, it is important to have a basic understanding of the settings framework (for the impatient, have a look at Section 6, “Common needs for all configurable modules” for code examples). At its core, the settings framework is a way to keep persistent, context-sensitive configuration settings for any class in the crawler.

All classes in the crawler that have configurable settings subclass ComplexType or one of its descendants. ComplexType implements the javax.management.DynamicMBean interface. This gives you a way to ask the object which attributes it supports, along with the standard methods for getting and setting these attributes.
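To illustrate what the DynamicMBean contract offers, here is a minimal sketch that uses only the standard javax.management API. The class DemoModule and the attribute names maxRetries and userAgent are hypothetical and are not part of Heritrix; the real ComplexType does considerably more.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.Attribute;
import javax.management.AttributeList;
import javax.management.AttributeNotFoundException;
import javax.management.DynamicMBean;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanInfo;

// Hypothetical stand-in for a configurable module; not a Heritrix class.
class DemoModule implements DynamicMBean {
    private final Map<String, Object> values = new LinkedHashMap<>();

    DemoModule() {
        values.put("maxRetries", 3);
        values.put("userAgent", "demo-crawler");
    }

    @Override
    public Object getAttribute(String name) throws AttributeNotFoundException {
        if (!values.containsKey(name)) {
            throw new AttributeNotFoundException(name);
        }
        return values.get(name);
    }

    @Override
    public void setAttribute(Attribute attribute) throws AttributeNotFoundException {
        if (!values.containsKey(attribute.getName())) {
            throw new AttributeNotFoundException(attribute.getName());
        }
        values.put(attribute.getName(), attribute.getValue());
    }

    @Override
    public AttributeList getAttributes(String[] names) {
        AttributeList result = new AttributeList();
        for (String name : names) {
            if (values.containsKey(name)) {
                result.add(new Attribute(name, values.get(name)));
            }
        }
        return result;
    }

    @Override
    public AttributeList setAttributes(AttributeList attributes) {
        // Apply only the attributes this module knows about and report them back.
        AttributeList applied = new AttributeList();
        for (Attribute attribute : attributes.asList()) {
            if (values.containsKey(attribute.getName())) {
                values.put(attribute.getName(), attribute.getValue());
                applied.add(attribute);
            }
        }
        return applied;
    }

    @Override
    public Object invoke(String action, Object[] params, String[] signature) {
        throw new UnsupportedOperationException("no operations in this sketch");
    }

    @Override
    public MBeanInfo getMBeanInfo() {
        // Describe every attribute as readable and writable.
        MBeanAttributeInfo[] infos = values.entrySet().stream()
            .map(e -> new MBeanAttributeInfo(e.getKey(), e.getValue().getClass().getName(),
                    "demo attribute", true, true, false))
            .toArray(MBeanAttributeInfo[]::new);
        return new MBeanInfo(DemoModule.class.getName(), "demo module", infos, null, null, null);
    }
}

class DemoModuleMain {
    public static void main(String[] args) throws Exception {
        DynamicMBean module = new DemoModule();
        // Ask the object which attributes it supports, then read each one.
        for (MBeanAttributeInfo info : module.getMBeanInfo().getAttributes()) {
            System.out.println(info.getName() + " = " + module.getAttribute(info.getName()));
        }
        module.setAttribute(new Attribute("maxRetries", 5));
    }
}
```

A caller needs no compile-time knowledge of the module: it discovers the attribute names through getMBeanInfo() and then reads or writes them by name.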

The entry point into the settings framework is the SettingsHandler. This class is responsible for loading settings from and saving them to persistent storage, and for interconnecting the different parts of the framework.

Figure 3. Schematic view of the Settings Framework


5.1. Settings hierarchy

The settings framework supports a hierarchy of settings. This hierarchy is built from CrawlerSettings objects. At the top is a settings object representing the global settings (the order), which holds all the settings that a crawl job needs to run. Beneath this global object there is one "per" settings object for each host or domain whose settings should override the order for that particular host or domain.

When the settings framework is asked for an attribute for a specific host, it first checks whether the attribute is set for that particular host. If it is, that value is returned. If not, the framework goes up one level at a time until it eventually reaches the order object and returns the global value. If no value is set there either (normally one would be), a hard-coded default value is returned.
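The lookup described above can be sketched as a walk up a parent chain. This is only an illustration: SettingsLevel, resolve and the attribute names are hypothetical and do not mirror the actual CrawlerSettings API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of one level in the settings hierarchy; not the Heritrix CrawlerSettings class.
class SettingsLevel {
    private final String scope;
    private final SettingsLevel parent;   // null for the global level
    private final Map<String, Object> values = new HashMap<>();

    SettingsLevel(String scope, SettingsLevel parent) {
        this.scope = scope;
        this.parent = parent;
    }

    String scope() {
        return scope;
    }

    void set(String name, Object value) {
        values.put(name, value);
    }

    // Walk up from this level until a value is found; otherwise use the hard-coded default.
    Object resolve(String name, Object hardCodedDefault) {
        for (SettingsLevel level = this; level != null; level = level.parent) {
            if (level.values.containsKey(name)) {
                return level.values.get(name);
            }
        }
        return hardCodedDefault;
    }
}

class HierarchyDemo {
    public static void main(String[] args) {
        SettingsLevel global = new SettingsLevel("global", null);
        SettingsLevel domain = new SettingsLevel("example.org", global);
        SettingsLevel host = new SettingsLevel("www.example.org", domain);

        global.set("delayMs", 2000);
        domain.set("delayMs", 5000);

        System.out.println(host.resolve("delayMs", 1000));  // 5000, found on the domain level
        System.out.println(host.resolve("maxDepth", 20));   // 20, not set anywhere
    }
}
```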

All per-domain/host settings objects contain only those settings that are to be overridden for that particular domain or host. The convention is to name the top-level object "global settings" and the objects beneath it "per settings" or "overrides" (although the refinements described next also override settings).

To further complicate the picture, there are also settings objects called refinements. An object of this type belongs to a global or per settings object and overrides the settings in its owner's object if some criterion is met. Such a criterion could be that the URI in question matches a regular expression, or that the settings are consulted at a time of day that falls within a given time span.
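As a rough sketch of how a refinement might be evaluated, the following models a refinement as a set of override values guarded by an optional URI pattern and an optional time span. All class, method and attribute names are hypothetical, and requiring every configured criterion to hold at once is an assumption of this sketch, not a statement about the Heritrix implementation.

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical refinement: overrides its owner's values while its criteria hold.
class RefinementSketch {
    private final Pattern uriPattern;   // null means "any URI"
    private final LocalTime from;       // null means "any time of day"
    private final LocalTime to;
    final Map<String, Object> overrides = new HashMap<>();

    RefinementSketch(String uriRegex, LocalTime from, LocalTime to) {
        this.uriPattern = uriRegex == null ? null : Pattern.compile(uriRegex);
        this.from = from;
        this.to = to;
    }

    boolean applies(String uri, LocalTime now) {
        boolean uriOk = uriPattern == null || uriPattern.matcher(uri).matches();
        // Simplification: the time span is assumed not to cross midnight.
        boolean timeOk = from == null || (!now.isBefore(from) && now.isBefore(to));
        return uriOk && timeOk;
    }
}

// Hypothetical owner (a global or per settings object) with attached refinements.
class RefinedSettingsSketch {
    final Map<String, Object> values = new HashMap<>();
    final List<RefinementSketch> refinements = new ArrayList<>();

    Object lookup(String name, String uri, LocalTime now) {
        // The first applicable refinement that sets the attribute wins.
        for (RefinementSketch r : refinements) {
            if (r.applies(uri, now) && r.overrides.containsKey(name)) {
                return r.overrides.get(name);
            }
        }
        return values.get(name);
    }
}

class RefinementDemo {
    public static void main(String[] args) {
        RefinedSettingsSketch owner = new RefinedSettingsSketch();
        owner.values.put("delayMs", 2000);

        RefinementSketch nightlyPdf =
            new RefinementSketch(".*\\.pdf", LocalTime.of(1, 0), LocalTime.of(5, 0));
        nightlyPdf.overrides.put("delayMs", 500);
        owner.refinements.add(nightlyPdf);

        String pdf = "http://example.org/a.pdf";
        System.out.println(owner.lookup("delayMs", pdf, LocalTime.of(2, 30)));  // 500
        System.out.println(owner.lookup("delayMs", pdf, LocalTime.of(12, 0)));  // 2000
    }
}
```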

5.2. ComplexType hierarchy

All the configurable modules in the crawler subclass ComplexType or one of its descendants. The ComplexType is responsible for keeping the definition of the configurable attributes of the module. The actual values are stored in an instance of DataContainer. The DataContainer is never accessed directly from user code. Instead, the user accesses the attributes through methods in the ComplexType. The attributes are accessed in different ways depending on whether the access comes from the user interface or from inside a running crawl.

When an attribute is accessed from the user interface (either reading or writing), you want to make sure that you are editing the attribute in the right context. When trying to override an attribute, you don't want the settings framework to traverse up to an effective value for the attribute; instead, you want to know that the attribute is not set on this level. To achieve this, there are getLocalAttribute(CrawlerSettings settings, String name) and setAttribute(CrawlerSettings settings, Attribute attribute) methods that take a settings object as a parameter. These methods work only on the supplied settings object. In addition, the methods getAttribute(name) and setAttribute(Attribute attribute) are there for conformance to the Java JMX specification. The latter two always work on the global settings object.

Getting an attribute within a crawl is different in that you always want to get a value, even if it is not set in its context. That means that the settings framework should work its way up the settings hierarchy to find the value in effect for the context. The method getAttribute(String name, CrawlURI uri) should be used to make sure that the right context is used. Figure 4, “Flow of getting an attribute” shows how the settings framework finds the effective value given a context.
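The difference between the two access styles can be sketched with one module object that keeps its attribute definitions in one place and its values in a separate container per settings level. LevelSketch, ConfigurableSketch and their methods are hypothetical names chosen for illustration; they only mimic the roles of CrawlerSettings, ComplexType and DataContainer described above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a settings object (global, per domain or per host).
class LevelSketch {
    final String name;
    final LevelSketch parent;   // null for the global level

    LevelSketch(String name, LevelSketch parent) {
        this.name = name;
        this.parent = parent;
    }
}

// Hypothetical module: definitions and defaults live here, values live in per-level containers.
class ConfigurableSketch {
    private final Map<String, Object> defaults = new HashMap<>();
    private final Map<LevelSketch, Map<String, Object>> containers = new HashMap<>();

    void define(String name, Object defaultValue) {
        defaults.put(name, defaultValue);
    }

    // User-interface style write: touches only the supplied level.
    void setLocal(LevelSketch level, String name, Object value) {
        if (!defaults.containsKey(name)) {
            throw new IllegalArgumentException("undefined attribute: " + name);
        }
        containers.computeIfAbsent(level, k -> new HashMap<>()).put(name, value);
    }

    // User-interface style read: null means "not set on this level".
    Object getLocal(LevelSketch level, String name) {
        Map<String, Object> container = containers.get(level);
        return container == null ? null : container.get(name);
    }

    // Crawl-time read: always yields a value by walking up the hierarchy.
    Object getEffective(LevelSketch level, String name) {
        for (LevelSketch l = level; l != null; l = l.parent) {
            Object value = getLocal(l, name);
            if (value != null) {
                return value;
            }
        }
        return defaults.get(name);
    }
}

class AccessDemo {
    public static void main(String[] args) {
        LevelSketch global = new LevelSketch("global", null);
        LevelSketch host = new LevelSketch("www.example.org", global);

        ConfigurableSketch module = new ConfigurableSketch();
        module.define("timeoutSeconds", 20);
        module.setLocal(global, "timeoutSeconds", 30);

        System.out.println(module.getLocal(host, "timeoutSeconds"));      // null, no override here
        System.out.println(module.getEffective(host, "timeoutSeconds"));  // 30, inherited from global
    }
}
```

An editor for overrides would use the local accessors so that it can show "not set" for the host, while code running inside a crawl would use the effective accessor.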

Figure 4. Flow of getting an attribute


Each attribute has a type. The allowed types all subclass the Type class. There are three main types:

  1. SimpleType

  2. ListType

  3. ComplexType

Except for the SimpleType, the actual type used will be a subclass of one of these main types.

5.2.1. SimpleType

The SimpleType is mainly for representing Java wrappers for the Java primitive types. In addition it also handles the java.util.Date type and a special Heritrix TextField type. Overrides of a SimpleType must be of the same type as the initial default value for the SimpleType.

5.2.2. ListType

The ListType is further subclassed into versions for some of the wrapped Java primitive types (DoubleList, FloatList, IntegerList, LongList, StringList). A list holds its values in the order in which they were added. If an attribute of type ListType is overridden, the complete list of values is replaced at the override level.
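The replace-rather-than-merge behavior for lists can be illustrated with plain Java collections. The helper below is a hypothetical sketch and does not use the Heritrix ListType classes.

```java
import java.util.Arrays;
import java.util.List;

class ListOverrideSketch {
    // Returns the list in effect: the override if one exists, otherwise the global list.
    // The two lists are never merged.
    static List<String> effective(List<String> globalList, List<String> overrideOrNull) {
        return overrideOrNull != null ? overrideOrNull : globalList;
    }

    public static void main(String[] args) {
        List<String> global = Arrays.asList("text/html", "text/plain");
        List<String> hostOverride = Arrays.asList("application/pdf");

        System.out.println(effective(global, hostOverride));  // [application/pdf]
        System.out.println(effective(global, null));          // [text/html, text/plain]
    }
}
```

If a host should keep the global entries and add one more, the override therefore has to repeat the global entries.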

5.2.3. ComplexType

The ComplexType is a map of name/value pairs. The values can be of any Type, including new ComplexTypes. The ComplexType is declared abstract, so you should use one of its subclasses, MapType or ModuleType. The MapType allows new name/value pairs to be added at runtime, while the ModuleType only allows the name/value pairs that it defines at construction time. When overriding a MapType, the options are either to override the value of an already existing attribute or to add a new one. It is not possible to remove an existing attribute in an override. The ModuleType doesn't allow additions in overrides, but the values of the predefined attributes can be overridden. Since the ModuleType is defined at construction time, it is possible to place more restrictions on each attribute than in the MapType. Another consequence of definition at construction time is that you would normally subclass the ModuleType, while the MapType is usable as it is. It is possible to restrict the MapType to only allow attributes of a certain type. There is also a restriction that MapTypes cannot contain nested MapTypes.
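The contrast between an open map and a fixed set of attributes can be sketched as follows. OpenMapSketch and FixedModuleSketch are hypothetical classes that only imitate the behavior described for MapType and ModuleType; they are not the Heritrix classes, and the attribute names are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical open map: new names may be added at any time, optionally restricted to one type.
class OpenMapSketch {
    private final Class<?> elementType;
    private final Map<String, Object> values = new LinkedHashMap<>();

    OpenMapSketch(Class<?> elementType) {
        this.elementType = elementType;
    }

    void put(String name, Object value) {
        if (!elementType.isInstance(value)) {
            throw new IllegalArgumentException(name + " must be a " + elementType.getSimpleName());
        }
        values.put(name, value);   // adds a new pair or overrides an existing one
    }

    Object get(String name) {
        return values.get(name);
    }

    int size() {
        return values.size();
    }
}

// Hypothetical fixed module: the attribute names are decided at construction time.
class FixedModuleSketch {
    private final Map<String, Object> values = new LinkedHashMap<>();

    FixedModuleSketch() {
        values.put("enabled", true);
        values.put("limit", 10);
    }

    void set(String name, Object value) {
        if (!values.containsKey(name)) {
            throw new IllegalArgumentException("unknown attribute: " + name);
        }
        values.put(name, value);   // only predefined attributes can be overridden
    }

    Object get(String name) {
        return values.get(name);
    }
}

class MapVersusModuleDemo {
    public static void main(String[] args) {
        OpenMapSketch filters = new OpenMapSketch(String.class);
        filters.put("images", ".*\\.(gif|png)");   // new names are accepted
        System.out.println(filters.size());          // 1

        FixedModuleSketch module = new FixedModuleSketch();
        module.set("limit", 25);                     // predefined name, accepted
        System.out.println(module.get("limit"));     // 25
    }
}
```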