Package org.archive.crawler.settings

Provides classes for the settings framework.

Interface Summary
ValueErrorHandler If a ValueErrorHandler is registered with a SettingsHandler, only constraints with level Level.SEVERE will throw an InvalidAttributeValueException.
 

Class Summary
ComplexType Superclass of all configurable modules.
Constraint Superclass for constraints that can be set on attribute definitions.
CrawlerSettings Class representing a settings file.
CrawlSettingsSAXHandler A SAX element handler that updates a CrawlerSettings object.
CrawlSettingsSAXSource Class that takes a CrawlerSettings object and creates SAX events from it.
DataContainer This class holds the data for a ComplexType for a settings object.
DoubleList List of Double values.
FloatList List of Float values.
IntegerList List of Integer values.
LegalValueListConstraint A constraint that checks that an attribute value matches one of the items in the list of legal values.
LegalValueTypeConstraint A constraint that checks that an attribute value is of the right type.
ListType<T> Super type for all lists.
LongList List of Long values.
MapType This class represents a container of settings.
ModuleAttributeInfo  
ModuleType Superclass of all modules that should be configurable.
RegularExpressionConstraint A constraint that checks that a value matches a regular expression.
SettingsCache This class keeps a map of host names to settings objects.
SettingsFrameworkTestCase Set up a couple of settings to test different functions of the settings framework.
SettingsHandler An instance of this class holds a hierarchy of settings.
SimpleType A type that holds a Java type.
SoftSettingsHash  
SoftSettingsHash.SettingsEntry The entries in this hash extend SoftReference, using the host string as the key.
StringList List of String values.
TextField Class to hold values for text fields.
Type Interface implemented by all element types.
XMLSettingsHandler A SettingsHandler which uses XML files as persistent storage.
 

Package org.archive.crawler.settings Description

Provides classes for the settings framework.

The settings framework is designed to be a flexible way to configure a crawl, with special treatment for subparts of the web, without adding too much performance overhead.

At its core, the settings framework is a way to keep persistent, context-sensitive configuration settings for any class in the crawler.

All classes in the crawler that have configurable settings subclass ComplexType or one of its descendants. The ComplexType implements the DynamicMBean interface. This gives you a way to ask the object which attributes it supports, and standard methods for getting and setting these attributes.

The entry point into the settings framework is the SettingsHandler. This class is responsible for loading and saving from persistent storage and for interconnecting the different parts of the framework.


Figure 1. Schematic view of the Settings Framework

Settings hierarchy

The settings framework supports a hierarchy of settings. This hierarchy is built from CrawlerSettings objects. At the top there is a settings object representing the global settings. It consists of all the settings that a crawl job needs in order to run. Beneath this global object there is one "per" settings object for each host or domain that has settings which should override the order (the global settings) for that particular host or domain.

When the settings framework is asked for an attribute for a specific host, it first checks whether this attribute is set for this particular host. If it is, the value is returned. If not, it goes up one level at a time until it eventually reaches the order object and returns the global value. If no value is set there either (normally it would be), a hard-coded default value is returned.
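
The lookup described above can be sketched as a walk up a chain of levels. This is a minimal, self-contained model and not the Heritrix API: the class names Level and SettingsLookupSketch, the method effectiveValue, and the attribute names are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the hierarchical lookup; not the Heritrix API.
class SettingsLookupSketch {

    /** One level in the hierarchy: the global settings or a per host/domain override. */
    static class Level {
        final String scope;   // e.g. "" for global, "org", "example.org"
        final Level parent;   // null for the global level
        final Map<String, Object> values = new HashMap<>();

        Level(String scope, Level parent) {
            this.scope = scope;
            this.parent = parent;
        }
    }

    /** Walks from the most specific level up to the global one, then falls back to the default. */
    static Object effectiveValue(Level start, String name, Object hardCodedDefault) {
        for (Level level = start; level != null; level = level.parent) {
            if (level.values.containsKey(name)) {
                return level.values.get(name);
            }
        }
        return hardCodedDefault;
    }

    public static void main(String[] args) {
        Level global = new Level("", null);
        Level domain = new Level("org", global);
        Level host = new Level("example.org", domain);

        global.values.put("max-retries", 30);
        host.values.put("max-retries", 5);

        System.out.println(effectiveValue(host, "max-retries", 10));   // prints 5 (set on the host)
        System.out.println(effectiveValue(domain, "max-retries", 10)); // prints 30 (inherited from global)
        System.out.println(effectiveValue(host, "timeout", 20));       // prints 20 (hard-coded default)
    }
}
```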

All per domain/host settings objects contain only those settings that are to be overridden for that particular domain/host. The convention is to name the top-level object "global settings" and the objects beneath it "per settings" or "overrides" (although the refinements described next also override settings).

To complicate the picture further, there are also settings objects called refinements. An object of this type belongs to a global or per settings object and overrides the settings in its owner's object if some criterion is met. A criterion could be that the URI in question conforms to a regular expression, or that the settings are consulted at a time of day that falls within a given time span.
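
As an illustration of the regular-expression criterion, the following sketch checks a settings object's refinements before falling back to the owner's own value. It is a hypothetical model, not the Heritrix API: the names Refinement, Settings, valueFor, and the attribute "delay-factor" are invented, and it assumes that "conforms to" means a full match of the URI against the pattern.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical model of a regex-based refinement; not the Heritrix API.
class RefinementSketch {

    /** Overrides that apply only when the URI matches a pattern. */
    static class Refinement {
        final Pattern uriPattern;
        final Map<String, Object> values = new HashMap<>();

        Refinement(String regex) {
            this.uriPattern = Pattern.compile(regex);
        }

        boolean appliesTo(String uri) {
            // Assumption: the whole URI must match the expression.
            return uriPattern.matcher(uri).matches();
        }
    }

    /** A global or per settings object that owns zero or more refinements. */
    static class Settings {
        final Map<String, Object> values = new HashMap<>();
        final List<Refinement> refinements = new ArrayList<>();

        /** Returns the first matching refinement's value, else the owner's own value. */
        Object valueFor(String name, String uri) {
            for (Refinement r : refinements) {
                if (r.appliesTo(uri) && r.values.containsKey(name)) {
                    return r.values.get(name);
                }
            }
            return values.get(name);
        }
    }

    public static void main(String[] args) {
        Settings perHost = new Settings();
        perHost.values.put("delay-factor", 5.0);

        Refinement images = new Refinement(".*\\.(jpg|png)$");
        images.values.put("delay-factor", 1.0);
        perHost.refinements.add(images);

        System.out.println(perHost.valueFor("delay-factor", "http://example.org/logo.png"));   // prints 1.0
        System.out.println(perHost.valueFor("delay-factor", "http://example.org/index.html")); // prints 5.0
    }
}
```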

ComplexType hierarchy

All the configurable modules in the crawler subclass ComplexType or one of its descendants. The ComplexType is responsible for keeping the definition of the configurable attributes of the module. The actual values are stored in an instance of DataContainer. The DataContainer is never accessed directly from user code. Instead, the user accesses the attributes through methods in the ComplexType. The attributes are accessed in different ways depending on whether the access comes from the user interface or from inside a running crawl.

When an attribute is accessed from the UI (either reading or writing), you want to make sure that you are editing the attribute in the right context. When trying to override an attribute, you don't want the settings framework to traverse up to the effective value for the attribute; instead you want to know that the attribute is not set on this level. To achieve this, the methods ComplexType.getLocalAttribute(CrawlerSettings settings, String name) and ComplexType.setAttribute(CrawlerSettings settings, Attribute attribute) take a settings object as a parameter. These methods work only on the supplied settings object. In addition, the methods ComplexType.getAttribute(String) and ComplexType.setAttribute(Attribute attribute) exist for conformance with the Java JMX specification. The latter two always work on the global settings object.
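
The difference between level-local access and an effective-value lookup can be illustrated with a small stand-alone model. It does not call the Heritrix classes named above: Level, localValue, and setLocal are hypothetical stand-ins that only mimic the "operate on the supplied settings object" behaviour.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of level-local access; not the Heritrix API.
class LocalAccessSketch {

    static class Level {
        final Level parent;  // null for the global level; never consulted by local access
        final Map<String, Object> values = new HashMap<>();

        Level(Level parent) {
            this.parent = parent;
        }

        /** Returns the value set on exactly this level, or null if it is not set here. */
        Object localValue(String name) {
            return values.get(name);
        }

        /** Sets the value on exactly this level; parents are left untouched. */
        void setLocal(String name, Object value) {
            values.put(name, value);
        }
    }

    public static void main(String[] args) {
        Level global = new Level(null);
        Level perHost = new Level(global);
        global.setLocal("user-agent", "crawler/1.0");

        System.out.println(perHost.localValue("user-agent")); // prints null (not overridden here)
        perHost.setLocal("user-agent", "crawler/1.0 (polite)");
        System.out.println(perHost.localValue("user-agent")); // prints crawler/1.0 (polite)
        System.out.println(global.localValue("user-agent"));  // prints crawler/1.0
    }
}
```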

Getting an attribute within a crawl is different in that you always want to get a value, even if it is not set in its context. That means that the settings framework should work its way up the settings hierarchy to find the value in effect for the context. The method ComplexType.getAttribute(String name, CrawlURI uri) should be used to make sure that the right context is used. Figure 2 shows how the settings framework finds the effective value given a context.


Figure 2. Flow of getting an attribute

Each attribute has a type. All allowed types subclass the Type class. There are three main types:

  1. SimpleType
  2. ListType
  3. ComplexType
Except for the SimpleType, the actual type used will be a subclass of one of these main types.

SimpleType

The SimpleType is mainly for representing Java wrappers for the Java primitive types. In addition it also handles the Date type and a special Heritrix TextField type. Overrides of a SimpleType must be of the same type as the initial default value for the SimpleType.
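
The same-type rule for overrides can be sketched as a class comparison against the default value. This is a hypothetical model rather than the Heritrix implementation: SimpleAttribute, checkOverride, and the attribute name are invented, and the use of IllegalArgumentException is an assumption made for the example.

```java
// Hypothetical model of the same-type rule for overrides; not the Heritrix API.
class SimpleTypeSketch {

    static class SimpleAttribute {
        final String name;
        final Object defaultValue;

        SimpleAttribute(String name, Object defaultValue) {
            this.name = name;
            this.defaultValue = defaultValue;
        }

        /** Accepts an override only if it has the same Java class as the default value. */
        Object checkOverride(Object candidate) {
            if (candidate == null || candidate.getClass() != defaultValue.getClass()) {
                throw new IllegalArgumentException(
                        name + " expects " + defaultValue.getClass().getSimpleName());
            }
            return candidate;
        }
    }

    public static void main(String[] args) {
        SimpleAttribute timeout = new SimpleAttribute("timeout-seconds", Integer.valueOf(20));
        System.out.println(timeout.checkOverride(Integer.valueOf(60))); // prints 60
        try {
            timeout.checkOverride("60"); // a String is not an Integer
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints timeout-seconds expects Integer
        }
    }
}
```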

ListType

The ListType is further subclassed into versions for some of the wrapped Java primitive types (DoubleList, FloatList, IntegerList, LongList, StringList). A list holds values in the same order as they were added. If an attribute of type ListType is overridden, then the complete list of values is replaced at the override level.
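
The replace-not-merge behaviour of list overrides can be shown with plain java.util lists. The sketch below is a hypothetical model, not the Heritrix ListType: effectiveList and the sample values are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of list overriding; not the Heritrix API.
class ListOverrideSketch {

    /** Returns the override list if one is set, otherwise the inherited list. Nothing is merged. */
    static List<String> effectiveList(List<String> inherited, List<String> override) {
        return override != null ? override : inherited;
    }

    public static void main(String[] args) {
        List<String> global = Arrays.asList("a", "b", "c");
        List<String> perHost = Arrays.asList("c", "d");

        System.out.println(effectiveList(global, perHost)); // prints [c, d]
        System.out.println(effectiveList(global, null));    // prints [a, b, c]
    }
}
```

Because the override replaces the whole list, a value such as "a" that should remain in effect for the host has to be repeated in the override list.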

ComplexType

The ComplexType is a map of name/value pairs. The values can be any Type, including new MapTypes. The ComplexType is defined as abstract, and you should use one of the subclasses MapType or ModuleType. The MapType allows new name/value pairs to be added at runtime, while the ModuleType only allows the name/value pairs that it defines at construction time. When overriding the MapType, the options are either to override the value of an already existing attribute or to add a new one. It is not possible to remove an existing attribute in an override. The ModuleType doesn't allow additions in overrides, but the values of the predefined attributes may be overridden. Since the ModuleType is defined at construction time, it is possible to set more restrictions on each attribute than in the MapType. Another consequence of definition at construction time is that you would normally subclass the ModuleType, while the MapType is usable as it is. It is possible to restrict the MapType to only allow attributes of a certain type. There is also a restriction that MapTypes cannot contain nested MapTypes.
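
The contrast between an open container and one fixed at construction time can be sketched as follows. This is a hypothetical model and not the Heritrix MapType or ModuleType: OpenContainer, FixedContainer, their set methods, and the attribute names are invented, and throwing IllegalArgumentException for an undeclared name is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of open versus fixed attribute containers; not the Heritrix API.
class ContainerSketch {

    /** MapType-like: new name/value pairs may be added at runtime. */
    static class OpenContainer {
        final Map<String, Object> attributes = new LinkedHashMap<>();

        void set(String name, Object value) {
            attributes.put(name, value);
        }
    }

    /** ModuleType-like: only names declared at construction time may be set. */
    static class FixedContainer {
        final Map<String, Object> attributes = new LinkedHashMap<>();

        FixedContainer(Map<String, Object> declared) {
            attributes.putAll(declared);
        }

        void set(String name, Object value) {
            if (!attributes.containsKey(name)) {
                throw new IllegalArgumentException("Undeclared attribute: " + name);
            }
            attributes.put(name, value);
        }
    }

    public static void main(String[] args) {
        OpenContainer map = new OpenContainer();
        map.set("a", 1);
        map.set("b", 2);                      // adding a new name is allowed
        System.out.println(map.attributes);   // prints {a=1, b=2}

        Map<String, Object> declared = new LinkedHashMap<>();
        declared.put("enabled", true);
        FixedContainer module = new FixedContainer(declared);
        module.set("enabled", false);         // overriding a declared value is allowed
        try {
            module.set("extra", 1);           // adding a new name is rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints Undeclared attribute: extra
        }
        System.out.println(module.attributes); // prints {enabled=false}
    }
}
```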



Copyright © 2003-2011 Internet Archive. All Rights Reserved.