6. Common needs for all configurable modules

As mentioned earlier, all configurable modules in Heritrix subclass ComplexType (or one of its descendants). When you write your own module, you should inherit from ModuleType, a subclass of ComplexType intended to be the base class for all modules in Heritrix.

6.1. Definition of a module

Heritrix knows how to handle a ComplexType and how to extract the information needed to render its part of the user interface. For this to work, your module has to obey two rules.

  1. A module should always implement a constructor taking exactly one argument - the name argument (see ModuleType(String name)).

  2. All attributes you want to be configurable should be defined in the constructor of the module.

6.1.1. The obligatory one argument constructor

All modules need a constructor taking a String argument. This string is used to identify the module. When a module replaces an existing module of a type of which there can be only one, it is important that the same name is used. In this case the constructor may ignore the name string and substitute a hard-coded one. This is, for example, the case with the Frontier. The name of the Frontier should always be the string "frontier". For this reason the Frontier interface, which all Frontiers should implement, has a static variable:

public static final String ATTR_NAME = "frontier";

which implementations of the Frontier use instead of the string argument passed to the constructor. Here is the part of the default Frontier's constructor that shows how this should be done:

public Frontier(String name) {
    //The 'name' of all frontiers should be the same (Frontier.ATTR_NAME)
    //therefore we'll ignore the supplied parameter. 
    super(Frontier.ATTR_NAME, "HostQueuesFrontier. Maintains the internal" +
        " state of the crawl. It dictates the order in which URIs" +
        " will be scheduled. \nThis frontier is mostly a breadth-first" +
        " frontier, which refrains from emitting more than one" +
        " CrawlURI of the same \'key\' (host) at once, and respects" +
        " minimum-delay and delay-factor specifications for" +
        " politeness.");
    ...
}

As shown in this example, the constructor must call the superclass's constructor. The example also shows how to set the description of a module. The user interface uses the description to guide the user in configuring the crawl. If you don't want to set a description (strongly discouraged), ModuleType also has a one-argument constructor taking just the name.

6.1.2. Defining attributes

The attributes of a module that you want to be configurable must be defined in the module's constructor. For this purpose ComplexType has a method addElementToDefinition(Type type). The argument given to this method is a definition of the attribute. The Type class is the superclass of all the attribute definitions allowed for a ModuleType. Since ComplexType, which ModuleType inherits from, is itself a subclass of Type, you can add new ModuleTypes as attributes to your module. The Type class implements the configuration methods common to all Types that define an attribute on your module. The addElementToDefinition method returns the added Type, which makes it easy to refine the configuration of that Type. Let's look at an example (also from the default Frontier) of an attribute definition.

public final static String ATTR_MAX_OVERALL_BANDWIDTH_USAGE =
        "total-bandwidth-usage-KB-sec";
private final static Integer DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE =
        new Integer(0);
...

Type t;
t = addElementToDefinition(
    new SimpleType(ATTR_MAX_OVERALL_BANDWIDTH_USAGE,
    "The maximum average bandwidth the crawler is allowed to use. " +
    "The actual readspeed is not affected by this setting, it only " +
    "holds back new URIs from being processed when the bandwidth " +
    "usage has been too high.\n0 means no bandwidth limitation.",
    DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE));
t.setOverrideable(false);

Here we add an attribute definition of the SimpleType (which is a subclass of Type). The SimpleType's constructor takes three arguments: name, description and a default value. Usually the name and default value are defined as constants, as they are here, but this is optional. The line t.setOverrideable(false); tells the settings framework not to allow overrides on this attribute. For a full list of methods for configuring a Type, see the Type class.

6.2. Accessing attributes

In most cases when a module needs to access its own attributes, a CrawlURI is available. To make sure that all overrides and refinements are taken into account, use the method getAttribute(String name, CrawlURI uri). Sometimes the context you are working in is defined by an object other than a CrawlURI; in that case use getAttribute(Object context, String name). This method does its best to extract useful context information from the object: it checks whether the context is any kind of URI or a settings object. If it can't find anything useful, the global settings are used as the context. If you have no context at all, as is the case in some initialization code, use getAttribute(String name).
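
The fallback order described above can be illustrated with a small, self-contained model. This is only a sketch and not Heritrix code: the class AttributeLookupSketch, its methods setGlobal and setOverride, the keying of overrides by host name, and the attribute name "max-retries" are all assumptions made for illustration. It also throws NoSuchElementException where the real framework uses AttributeNotFoundException.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical stand-in for the settings framework, NOT a Heritrix class.
class AttributeLookupSketch {
    // Global (crawl-wide) values keyed by attribute name.
    private final Map<String, Object> globals = new HashMap<>();
    // Overrides keyed by host, then by attribute name (assumed keying).
    private final Map<String, Map<String, Object>> overrides = new HashMap<>();

    void setGlobal(String name, Object value) {
        globals.put(name, value);
    }

    void setOverride(String host, String name, Object value) {
        overrides.computeIfAbsent(host, h -> new HashMap<>()).put(name, value);
    }

    // No context at all: only the global settings are consulted.
    Object getAttribute(String name) {
        if (!globals.containsKey(name)) {
            throw new NoSuchElementException("Unknown attribute: " + name);
        }
        return globals.get(name);
    }

    // Context given: use an override if the context is a URI whose host has
    // one; otherwise fall back to the global settings.
    Object getAttribute(Object context, String name) {
        String host = (context instanceof URI) ? ((URI) context).getHost() : null;
        Map<String, Object> hostOverrides =
                (host == null) ? null : overrides.get(host);
        if (hostOverrides != null && hostOverrides.containsKey(name)) {
            return hostOverrides.get(name);
        }
        return getAttribute(name);
    }

    public static void main(String[] args) {
        AttributeLookupSketch settings = new AttributeLookupSketch();
        settings.setGlobal("max-retries", 30);
        settings.setOverride("example.org", "max-retries", 5);

        // The override applies to URIs on example.org.
        System.out.println(settings.getAttribute(
                URI.create("http://example.org/a"), "max-retries"));
        // An unusable context falls back to the global value.
        System.out.println(settings.getAttribute("no context", "max-retries"));
    }
}
```

The point of the model is the order of lookups: the most specific context wins, and the global value is the last resort.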

6.3. Putting together a simple module

From what we have learned so far, let's put together a module that doesn't do anything useful, but shows some of the concepts.

package myModule;

import java.util.logging.Level;
import java.util.logging.Logger;
import javax.management.Attribute;
import javax.management.AttributeNotFoundException;
import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.settings.MapType;
import org.archive.crawler.settings.ModuleType;
import org.archive.crawler.settings.RegularExpressionConstraint;
import org.archive.crawler.settings.SimpleType;
import org.archive.crawler.settings.Type;

public class Foo extends ModuleType {
  private static Logger logger = Logger.getLogger("myModule.Foo"); // (1)

  public Foo(String name) {
    // The constructor must call the superclass's constructor.
    super(name, "Foo. An example module that does nothing useful.");

    Type mySimpleType1 = new SimpleType(
                "name1", "Description1", new Integer(10)); // (2)
    addElementToDefinition(mySimpleType1);

    Type mySimpleType2 = new SimpleType(
                "name2", "Description2", "defaultValue");
    addElementToDefinition(mySimpleType2);
    mySimpleType2.addConstraint(new RegularExpressionConstraint( // (3)
                ".*Val.*", Level.WARNING,
                "This field must contain 'Val' as part of the string."));

    Type myMapType = new MapType("name3", "Description3", String.class); // (4)
    addElementToDefinition(myMapType);
  }

  public void getMyTypeValue(CrawlURI curi) {
    try {
      int maxBandwidthKB = ((Integer) getAttribute("name1", curi)).intValue(); // (5)
    } catch (AttributeNotFoundException e) {
      logger.warning(e.getMessage());
    }
  }

  public void playWithMap(CrawlURI curi) {
    try {
      MapType myMapType = (MapType) getAttribute("name3", curi);
      myMapType.addElement(
              null, new SimpleType("name", "Description", "defaultValue")); // (6)
      myMapType.setAttribute(new Attribute("name", "newValue")); // (7)
    } catch (Exception e) {
      logger.warning(e.getMessage());
    }
  }
}

This example shows several things:

1

One thing that we have not mentioned before is how we do general error logging. Heritrix uses the standard logging facility introduced in Java 1.4 (java.util.logging). The convention is to initialize the logger with the class name.

2

Here we define and add a SimpleType that takes an Integer value, with 10 as the default.

3

It is possible to add constraints on fields. In addition to being constrained to take only strings, this field adds a requirement that the string contain 'Val'. The constraint also has a level and a description. The user interface shows the description to explain why a submitted value does not satisfy the constraint. Three levels are honored:

Level.INFO

Values are accepted even if they don't fulfill the constraint's requirement. This is used when you don't want to disallow the value, but warn the user that the value seems to be out of reasonable bounds.

Level.WARNING

The value must be accepted by the constraint to be valid in crawl jobs, but it is legal in profiles even if the constraint rejects it. This makes it possible to put values into a profile that the user should change for every crawl job derived from the profile.

Level.SEVERE

The value is not allowed whatsoever if it isn't accepted by the constraint.

See the Constraint class for more information.

4

This line defines a MapType allowing only Strings as values.

5

An example of how to read an attribute.

6

Here we add a new element to the MapType. This element is valid for this map because its default value is a String.

7

Now we change the value of the newly added attribute. JMX requires that the new value is wrapped in an object of type Attribute which holds both the name and the new value.
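
The three constraint levels described under callout 3 amount to a small decision rule, which can be sketched as follows. This is an illustration only, not the Constraint class from Heritrix: the class ConstraintLevelSketch and the method isValueAllowed are hypothetical names, and treating any level other than INFO or WARNING like SEVERE is an assumption.

```java
import java.util.logging.Level;

// Hypothetical helper that restates the rules for the three constraint levels.
class ConstraintLevelSketch {

    /**
     * @param level     the level the constraint was created with
     * @param accepted  whether the constraint accepted the submitted value
     * @param inProfile true when editing a profile, false for a crawl job
     * @return true if the value may be stored
     */
    static boolean isValueAllowed(Level level, boolean accepted, boolean inProfile) {
        if (accepted) {
            return true;              // a conforming value is always fine
        }
        if (Level.INFO.equals(level)) {
            return true;              // only warn the user
        }
        if (Level.WARNING.equals(level)) {
            return inProfile;         // legal in profiles, invalid in crawl jobs
        }
        return false;                 // SEVERE (assumed: and anything else)
    }

    public static void main(String[] args) {
        // A value rejected by a WARNING constraint, in a profile and in a job.
        System.out.println(isValueAllowed(Level.WARNING, false, true));  // true
        System.out.println(isValueAllowed(Level.WARNING, false, false)); // false
    }
}
```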

Note

To make your module known to Heritrix, you need to list it in the appropriate src/conf/modules file. For example, if your module is a Processor, it needs to be listed in the Processor.options file. The options files get built into the Heritrix jar.

“A little known fact about Heritrix: When trying to read modules/Processor.options Heritrix will concatenate any such files it finds on the classpath. This means that if you write your own processor and wrap it in a jar you can simply include in that jar a modules/Processor.options file with just the one line needed to add your processor. Then simply add the new jar to the $HERITRIX_HOME/lib directory and you are done. No need to mess with the Heritrix binaries. For an example of how this is done, look at the code for this project: deduplicator” [Kristinn Sigurðsson on the mailing list, 3281].
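
To show what concatenating every copy of a file on the classpath can look like, here is a minimal sketch built on the standard ClassLoader.getResources method. It is not taken from the Heritrix source, and whether Heritrix does it this way is not confirmed here. The class OptionsFileSketch and the method readAllOptions are hypothetical, and reading the files as UTF-8 and skipping blank lines are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper: collects the lines of every classpath resource
// that has the given name, in the order the class loader reports them.
class OptionsFileSketch {

    static List<String> readAllOptions(String resourceName, ClassLoader loader) {
        List<String> lines = new ArrayList<>();
        try {
            // getResources returns one URL per jar or directory on the
            // classpath that contains a resource with this name.
            Enumeration<URL> copies = loader.getResources(resourceName);
            while (copies.hasMoreElements()) {
                URL copy = copies.nextElement();
                try (BufferedReader in = new BufferedReader(new InputStreamReader(
                        copy.openStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        line = line.trim();
                        if (!line.isEmpty()) {
                            lines.add(line); // assumed: blank lines carry no entry
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> processors = readAllOptions(
                "modules/Processor.options", ClassLoader.getSystemClassLoader());
        System.out.println(processors.size() + " processor entries found");
    }
}
```

With a lookup of this kind, a jar dropped into the lib directory contributes its own entries without any change to the files shipped in the Heritrix jar.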

If everything seems ok so far, then we are almost ready to write some real modules.