11. Writing a Processor

All processors extend org.archive.crawler.framework.Processor. In fact, that is a complete class and could be configured as a valid processor; it just wouldn't do anything.

Extending classes need to override the innerProcess(CrawlURI) method to add custom behavior. This method is invoked for each URI being processed, and it is therefore where a processor can act on the URI.

Basically the innerProcess method uses the CrawlURI that is passed to it as a parameter (see Section 11.1, “Accessing and updating the CrawlURI”) and the underlying HttpRecorder (managed by the ToeThread) (see Section 11.2, “The HttpRecorder”) to perform whatever actions are needed.

Fetchers read the CrawlURI, fetch the relevant document and write it to the HttpRecorder. Extractors read the HttpRecorder and add the discovered URIs to the CrawlURI's collections of discovered URIs, and so on. Not all processors need to make use of the HttpRecorder.
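As a starting point, a minimal processor typically looks like the following sketch (the package and class names are illustrative, not part of Heritrix):

package org.archive.crawler.examples; // hypothetical package

import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.framework.Processor;

/**
 * A do-nothing processor skeleton. Only innerProcess() is overridden;
 * all other lifecycle methods keep their default behavior.
 */
public class MyProcessor extends Processor {

    public MyProcessor(String name) {
        // Name and description, as for any Heritrix module.
        super(name, "An example processor that does nothing.");
    }

    protected void innerProcess(CrawlURI curi) {
        // Inspect and/or update the CrawlURI here.
    }
}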

Several other methods can also be optionally overridden for special needs.

11.1. Accessing and updating the CrawlURI

The CrawlURI contains quite a bit of data. For an exhaustive look refer to its Javadoc.

  • getAlist()

    This method returns the CrawlURI's 'AList'.

    Note

    getAlist() has been deprecated. Use the typed accessors/setters instead.

    The AList is basically a hash map. It is used instead of the Java HashMap or Hashtable because it is more efficient (especially when it comes to serializing it). It also has typed methods for adding and getting strings, longs, dates etc.

    Keys for the values and objects placed in it are defined in CoreAttributeConstants.

    It is of course possible to add arbitrary entries, but that requires writing both the module that sets the value and the module that reads it. This may well be desirable in many cases, but where the keys defined in CoreAttributeConstants suffice we strongly recommend using them.

    The following is a quick overview of the most used CoreAttributeConstants:

    • A_CONTENT_TYPE

      Extracted MIME type of fetched content; should be set immediately by the fetching module if possible (rather than waiting for a later analyzer).

    • LINK COLLECTIONS

      There are several Java Collections containing URIs extracted from different sources. Each entry is a Link containing the extracted URI; the URI can be relative. The LinksScoper will read these collections and convert in-scope Links into CandidateURIs, which the FrontierScheduler then adds to the Frontier.

      Note

      CrawlURI has the convenience method addLinkToCollection(link, collection) for adding links to these collections. This method is the preferred way of adding to them.

      • A_CSS_LINKS

        URIs extracted from CSS stylesheets

      • A_HTML_EMBEDS

        URIs believed to be embeds.

      • A_HTML_LINKS

        Regularly discovered URIs. Despite the name, the links could (in theory) have been found by non-HTML extraction as well.

      • A_HTML_SPECULATIVE_EMBEDS

        URIs discovered via aggressive link extraction. They are treated as embeds, but generally with less tolerance for nested embeds.

      • A_HTTP_HEADER_URIS

        URIs discovered in the returned HTTP header. Usually only redirects.

    See the Javadoc for CoreAttributeConstants for more.

  • setHttpRecorder(HttpRecorder)

    Set the HttpRecorder that contains the fetched document. This is generally done by the fetching processor.

  • getHttpRecorder()

    A method to get at the fetched document. See Section 11.2, “The HttpRecorder” for more.

  • getContentSize()

    If a document has been fetched, this method will return its size in bytes.

  • getContentType()

    The content (MIME) type of the fetched document.

For more see the Javadoc for CrawlURI.
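To make these pointers concrete, here is a hedged sketch of a processor that reads and updates a CrawlURI using only the methods discussed above; the class name, the filtering logic and the added URI are purely illustrative, and the Link constants are the same ones used in the example in Section 11.3:

import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.extractor.Link;
import org.archive.crawler.framework.Processor;

public class CrawlURIInspector extends Processor { // hypothetical name

    public CrawlURIInspector(String name) {
        super(name, "Illustrates reading and updating a CrawlURI.");
    }

    protected void innerProcess(CrawlURI curi) {
        // Nothing to do if no document was fetched.
        if (curi.getContentSize() == 0) {
            return;
        }

        // MIME type as set by the fetching module (A_CONTENT_TYPE under the hood).
        String mime = curi.getContentType();
        if (mime == null || !mime.startsWith("text/")) {
            return;
        }

        // Add a (made-up) discovered link to the CrawlURI's link collections.
        // LinksScoper later decides whether it is in scope, and the
        // FrontierScheduler adds the resulting CandidateURI to the Frontier.
        curi.createAndAddLink("http://example.com/", Link.SPECULATIVE_MISC,
            Link.NAVLINK_HOP);
    }
}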

11.2. The HttpRecorder

An HttpRecorder is attached to each CrawlURI that is successfully fetched by the FetchHTTP processor. Despite its name it could be used for non-HTTP transactions if some care is taken. This class will likely be subject to some changes in the future to make it more general.

Basically it pairs together a RecordingInputStream and RecordingOutputStream to capture exactly a single HTTP transaction.

11.2.1. Writing to HttpRecorder

Before a processor can write to the HttpRecorder, it must get a reference to the current thread's HttpRecorder. This is done by invoking the HttpRecorder class's static method getHttpRecorder(), which returns the HttpRecorder for the current thread. Fetchers should then add a reference to it to the CrawlURI via the method discussed above.

Once a processor has the HttpRecorder object it can access its RecordingInputStream via the getRecordedInput() method. The RecordingInputStream extends InputStream and should be used to capture the incoming document.
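A rough sketch of the fetcher side follows. The class name is illustrative, the import paths are assumed to match the Heritrix source tree, and the wiring of the recording stream to the actual network connection is fetcher-specific and therefore omitted:

import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.framework.Processor;
import org.archive.io.RecordingInputStream;
import org.archive.util.HttpRecorder;

public class MyFetcher extends Processor { // hypothetical fetcher

    public MyFetcher(String name) {
        super(name, "A sketch of the recorder handling in a fetcher.");
    }

    protected void innerProcess(CrawlURI curi) {
        // Recorder belonging to the current ToeThread.
        HttpRecorder rec = HttpRecorder.getHttpRecorder();

        // Attach it to the CrawlURI so that later processors can find it.
        curi.setHttpRecorder(rec);

        // Everything read through this stream is captured by the recorder;
        // connecting it to the network response is omitted here.
        RecordingInputStream ris = rec.getRecordedInput();
        // ... read the HTTP response through 'ris' ...
    }
}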

11.2.2. Reading from HttpRecorder

Processors interested in the contents of the HttpRecorder can get at its ReplayCharSequence via its getReplayCharSequence() method. The ReplayCharSequence is basically a java.lang.CharSequence that can be read normally. As discussed above the CrawlURI has a method for getting at the existing HttpRecorder.

11.3. An example processor

The following example is a very simple extractor.

package org.archive.crawler.extractor;

import java.util.regex.Matcher;

import javax.management.AttributeNotFoundException;

import org.archive.crawler.datamodel.CoreAttributeConstants;
import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.framework.Processor;
import org.archive.crawler.settings.SimpleType;
import org.archive.crawler.settings.Type;
import org.archive.crawler.extractor.Link;
import org.archive.util.TextUtils;

/**
 * A very simple extractor. Will assume that any string that matches a 
 * configurable regular expression is a link.
 *
 * @author Kristinn Sigurdsson
 */
public class SimpleExtractor extends Processor
    implements CoreAttributeConstants
{
    public static final String ATTR_REGULAR_EXPRESSION = "input-param";
    public static final String DEFAULT_REGULAR_EXPRESSION = 
        "http://([a-zA-Z0-9]+\\.)+[a-zA-Z0-9]+/"; //Find domains
    
    int numberOfCURIsHandled = 0; 
    int numberOfLinksExtracted = 0;

    public SimpleExtractor(String name) { // 1
        super(name, "A very simple link extractor. Doesn't do anything useful.");
        Type e;
        e = addElementToDefinition(new SimpleType(ATTR_REGULAR_EXPRESSION,
            "The regular expression used to find links in fetched documents",
            DEFAULT_REGULAR_EXPRESSION));
        e.setExpertSetting(true);
    }

    protected void innerProcess(CrawlURI curi) {

        if (!curi.isHttpTransaction()) // 2
        {
            // We only handle HTTP at the moment.
            return;
        }

        numberOfCURIsHandled++; // 3

        CharSequence cs = curi.getHttpRecorder().getReplayCharSequence(); // 4
        String regexpr = null;
        try {
            regexpr = (String)getAttribute(ATTR_REGULAR_EXPRESSION, curi); // 5
        } catch(AttributeNotFoundException e) {
            regexpr = DEFAULT_REGULAR_EXPRESSION;
        }

        Matcher match = TextUtils.getMatcher(regexpr, cs); // 6

        while (match.find()) {
            String link = cs.subSequence(match.start(), match.end()).toString(); // 7
            curi.createAndAddLink(link, Link.SPECULATIVE_MISC, Link.NAVLINK_HOP); // 8
            numberOfLinksExtracted++; // 9
            System.out.println("SimpleExtractor: " + link); // 10
        }

        TextUtils.recycleMatcher(match); // 11
    }

    public String report() { // 12
        StringBuffer ret = new StringBuffer();
        ret.append("Processor: org.archive.crawler.extractor." +
            "SimpleExtractor\n");
        ret.append("  Function:          Example extractor\n");
        ret.append("  CrawlURIs handled: " + numberOfCURIsHandled + "\n");
        ret.append("  Links extracted:   " + numberOfLinksExtracted + "\n\n");

        return ret.toString();
    }
}
1

The constructor. As with any Heritrix module, it sets up the processor's name, description and configurable parameters. In this case the only configurable parameter is the regular expression that will be used to find links. Both a name and a default value are provided for this parameter, and it is marked as an expert setting.

2

Check if the URI was fetched via an HTTP transaction. If not, it is probably a DNS lookup or was not fetched at all; either way, regular link extraction is not possible.

3

If we get this far, we have a URI that the processor will try to extract links from. Bump the URI counter up by one.

4

Get the ReplayCharSequence. Regular expressions can be applied to it directly.

5

Look up the regular expression to use. If the attribute is not found we'll use the default value.

6

Apply the regular expression. We'll use the TextUtils.getMatcher() utility method for performance reasons.

7

Extract a link discovered by the regular expression from the character sequence and store it as a string.

8

Add discovered link to the collection of regular links extracted from the current URI.

9

Increment the count of extracted links.

10

This is a handy debug line that will print each extracted link to the standard output. You would not want this in production code.

11

Free up the matcher object. This too is for performance. See the related javadoc.

12

The report states the name of the processor, its function and the totals of how many URIs were handled and how many links were extracted. A fairly typical report for an extractor.

Even though the example above is fairly simple, the processor nevertheless works as intended.

11.4. Things to keep in mind when writing a processor

11.4.1. Interruptions

Classes extending Processor should not trap InterruptedExceptions; these should be allowed to propagate to the ToeThread executing the processor. Processors should also exit their main method (innerProcess()) immediately if the interrupted flag is set.
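A minimal sketch of what this looks like in practice; the class name and the hasMoreWork()/doUnitOfWork() helpers are purely illustrative stand-ins for whatever the processor actually does:

import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.framework.Processor;

public class InterruptAwareProcessor extends Processor { // hypothetical name

    public InterruptAwareProcessor(String name) {
        super(name, "Illustrates cooperative interruption handling.");
    }

    protected void innerProcess(CrawlURI curi) {
        while (hasMoreWork(curi)) {
            // Exit promptly if the ToeThread has been interrupted.
            if (Thread.currentThread().isInterrupted()) {
                return;
            }
            doUnitOfWork(curi);
            // If anything called here throws InterruptedException, declare it
            // on innerProcess() and let it propagate to the ToeThread rather
            // than catching and swallowing it.
        }
    }

    // Hypothetical stand-ins for the processor's real work.
    private boolean hasMoreWork(CrawlURI curi) { return false; }
    private void doUnitOfWork(CrawlURI curi) { }
}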

11.4.2. One processor, many threads

Only one instance of each processor is created per crawl. As there are multiple threads running, processors must be written carefully so that no conflicts arise. This usually means that class variables cannot be used for anything other than gathering incremental statistics and data.
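The sketch below illustrates the distinction between shared statistics and per-URI state; the class name and the synchronization used for the shared counter are illustrative choices, not something prescribed by Heritrix:

import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.framework.Processor;

public class CountingProcessor extends Processor { // hypothetical name

    // Shared by all ToeThreads: use only for incremental statistics, and
    // guard updates so that concurrent increments are not lost.
    private long handledCount = 0;

    public CountingProcessor(String name) {
        super(name, "Counts the URIs it sees.");
    }

    protected void innerProcess(CrawlURI curi) {
        synchronized (this) {
            handledCount++;
        }

        // Per-URI state must live in local variables, never in instance
        // fields, because many threads execute this method concurrently.
        String mime = curi.getContentType();
        // ... use 'mime' locally; do not stash it in a field ...
    }
}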

There is a facility for having an instance per thread but it has not been tested and will not be covered in this document.