All processors extend org.archive.crawler.framework.Processor. In fact, that is a complete class and could be used as a valid processor; it just doesn't actually do anything.
Extending classes need to override its innerProcess(CrawlURI) method to add custom behavior. This method will be invoked for each URI being processed, and this is therefore where a processor can affect it.
Basically the innerProcess method uses the CrawlURI that is passed to it as a parameter (see Section 11.1, “Accessing and updating the CrawlURI”) and the underlying HttpRecorder (managed by the ToeThread) (see Section 11.2, “The HttpRecorder”) to perform whatever actions are needed.
Fetchers read the CrawlURI, fetch the relevant document and write it to the HttpRecorder. Extractors read the HttpRecorder and add the discovered URIs to the CrawlURI's list of discovered URIs, etc. Not all processors need to make use of the HttpRecorder.
Several other methods can also be optionally overridden for special needs:
initialTasks()
This method will be called after the crawl is set up, but before any URIs are crawled.
Basically it is a place to write initialization code that only needs to be run once at the start of a crawl.
Example: The FetchHTTP processor uses this method to load the cookies file specified in the configuration among other things.
finalTasks()
This method will be called after the last URI that will be processed has been processed; that is, at the very end of a crawl (assuming that the crawl terminates normally).
Basically a place to write finalization code.
Example: The FetchHTTP processor uses it to save cookies gathered during the crawl to a specified file.
report()
This method returns a string containing a human-readable report on the status and/or progress of the processor. This report is accessible via the WUI and is also written to file at the end of a crawl.
Generally, a processor's report would include the number of URIs it has handled, the number of links extracted (for link extractors), or any other arbitrary information of relevance to that processor.
The CrawlURI contains quite a bit of data. For an exhaustive look refer to its Javadoc.
This method returns the CrawlURI's 'AList'. Note that getAList() has been deprecated; the typed accessors/setters should be used instead.
Keys to values and objects placed in it are defined in the CoreAttributeConstants.
It is of course possible to add any arbitrary entry to it, but that requires writing both the module that sets the value and the one that reads it. This may in fact be desirable in many cases, but where the keys defined in CoreAttributeConstants suffice, we strongly recommend using them.
The following is a quick overview of the most used CoreAttributeConstants:
A_CONTENT_TYPE
Extracted MIME type of fetched content; should be set immediately by fetching module if possible (rather than waiting for a later analyzer)
LINK COLLECTIONS
There are several Java Collections containing URIs extracted from different sources. Each link is a Link containing the extracted URI. The URI can be relative. The LinksScoper will read these lists and convert in-scope Links into CandidateURIs for adding to the Frontier by the FrontierScheduler.
CrawlURI has the convenience method addLinkToCollection(link, collection) for adding links to these collections. This method is the preferred way of adding to the collections.
A_CSS_LINKS
URIs extracted from CSS stylesheets
A_HTML_EMBEDS
URIs believed to be embeds.
A_HTML_LINKS
Regularly discovered URIs. Despite the name, the links could (in theory) have been found in non-HTML documents as well.
A_HTML_SPECULATIVE_EMBEDS
URIs discovered via aggressive link extraction. Are treated as embeds but generally with lesser tolerance for nested embeds.
A_HTTP_HEADER_URIS
URIs discovered in the returned HTTP header. Usually only redirects.
See the Javadoc for CoreAttributeConstants for more.
setHttpRecorder(HttpRecorder)
Set the HttpRecorder that contains the fetched document. This is generally done by the fetching processor.
getHttpRecorder()
A method to get at the fetched document. See Section 11.2, “The HttpRecorder” for more.
getContentSize()
If a document has been fetched, this method will return its size in bytes.
getContentType()
The content (MIME) type of the fetched document.
For more see the Javadoc for CrawlURI.
An HttpRecorder is attached to each CrawlURI that is successfully fetched by the FetchHTTP processor. Despite its name, it could be used for non-HTTP transactions if some care is taken. This class will likely be subject to some changes in the future to make it more general.
Basically it pairs together a RecordingInputStream and RecordingOutputStream to capture exactly a single HTTP transaction.
Before it can be written to, a processor (most likely a fetcher) must get a reference to the current thread's HttpRecorder. This is done by invoking the HttpRecorder class' static method getHttpRecorder(), which returns the HttpRecorder for the current thread. Fetchers should then add a reference to it to the CrawlURI via the method discussed above.
Once a processor has the HttpRecorder object it can access its RecordingInputStream stream via the getRecordedInput() method. The RecordingInputStream extends InputStream and should be used to capture the incoming document.
Processors interested in the contents of the HttpRecorder can get at its ReplayCharSequence via its getReplayCharSequence() method. The ReplayCharSequence is basically a java.lang.CharSequence that can be read normally. As discussed above the CrawlURI has a method for getting at the existing HttpRecorder.
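Since a ReplayCharSequence is just a java.lang.CharSequence, applying a regular expression to it works exactly like plain-JDK matching. The following self-contained sketch shows the pattern; the class name CharSequenceExtraction is invented for illustration, and a plain String stands in for a real ReplayCharSequence (it uses java.util.regex directly rather than the TextUtils helpers used by real Heritrix extractors):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharSequenceExtraction {
    // The same kind of domain-finding expression used in the
    // SimpleExtractor example below.
    static final Pattern DOMAIN =
            Pattern.compile("http://([a-zA-Z0-9]+\\.)+[a-zA-Z0-9]+/");

    // Scan any CharSequence (e.g. a ReplayCharSequence) for matches
    // and return them as strings.
    static List<String> extract(CharSequence cs) {
        List<String> links = new ArrayList<String>();
        Matcher m = DOMAIN.matcher(cs);
        while (m.find()) {
            links.add(cs.subSequence(m.start(), m.end()).toString());
        }
        return links;
    }

    public static void main(String[] args) {
        CharSequence doc =
                "<a href=\"http://example.org/page\">x</a> and http://archive.org/";
        System.out.println(extract(doc));
    }
}
```

A real extractor would obtain the CharSequence from curi.getHttpRecorder().getReplayCharSequence() instead of hard-coding it.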
The following example is a very simple extractor.
package org.archive.crawler.extractor;

import java.util.regex.Matcher;

import javax.management.AttributeNotFoundException;

import org.archive.crawler.datamodel.CoreAttributeConstants;
import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.framework.Processor;
import org.archive.crawler.settings.SimpleType;
import org.archive.crawler.settings.Type;
import org.archive.util.TextUtils;

/**
 * A very simple extractor. Will assume that any string that matches a
 * configurable regular expression is a link.
 *
 * @author Kristinn Sigurdsson
 */
public class SimpleExtractor extends Processor
        implements CoreAttributeConstants {

    public static final String ATTR_REGULAR_EXPRESSION = "input-param";
    public static final String DEFAULT_REGULAR_EXPRESSION =
            "http://([a-zA-Z0-9]+\\.)+[a-zA-Z0-9]+/"; // Find domains

    int numberOfCURIsHandled = 0;
    int numberOfLinksExtracted = 0;

    public SimpleExtractor(String name) {
        super(name, "A very simple link extractor. Doesn't do anything useful.");
        Type e;
        e = addElementToDefinition(new SimpleType(ATTR_REGULAR_EXPRESSION,
                "The regular expression used to discover links.",
                DEFAULT_REGULAR_EXPRESSION));
        e.setExpertSetting(true);
    }

    protected void innerProcess(CrawlURI curi) {
        if (!curi.isHttpTransaction()) {
            // We only handle HTTP at the moment.
            return;
        }

        numberOfCURIsHandled++;

        CharSequence cs = curi.getHttpRecorder().getReplayCharSequence();
        String regexpr = null;
        try {
            regexpr = (String)getAttribute(ATTR_REGULAR_EXPRESSION, curi);
        } catch (AttributeNotFoundException e) {
            regexpr = DEFAULT_REGULAR_EXPRESSION;
        }

        Matcher match = TextUtils.getMatcher(regexpr, cs);
        while (match.find()) {
            String link = cs.subSequence(match.start(), match.end()).toString();
            curi.createAndAddLink(link, Link.SPECULATIVE_MISC, Link.NAVLINK_HOP);
            numberOfLinksExtracted++;
            System.out.println("SimpleExtractor: " + link);
        }
        TextUtils.recycleMatcher(match);
    }

    public String report() {
        StringBuffer ret = new StringBuffer();
        ret.append("Processor: org.archive.crawler.extractor." +
                "SimpleExtractor\n");
        ret.append("  Function:          Example extractor\n");
        ret.append("  CrawlURIs handled: " + numberOfCURIsHandled + "\n");
        ret.append("  Links extracted:   " + numberOfLinksExtracted + "\n\n");
        return ret.toString();
    }
}
1. The constructor. As with any Heritrix module, it sets up the processor's name, description and configurable parameters. In this case the only configurable parameter is the regular expression that will be used to find links. Both a name and a default value are provided for this parameter, and it is also marked as an expert setting.
2. Check if the URI was fetched via an HTTP transaction. If not, it is probably a DNS lookup or was not fetched. Either way, regular link extraction is not possible.
3. If we get this far, then we have a URI that the processor will try to extract links from. Bump the URI counter up by one.
4. Get the ReplayCharSequence. Regular expressions can be applied to it directly.
5. Look up the regular expression to use. If the attribute is not found, we'll use the default value.
6. Apply the regular expression. We'll use the TextUtils.getMatcher() utility method for performance reasons.
7. Extract a link discovered by the regular expression from the character sequence and store it as a string.
8. Add the discovered link to the collection of regular links extracted from the current URI.
9. Note that we just discovered another link.
10. This is a handy debug line that will print each extracted link to the standard output. You would not want this in production code.
11. Free up the matcher object. This too is for performance. See the related javadoc.
12. The report states the name of the processor, its function, and the totals of how many URIs were handled and how many links were extracted. A fairly typical report for an extractor.
Even though the example above is fairly simple the processor nevertheless works as intended.
Classes extending Processor should not trap InterruptedExceptions; these should be allowed to propagate to the ToeThread executing the processor. They should also immediately exit their main method (innerProcess()) if the interrupted flag is set.
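This contract can be illustrated with plain JDK classes. The worker class below is hypothetical, standing in for the loop inside a processor's innerProcess(); the point is simply that the interrupted flag is polled and the method returns promptly when it is set:

```java
public class InterruptAwareWorker implements Runnable {
    volatile int unitsProcessed = 0;

    public void run() {
        for (int i = 0; i < 1000000; i++) {
            // Exit promptly when the executing thread has been interrupted,
            // mirroring what innerProcess() is expected to do.
            if (Thread.currentThread().isInterrupted()) {
                return;
            }
            unitsProcessed++;
        }
    }
}
```

Note that Thread.currentThread().isInterrupted() merely reads the flag; a blocking call that throws InterruptedException should, per the guidance above, be left to propagate rather than caught and swallowed.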
For each processor, only one instance is created per crawl. As there are multiple threads running, these processors must be carefully written so that no conflicts arise. This usually means that class variables cannot be used for anything other than gathering incremental statistics and data.
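Because one processor instance is shared by all ToeThreads, even incremental statistics are safer as atomic values than as plain int fields (which can lose increments under concurrent updates, as in the SimpleExtractor example above, where a slightly inaccurate count is tolerable). A minimal sketch, with the class name SharedStats invented for illustration:

```java
import java.util.concurrent.atomic.AtomicLong;

public class SharedStats {
    // One instance is shared by many threads; an AtomicLong makes
    // increments safe without explicit synchronization.
    private final AtomicLong urisHandled = new AtomicLong();

    public void noteUriHandled() {
        urisHandled.incrementAndGet();
    }

    public long getUrisHandled() {
        return urisHandled.get();
    }
}
```

A processor's innerProcess() would call noteUriHandled() per URI, and report() would read getUrisHandled().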
There is a facility for having an instance per thread but it has not been tested and will not be covered in this document.