hcc 0.2.0-200809091854

The Heritrix Cluster Controller (hcc) is a set of packages that enable control of a cluster of heritrix instances running across multiple machines.

Packages
org.archive.hcc  
org.archive.hcc.client  
org.archive.hcc.util  
org.archive.hcc.util.jmx  

 

There are two main components: the controller itself and a client API for accessing it. The controller is essentially a facade with a DynamicMBean interface. Internally it finds all Heritrix resources in a JNDI scope and then proxies all communication to them. It provides a set of attributes and operations that perform the general functions of finding, listing, and invoking operations on single remote instances or groups of them. The client translates the generic MBean interface into an easy-to-use, domain-specific interface, thus simplifying the work of programmers interested in building application-specific extensions instead of working directly against the generic JMX OpenDynamicMBean interface.
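To make the facade idea concrete, here is a minimal sketch using only the standard javax.management API. It is not the hcc implementation: FacadeSketch, Echo, and the describe operation are hypothetical names invented for illustration, and the targets are registered in a local MBeanServer rather than discovered through JNDI. It only shows how a DynamicMBean can fan one generic invoke call out to a group of target beans.

```java
import java.util.ArrayList;
import java.util.List;
import javax.management.Attribute;
import javax.management.AttributeList;
import javax.management.AttributeNotFoundException;
import javax.management.DynamicMBean;
import javax.management.InstanceNotFoundException;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanException;
import javax.management.MBeanInfo;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;
import javax.management.ReflectionException;

public class FacadeSketch implements DynamicMBean {
    /** Hypothetical stand-in for the management interface of one crawler. */
    public interface EchoMBean { String describe(); }

    public static class Echo implements EchoMBean {
        private final String name;
        public Echo(String name) { this.name = name; }
        public String describe() { return "crawler-" + name; }
    }

    private final MBeanServer server;
    private final List<ObjectName> targets;

    public FacadeSketch(MBeanServer server, List<ObjectName> targets) {
        this.server = server;
        this.targets = targets;
    }

    public Object getAttribute(String name) throws AttributeNotFoundException {
        if ("TargetCount".equals(name)) return targets.size();
        throw new AttributeNotFoundException(name);
    }

    public void setAttribute(Attribute attribute) throws AttributeNotFoundException {
        // The sketch exposes no writable attributes.
        throw new AttributeNotFoundException(attribute.getName());
    }

    public AttributeList getAttributes(String[] names) {
        AttributeList list = new AttributeList();
        for (String n : names) {
            try { list.add(new Attribute(n, getAttribute(n))); }
            catch (AttributeNotFoundException ignored) { /* skip unknown names */ }
        }
        return list;
    }

    public AttributeList setAttributes(AttributeList attributes) {
        return new AttributeList(); // nothing is writable
    }

    /** Forwards the operation to every target and collects the results in order. */
    public Object invoke(String op, Object[] params, String[] signature)
            throws MBeanException, ReflectionException {
        List<Object> results = new ArrayList<>();
        for (ObjectName target : targets) {
            try { results.add(server.invoke(target, op, params, signature)); }
            catch (InstanceNotFoundException e) { throw new MBeanException(e); }
        }
        return results;
    }

    public MBeanInfo getMBeanInfo() {
        MBeanAttributeInfo count = new MBeanAttributeInfo(
                "TargetCount", "int", "number of proxied targets", true, false, false);
        return new MBeanInfo(getClass().getName(), "facade sketch",
                new MBeanAttributeInfo[] { count }, null, null, null);
    }

    /** Registers two targets plus the facade, then calls "describe" through the facade. */
    @SuppressWarnings("unchecked")
    public static List<Object> demo() throws Exception {
        MBeanServer server = MBeanServerFactory.newMBeanServer();
        List<ObjectName> targets = new ArrayList<>();
        for (String n : new String[] { "a", "b" }) {
            ObjectName name = new ObjectName("sketch:type=Echo,name=" + n);
            server.registerMBean(new Echo(n), name);
            targets.add(name);
        }
        ObjectName facadeName = new ObjectName("sketch:type=Facade");
        server.registerMBean(new FacadeSketch(server, targets), facadeName);
        return (List<Object>) server.invoke(
                facadeName, "describe", new Object[0], new String[0]);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

A caller only ever talks to the facade's ObjectName; which targets sit behind it is an internal detail, which is the property the client API relies on.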

Requirements

In order to bring up a cluster controller, you need the following pieces:

Because JNDI is essentially a passive service, it must be running BEFORE you try to bring up Heritrix. Be sure to read the instructions in the jndi.properties file in the Heritrix jar to learn how to configure Heritrix to talk to an external JNDI server. Another tip: at least the last time I checked, Heritrix must be executed from the $HERITRIX_HOME directory if you want it to read the jndi.properties file in $HERITRIX_HOME.

We've been using JBoss's JNDI service to fulfill this requirement. The aforementioned jndi.properties file contains instructions for configuring the JBoss JNDI server and client jars.

Getting Started

Setup

So let us assume you have started your Heritrix instance and that it has successfully registered with the JNDI server. (You can verify this using any JNDI viewer. JBoss comes with a web-based JNDI viewer which we've found quite handy.)
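If no JNDI viewer is at hand, the registration can also be checked programmatically. The sketch below is an assumption-laden example, not part of hcc: JndiCheckSketch is a hypothetical class, the factory and package-prefix values are the ones conventionally used with the JBoss naming client (they require the JBoss client jars on the classpath), and the authoritative values for your setup are the ones in the jndi.properties file. Running main requires a JNDI server listening on localhost:1099.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NameClassPair;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;

public class JndiCheckSketch {
    /** Builds a JNDI environment for a JBoss naming server (assumed values). */
    static Hashtable<String, String> jbossEnvironment(String host, int port) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY,
                "org.jnp.interfaces.NamingContextFactory");
        env.put(Context.URL_PKG_PREFIXES, "org.jboss.naming:org.jnp.interfaces");
        env.put(Context.PROVIDER_URL, "jnp://" + host + ":" + port);
        return env;
    }

    /** Returns the names bound directly under the given scope. */
    static List<String> listNames(Context ctx, String scope) throws NamingException {
        List<String> names = new ArrayList<>();
        NamingEnumeration<NameClassPair> bindings = ctx.list(scope);
        try {
            while (bindings.hasMore()) {
                names.add(bindings.next().getName());
            }
        } finally {
            bindings.close();
        }
        return names;
    }

    public static void main(String[] args) throws NamingException {
        Context ctx = new InitialContext(jbossEnvironment("localhost", 1099));
        try {
            // An empty scope lists the root of the naming context.
            System.out.println(listNames(ctx, ""));
        } finally {
            ctx.close();
        }
    }
}
```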

Simple Configuration

It is possible to run the ClusterControllerBean in a JVM separate from the ClusterControllerClient. For the present, let us look at the simplest configuration: the ClusterControllerBean runs in the same JVM as the ClusterControllerClient, and both the JMX bean and the client run on the same box as the JNDI server (which in turn listens on the standard port, 1099).

Note that, to enable the JMX agent for local access, you need to set the system property com.sun.management.jmxremote. See Monitoring and Management Using JMX.

Given the above, your main class should look something like this:

        public static void main(String[] args) {
                // Initialize the cluster controller bean.
                ClusterControllerBean ccbean = new ClusterControllerBean();
                ccbean.init();

                // Obtain a handle to the cluster controller client.
                ClusterControllerClient cc =
                        ClusterControllerClientManager.getDefaultClient();

                // List the crawlers known to the controller.
                Collection crawlers = cc.listCrawlers();

                // Create a new crawler.
                Crawler crawler = cc.createCrawler();

                // etc...
        }
                
        

Once you've initialized the cluster controller bean, you should be able to view it using jconsole or any other standard jmx viewer.

Other Configuration Options

Given the above simple configuration, the assumption is that the ClusterControllerClient will communicate with the ClusterControllerBean (MBean) via a JMX port on the local machine (the default is localhost:8849). If you need to change the host or port of the bean (i.e. you want to run the client on a different machine or in a process separate from the JMX bean), you can do so by specifying the following command-line parameters:

                -Dorg.archive.hcc.client.jmxPort=8850 -Dorg.archive.hcc.client.host=myhost
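For readers who want to see how such overrides typically take effect, the following sketch resolves the two system properties with the defaults stated above and builds a JMX service URL from them. ClientAddressSketch is a hypothetical class, and both the lookup logic and the standard RMI connector URL form are assumptions; hcc's own client may construct its connection differently.

```java
import java.net.MalformedURLException;
import javax.management.remote.JMXServiceURL;

public class ClientAddressSketch {
    static final String HOST_KEY = "org.archive.hcc.client.host";
    static final String PORT_KEY = "org.archive.hcc.client.jmxPort";

    /** Resolves host and port from system properties, falling back to localhost:8849. */
    static JMXServiceURL resolve() throws MalformedURLException {
        String host = System.getProperty(HOST_KEY, "localhost");
        // Integer.getInteger returns the default when the property is unset or malformed.
        int port = Integer.getInteger(PORT_KEY, 8849);
        // Assumed URL form: the standard RMI connector address used by the JDK's JMX agent.
        return new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(resolve());
    }
}
```

Started with the two -D parameters shown above, this resolves to myhost on port 8850; started without them, it resolves to localhost on port 8849.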
        

Other properties pertaining to the HeritrixClusterControllerBean can be specified in the hcc.properties file. The HeritrixClusterControllerBean will attempt to resolve hcc.properties in the following order:

  1. User's home directory (i.e. the user.home system property)
  2. JVM execution directory
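The lookup order above amounts to a first-match search over a list of directories. The sketch below illustrates that pattern; HccPropertiesSketch is a hypothetical class, not the bean's actual loading code. The directories are passed in explicitly so that the order is visible at the call site.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.Properties;

public class HccPropertiesSketch {
    /** Returns the first hcc.properties found in the given directories, in order. */
    static Optional<Path> resolve(List<Path> dirs) {
        for (Path dir : dirs) {
            Path candidate = dir.resolve("hcc.properties");
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    /** Loads the resolved file, or returns empty properties if none exists. */
    static Properties load(List<Path> dirs) throws IOException {
        Properties props = new Properties();
        Optional<Path> file = resolve(dirs);
        if (file.isPresent()) {
            try (InputStream in = Files.newInputStream(file.get())) {
                props.load(in);
            }
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // In Java, user.home is the user's home directory and
        // user.dir is the directory the JVM was started from.
        List<Path> dirs = Arrays.asList(
                Paths.get(System.getProperty("user.home")),
                Paths.get(System.getProperty("user.dir")));
        System.out.println(load(dirs));
    }
}
```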

The hcc.properties File

Currently there are two properties in the hcc.properties file.

org.archive.hcc.ClusterControllerBean.maxPerContainer
    Controls the default maximum number of Heritrix instances per JVM. The maximum can be set explicitly by host:port at runtime using the ClusterControllerClient if you need that level of specificity.

org.archive.hcc.util.OrderJarFactory.settingsDefaultsDir
    Specifies a default settings directory, so you can define default settings that apply to all jobs. Please note that an order.xml file placed in the root directory will be ignored. Any user-defined settings take precedence over defaults set in this directory.
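The relationship between the configured default and a per host:port override can be pictured as a lookup with a fallback. The sketch below is purely illustrative: MaxPerContainerSketch, setMax, and maxFor are hypothetical names rather than ClusterControllerClient methods, and the fallback argument is an assumption, since this document does not state the built-in default.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class MaxPerContainerSketch {
    static final String KEY =
            "org.archive.hcc.ClusterControllerBean.maxPerContainer";

    private final int defaultMax;
    private final Map<String, Integer> overrides = new HashMap<>();

    /** Reads the default from the properties, using the fallback if the key is absent. */
    MaxPerContainerSketch(Properties props, int fallback) {
        String raw = props.getProperty(KEY, String.valueOf(fallback));
        this.defaultMax = Integer.parseInt(raw.trim());
    }

    /** Records an explicit maximum for one container, keyed by host:port. */
    void setMax(String host, int port, int max) {
        overrides.put(host + ":" + port, max);
    }

    /** Returns the explicit maximum for the container, else the default. */
    int maxFor(String host, int port) {
        return overrides.getOrDefault(host + ":" + port, defaultMax);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(KEY, "2");
        MaxPerContainerSketch limits = new MaxPerContainerSketch(props, 1);
        limits.setMax("crawler1", 8849, 5);
        System.out.println(limits.maxFor("crawler1", 8849));
        System.out.println(limits.maxFor("crawler2", 8849));
    }
}
```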

UML

[Figure: Class Diagram]
[Figure: Structure]
[Figure: Add-a-job Sequence Diagram]
[Figure: Initialization Sequence Diagram]
[Figure: Lifecycle State Map]



Copyright © 2005-2008 The Internet Archive. All Rights Reserved.