org.archive.crawler
Class Heritrix

java.lang.Object
  extended by org.archive.crawler.Heritrix
All Implemented Interfaces:
javax.management.DynamicMBean, javax.management.MBeanRegistration

public class Heritrix
extends java.lang.Object
implements javax.management.DynamicMBean, javax.management.MBeanRegistration

Main class for the Heritrix crawler. Heritrix is usually launched by a shell script that backgrounds heritrix and redirects all stdout and stderr emitted by heritrix to a log file. So that startup messages emitted after that redirection still show on the console, this class prints usage or startup output (such as where the web UI can be found) to a STARTLOG file that the shell script is waiting on. As soon as the shell script sees output in this file, it prints its content and breaks out of its wait. See ${HERITRIX_HOME}/bin/heritrix.
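
The wait described above is done by the shell script, not by this class. Purely as an illustration of the handshake, the following sketch polls a file until it has content and then returns that content. The class name, method name, timeout and poll interval are all hypothetical and do not come from Heritrix.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Heritrix: mimics the wrapper script's wait in Java.
final class StartLogWaitSketch {

    /**
     * Polls until the file exists and has content, then returns that content.
     * Returns null if nothing shows up before the timeout.
     */
    static String waitForStartLog(Path startLog, long timeoutMillis, long pollMillis)
            throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (Files.isRegularFile(startLog) && Files.size(startLog) > 0) {
                return new String(Files.readAllBytes(startLog), StandardCharsets.UTF_8);
            }
            if (System.currentTimeMillis() >= deadline) {
                return null; // gave up waiting
            }
            Thread.sleep(pollMillis);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the real start log; the message is a placeholder.
        Path log = Files.createTempFile("startlog-sketch", ".txt");
        Files.write(log, "startup message placeholder\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(waitForStartLog(log, 5000, 100));
        Files.delete(log);
    }
}
```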

Heritrix can also be embedded or launched by webapp initialization or by JMX bootstrapping. So far I count 4 methods of instantiation:

  1. From this class's main -- the method usually used;
  2. From the Heritrix UI (the local-instances.jsp page);
  3. By a JMX agent, at the behest of a remote JMX client; and
  4. By a container such as Tomcat or JBoss.

Author:
gojomo, Kristinn Sigurdsson, Stack

Field Summary
static java.lang.String ADD_CRAWL_JOB_BASEDON_OPER
           
static java.lang.String ADD_CRAWL_JOB_OPER
           
static java.lang.String ADMIN
          Web UI server, realm, context name.
static java.lang.String ALERT_OPER
           
static java.lang.String ALERTCOUNT_ATTR
           
static java.lang.String ARCHIVE_PACKAGE
          The org.archive package
static java.util.List ATTRIBUTE_LIST
           
static boolean commandLine
          Set to true if application is started from command line.
static java.lang.String COMPLETED_JOBS_OPER
           
static java.lang.String CRAWLEND_REPORT_OPER
           
static java.lang.String CRAWLER_PACKAGE
          The crawler package.
static java.lang.String CURRENTJOB_ATTR
           
static java.lang.String DEFAULT_ENCODING
          Default encoding.
static java.lang.String DEFAULT_HERITRIX_OUT
          Heritrix stderr/stdout log file.
static java.lang.String DELETE_CRAWL_JOB_OPER
           
static java.lang.String DESTROY_OPER
           
static boolean gui
          True if we're to put up a GUI.
static java.util.Collection<java.lang.String> guiHosts
          Hosts to bind the GUI webserver to.
static int guiPort
          Port to put the GUI up on.
static java.lang.String HERITRIX_PROPERTIES_PREFIX
          Prefix used on our properties we'll add to the System.properties list.
static java.lang.String INTERRUPT_OPER
           
static java.lang.String ISCRAWLING_ATTR
           
static java.lang.String ISRUNNING_ATTR
           
static java.lang.String JAR_SUFFIX
           
static java.lang.String[] JOB_KEYS
           
static java.lang.String LOG_OPER
           
static java.lang.String NEWALERTCOUNT_ATTR
           
static java.util.List OPERATION_LIST
           
static java.lang.String PENDING_JOBS_OPER
           
static java.lang.String PROPERTIES
          Name of the heritrix properties file.
static java.lang.String PROPERTIES_KEY
          Name of the key to use specifying alternate heritrix properties on command line.
static java.lang.String REBIND_JNDI_OPER
           
static java.lang.String ROOT_CONTEXT
          The root context for a webapp.
static java.lang.String SHUTDOWN_OPER
           
static java.lang.String START_CRAWLING_OPER
           
static java.lang.String START_OPER
           
static java.lang.String STARTLOG
          Heritrix start log file.
static java.lang.String STATUS_ATTR
           
static java.lang.String STOP_CRAWLING_OPER
           
static java.lang.String STOP_OPER
           
static java.lang.String SYSTEM_PREFIX
          Prefix used on other properties we'll add to the System.properties list (after stripping this prefix).
static java.lang.String TERMINATE_CRAWL_JOB_OPER
           
static java.io.File TMPDIR
           
static java.lang.String VERSION_ATTR
           
 
Constructor Summary
Heritrix()
          Constructor.
Heritrix(boolean jmxregister)
           
Heritrix(java.lang.String name, boolean jmxregister)
          Constructor.
Heritrix(java.lang.String name, boolean jmxregister, CrawlJobHandler cjh)
          Constructor.
 
Method Summary
protected  CrawlJob addCrawlJob(CrawlJob job)
           
protected  java.lang.String addCrawlJob(java.io.File order, java.lang.String name, java.lang.String description, java.lang.String seeds)
           
 java.lang.String addCrawlJob(java.lang.String orderPathOrUrl, java.lang.String name, java.lang.String description, java.lang.String seeds)
          This method is called when we have an order file to hand that we want to base a job on.
protected  java.lang.String addCrawlJob(java.net.URL url, java.net.HttpURLConnection connection, java.lang.String name, java.lang.String description, java.lang.String seeds)
           
protected  CrawlJob addCrawlJobBasedOn(java.io.File orderFile, java.lang.String name, java.lang.String description, java.lang.String seeds)
           
 java.lang.String addCrawlJobBasedOn(java.lang.String jobUidOrProfile, java.lang.String name, java.lang.String description, java.lang.String seeds)
           
protected  java.lang.String addCrawlJobBasedonJar(java.io.File jarFile, java.lang.String name, java.lang.String description, java.lang.String seeds)
          Unpack the jar file and use it as the basis for a new job.
protected static javax.management.ObjectName addGuiPort(javax.management.ObjectName name)
           
protected static javax.management.ObjectName addVitals(javax.management.ObjectName name)
          Add vital stats to passed in ObjectName.
protected  javax.management.openmbean.OpenMBeanInfoSupport buildMBeanInfo()
          Build up the MBean info for Heritrix main.
protected  java.lang.String checkForEmptyPlaceHolder(java.lang.String str)
          If passed str has a placeholder for the empty string, return the empty string; else return the original.
protected static void configureTrustStore()
          Configure our trust store.
protected static void containerInitialization()
          Run setup tasks for this 'container'.
protected static void copyToSystemProperty(java.lang.String key, java.lang.String value)
          Copy the given key-value into System properties, as long as there is no existing value.
protected static CrawlJob createCrawlJob(CrawlJobHandler handler, java.io.File crawlOrderFile, java.lang.String name)
           
protected  CrawlJob createCrawlJobBasedOn(java.io.File orderFile, java.lang.String name, java.lang.String description, java.lang.String seeds)
           
protected static void deregisterJndi(javax.management.ObjectName name)
           
 void destroy()
          Do inverse of construction.
protected static java.lang.String doCmdLineArgs(java.lang.String[] args)
           
protected  java.lang.String doOneCrawl(java.lang.String crawlOrderFile)
          Launch the crawler without a web UI and run the passed crawl only.
protected  java.lang.String doOneCrawl(java.lang.String crawlOrderFile, CrawlStatusListener listener)
          Launch the crawler without a web UI and run passed crawl only.
 SinkHandlerLogRecord getAlert(java.lang.String id)
           
 java.util.Vector getAlerts()
           
 int getAlertsCount()
           
 java.lang.Object getAttribute(java.lang.String attribute_name)
           
 javax.management.AttributeList getAttributes(java.lang.String[] attributeNames)
           
static java.io.File getConfdir()
          Get the configuration directory.
static java.io.File getConfdir(boolean fail)
          Get the configuration directory.
protected  java.lang.String getCrawlendReport(java.lang.String jobUid, java.lang.String reportName)
          Return named crawl end report for job with passed uid.
protected static java.io.File getHeritrixHome()
          Exploit -Dheritrix.home if available to us.
static java.lang.String getHeritrixOut()
           
static SimpleHttpServer getHttpServer()
           
static java.util.Map getInstances()
           
static javax.management.ObjectName getJmxObjectName()
           
static javax.management.ObjectName getJmxObjectName(java.lang.String name)
           
static javax.management.ObjectName getJmxObjectName(java.lang.String name, java.lang.String type)
           
protected static javax.management.ObjectName getJndiContainerName()
           
protected static javax.naming.Context getJndiContext()
           
 CrawlJobHandler getJobHandler()
          Get the job handler
static java.io.File getJobsdir()
           
 javax.management.MBeanInfo getMBeanInfo()
           
 javax.management.ObjectName getMBeanName()
           
static javax.management.MBeanServer getMBeanServer()
          Get MBeanServer.
 java.util.Vector getNewAlerts()
           
 int getNewAlertsCount()
           
protected  java.lang.String getNoJmxName()
           
protected static java.io.InputStream getPropertiesInputStream()
           
protected static java.lang.Thread getShutdownThread(boolean sysexit, int exitCode, java.lang.String name)
           
static Heritrix getSingleInstance()
           
 java.lang.String getStatus()
           
protected static java.io.File getSubDir(java.lang.String subdirName)
          Get and check for existence of expected subdir.
protected static java.io.File getSubDir(java.lang.String subdirName, boolean fail)
          Get and optionally check for existence of subdir.
static java.lang.String getVersion()
          Get the heritrix version.
static java.io.File getWarsdir()
           
 java.lang.String interrupt(java.lang.String threadName)
           
 java.lang.Object invoke(java.lang.String operationName, java.lang.Object[] params, java.lang.String[] signature)
           
static boolean isCommandLine()
           
protected static boolean isDevelopment()
           
static boolean isSingleInstance()
           
 boolean isStarted()
           
protected static boolean isValidLoginPasswordString(java.lang.String str)
          Test string is valid login/password string.
 java.lang.String launch()
          Launch the crawler for a web UI.
 java.lang.String launch(java.lang.String crawlOrderFile, boolean runMode)
          Launch the crawler for a web UI.
protected static java.util.Properties loadProperties()
          Load the heritrix.properties file.
static void main(java.lang.String[] args)
          Launch program.
protected  javax.management.openmbean.TabularData makeJobsTabularData(java.util.List jobs)
           
protected static void patchLogging()
          If the user hasn't altered the default logging parameters, tighten them up somewhat: some of our libraries are way too verbose at the INFO or WARNING levels.
static void performHeritrixShutDown()
          Exit program.
static void performHeritrixShutDown(int exitCode)
          Exit program.
 void postDeregister()
           
 void postRegister(java.lang.Boolean registrationDone)
           
 void preDeregister()
           
static void prepareHeritrixShutDown()
          Prepares for program shutdown.
 javax.management.ObjectName preRegister(javax.management.MBeanServer server, javax.management.ObjectName name)
           
 void readAlert(java.lang.String id)
           
protected static void registerContainerJndi()
           
protected static void registerHeritrix(Heritrix h, java.lang.String name, boolean jmxregister)
          Register Heritrix with JNDI, JMX, and with the static hashtable of all Heritrix instances known to this JVM.
protected static void registerJndi(javax.management.ObjectName name)
           
static javax.management.MBeanServer registerMBean(javax.management.MBeanServer server, java.lang.Object objToRegister, javax.management.ObjectName objName)
           
static javax.management.MBeanServer registerMBean(javax.management.MBeanServer server, java.lang.Object objToRegister, java.lang.String name, java.lang.String type)
           
static javax.management.MBeanServer registerMBean(java.lang.Object objToRegister, java.lang.String name, java.lang.String type)
           
 void removeAlert(java.lang.String id)
           
static void resetAuthentication(java.lang.String newUsername, java.lang.String newPassword)
          Replace existing administrator login info with new info.
protected static java.lang.String selftest(java.lang.String oneSelfTestName, int port)
          Run the selftest
 void setAttribute(javax.management.Attribute attribute)
           
 javax.management.AttributeList setAttributes(javax.management.AttributeList attributes)
           
static void shutdown()
           
static void shutdown(int exitCode)
          Shutdown all running heritrix instances and the JVM.
 void start()
          Start Heritrix.
 void startCrawling()
           
protected static java.lang.String startEmbeddedWebserver(java.util.Collection<java.lang.String> hosts, int port, java.lang.String adminLoginPassword)
          Start up the embedded Jetty webserver instance.
protected static java.lang.String startEmbeddedWebserver(int port, boolean lho, java.lang.String adminLoginPassword)
          Deprecated. Use startEmbeddedWebserver(hosts, port, adminLoginPassword)
 void stop()
          Stop Heritrix.
 void stopCrawling()
           
protected static void unregisterHeritrix(Heritrix h)
           
static void unregisterMBean(javax.management.MBeanServer server, javax.management.ObjectName name)
           
static void unregisterMBean(javax.management.MBeanServer server, java.lang.String name, java.lang.String type)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TMPDIR

public static final java.io.File TMPDIR

PROPERTIES

public static final java.lang.String PROPERTIES
Name of the heritrix properties file.

See Also:
Constant Field Values

PROPERTIES_KEY

public static final java.lang.String PROPERTIES_KEY
Name of the key to use specifying alternate heritrix properties on command line.

See Also:
Constant Field Values

HERITRIX_PROPERTIES_PREFIX

public static final java.lang.String HERITRIX_PROPERTIES_PREFIX
Prefix used on our properties we'll add to the System.properties list.

See Also:
Constant Field Values

SYSTEM_PREFIX

public static final java.lang.String SYSTEM_PREFIX
Prefix used on other properties we'll add to the System.properties list (after stripping this prefix).

See Also:
Constant Field Values
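
As a rough illustration of "after stripping this prefix", the sketch below copies every property whose key starts with a given prefix into a target set, with the prefix removed from the key. The prefix value used here is made up, since the actual constant value is not shown on this page, and the helper is hypothetical rather than Heritrix code.

```java
import java.util.Properties;

// Hypothetical helper, not part of Heritrix.
final class StripPrefixSketch {

    /** Copies prefixed entries into target under the key minus the prefix; returns the count copied. */
    static int copyStripped(Properties source, Properties target, String prefix) {
        int copied = 0;
        for (String key : source.stringPropertyNames()) {
            // Skip keys that are nothing but the prefix: stripping would leave an empty key.
            if (key.startsWith(prefix) && key.length() > prefix.length()) {
                target.setProperty(key.substring(prefix.length()), source.getProperty(key));
                copied++;
            }
        }
        return copied;
    }

    public static void main(String[] args) {
        Properties loaded = new Properties();
        // "example.prefix." is a placeholder for the real SYSTEM_PREFIX value.
        loaded.setProperty("example.prefix.http.proxyHost", "proxy.example.org");
        Properties target = new Properties();
        copyStripped(loaded, target, "example.prefix.");
        System.out.println(target.getProperty("http.proxyHost"));
    }
}
```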

STARTLOG

public static final java.lang.String STARTLOG
Heritrix start log file. This file contains standard out produced by this main class for startup only. Used by heritrix shell script. Name here MUST match that in the bin/heritrix shell script. This is a DEPENDENCY the shell wrapper has on this here java heritrix.

See Also:
Constant Field Values

DEFAULT_ENCODING

public static final java.lang.String DEFAULT_ENCODING
Default encoding. Used for content when fetching if none specified.

See Also:
Constant Field Values

DEFAULT_HERITRIX_OUT

public static java.lang.String DEFAULT_HERITRIX_OUT
Heritrix stderr/stdout log file. This file should have nothing in it except messages over which we have no control (JVM stacktrace, 3rd-party lib emissions). The wrapper startup script directs stderr/stdout here. This is an INTERDEPENDENCY this program has with the wrapper shell script. Shell can actually pass us an alternate to use for this file.


ARCHIVE_PACKAGE

public static final java.lang.String ARCHIVE_PACKAGE
The org.archive package

See Also:
Constant Field Values

CRAWLER_PACKAGE

public static final java.lang.String CRAWLER_PACKAGE
The crawler package.


ROOT_CONTEXT

public static final java.lang.String ROOT_CONTEXT
The root context for a webapp.

See Also:
Constant Field Values

commandLine

public static boolean commandLine
Set to true if application is started from command line.


JAR_SUFFIX

public static final java.lang.String JAR_SUFFIX
See Also:
Constant Field Values

gui

public static boolean gui
True if we're to put up a GUI. Cmdline processing can override.


guiPort

public static int guiPort
Port to put the GUI up on. Cmdline processing can override.


guiHosts

public static java.util.Collection<java.lang.String> guiHosts
Hosts to bind the GUI webserver to. By default, only contains localhost. Set to an empty collection to indicate that all available network interfaces should be used for the webserver.
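
One way to read the "empty collection means all interfaces" convention is sketched below using plain java.net types. This is an assumption about how such a collection could be interpreted, not the code Heritrix uses to configure its embedded webserver; the helper name and the port number are placeholders.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Heritrix.
final class GuiHostsSketch {

    /** Maps a host collection to socket addresses; an empty collection yields the wildcard address. */
    static List<InetSocketAddress> bindAddresses(Collection<String> hosts, int port) {
        List<InetSocketAddress> addresses = new ArrayList<InetSocketAddress>();
        if (hosts.isEmpty()) {
            // No explicit hosts: the wildcard address stands for all interfaces.
            addresses.add(new InetSocketAddress(port));
            return addresses;
        }
        for (String host : hosts) {
            addresses.add(new InetSocketAddress(host, port));
        }
        return addresses;
    }

    public static void main(String[] args) {
        int port = 8080; // placeholder port
        System.out.println(bindAddresses(Arrays.asList("127.0.0.1"), port));
        System.out.println(bindAddresses(Collections.<String>emptyList(), port));
    }
}
```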


ADMIN

public static java.lang.String ADMIN
Web UI server, realm, context name.


STATUS_ATTR

public static final java.lang.String STATUS_ATTR
See Also:
Constant Field Values

VERSION_ATTR

public static final java.lang.String VERSION_ATTR
See Also:
Constant Field Values

ISRUNNING_ATTR

public static final java.lang.String ISRUNNING_ATTR
See Also:
Constant Field Values

ISCRAWLING_ATTR

public static final java.lang.String ISCRAWLING_ATTR
See Also:
Constant Field Values

ALERTCOUNT_ATTR

public static final java.lang.String ALERTCOUNT_ATTR
See Also:
Constant Field Values

NEWALERTCOUNT_ATTR

public static final java.lang.String NEWALERTCOUNT_ATTR
See Also:
Constant Field Values

CURRENTJOB_ATTR

public static final java.lang.String CURRENTJOB_ATTR
See Also:
Constant Field Values

ATTRIBUTE_LIST

public static final java.util.List ATTRIBUTE_LIST

START_OPER

public static final java.lang.String START_OPER
See Also:
Constant Field Values

STOP_OPER

public static final java.lang.String STOP_OPER
See Also:
Constant Field Values

DESTROY_OPER

public static final java.lang.String DESTROY_OPER
See Also:
Constant Field Values

INTERRUPT_OPER

public static final java.lang.String INTERRUPT_OPER
See Also:
Constant Field Values

START_CRAWLING_OPER

public static final java.lang.String START_CRAWLING_OPER
See Also:
Constant Field Values

STOP_CRAWLING_OPER

public static final java.lang.String STOP_CRAWLING_OPER
See Also:
Constant Field Values

ADD_CRAWL_JOB_OPER

public static final java.lang.String ADD_CRAWL_JOB_OPER
See Also:
Constant Field Values

TERMINATE_CRAWL_JOB_OPER

public static final java.lang.String TERMINATE_CRAWL_JOB_OPER
See Also:
Constant Field Values

DELETE_CRAWL_JOB_OPER

public static final java.lang.String DELETE_CRAWL_JOB_OPER
See Also:
Constant Field Values

ALERT_OPER

public static final java.lang.String ALERT_OPER
See Also:
Constant Field Values

ADD_CRAWL_JOB_BASEDON_OPER

public static final java.lang.String ADD_CRAWL_JOB_BASEDON_OPER
See Also:
Constant Field Values

PENDING_JOBS_OPER

public static final java.lang.String PENDING_JOBS_OPER
See Also:
Constant Field Values

COMPLETED_JOBS_OPER

public static final java.lang.String COMPLETED_JOBS_OPER
See Also:
Constant Field Values

CRAWLEND_REPORT_OPER

public static final java.lang.String CRAWLEND_REPORT_OPER
See Also:
Constant Field Values

SHUTDOWN_OPER

public static final java.lang.String SHUTDOWN_OPER
See Also:
Constant Field Values

LOG_OPER

public static final java.lang.String LOG_OPER
See Also:
Constant Field Values

REBIND_JNDI_OPER

public static final java.lang.String REBIND_JNDI_OPER
See Also:
Constant Field Values

OPERATION_LIST

public static final java.util.List OPERATION_LIST

JOB_KEYS

public static final java.lang.String[] JOB_KEYS
Constructor Detail

Heritrix

public Heritrix()
         throws java.io.IOException
Constructor. Does not register the created instance with JMX. This constructor is assumed to be used by, for example, a JMX agent creating an instance of Heritrix at the command of a remote client (in which case Heritrix will be registered by the invoking agent).

Throws:
java.io.IOException

Heritrix

public Heritrix(boolean jmxregister)
         throws java.io.IOException
Throws:
java.io.IOException

Heritrix

public Heritrix(java.lang.String name,
                boolean jmxregister)
         throws java.io.IOException
Constructor.

Parameters:
name - If null, we bring up the default Heritrix instance.
jmxregister - True if we are to register this instance with JMX agent.
Throws:
java.io.IOException

Heritrix

public Heritrix(java.lang.String name,
                boolean jmxregister,
                CrawlJobHandler cjh)
         throws java.io.IOException
Constructor.

Parameters:
name - If null, we bring up the default Heritrix instance.
jmxregister - True if we are to register this instance with JMX agent.
cjh - CrawlJobHandler to use.
Throws:
java.io.IOException
Method Detail

containerInitialization

protected static void containerInitialization()
                                       throws java.io.IOException
Run setup tasks for this 'container'. Idempotent.

Throws:
java.io.IOException

destroy

public void destroy()
Do inverse of construction. Used by anyone who does a 'new Heritrix' when they want to clean up the instance. Of note, there may be Heritrix threads still hanging around after the call to destroy completes. They'll eventually go down after they've finished their cleanup routines. In particular, if you are watching Heritrix via JMX, you can see the Heritrix instance JMX bean unregister ahead of the CrawlJob JMX bean that it's hosting.


main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
Launch program. Optionally will launch a web server to host UI. Will also register Heritrix MBean with first found JMX Agent (Usually the 1.5.0 JVM Agent).

Parameters:
args - Command line arguments.
Throws:
java.lang.Exception

doCmdLineArgs

protected static java.lang.String doCmdLineArgs(java.lang.String[] args)
                                         throws java.lang.Exception
Throws:
java.lang.Exception

getHeritrixOut

public static java.lang.String getHeritrixOut()
Returns:
The file we dump stdout and stderr into.

getHeritrixHome

protected static java.io.File getHeritrixHome()
                                       throws java.io.IOException
Exploit -Dheritrix.home if available to us. Is current working dir if no heritrix.home property supplied.

Returns:
Heritrix home directory.
Throws:
java.io.IOException

getJobsdir

public static java.io.File getJobsdir()
                               throws java.io.IOException
Returns:
The directory into which we put jobs. If the system property 'heritrix.jobsdir' is set, we will use its value in place of the default 'jobs' directory in the current working directory.
Throws:
java.io.IOException
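
The lookups described for getHeritrixHome and getJobsdir share one pattern: use a system property if it is supplied, otherwise fall back to a default directory. A minimal sketch of that pattern follows. The property names 'heritrix.home' and 'heritrix.jobsdir' come from the descriptions above; the helper itself is hypothetical, and treating an empty value as "not supplied" is an assumption.

```java
import java.io.File;

// Hypothetical helper, not part of Heritrix.
final class DirFromPropertySketch {

    /** Returns the directory named by the system property, or the fallback if the property is unset. */
    static File dirFromProperty(String propertyName, File fallback) {
        String value = System.getProperty(propertyName);
        // Assumption: an empty value is treated the same as a missing one.
        return (value == null || value.isEmpty()) ? fallback : new File(value);
    }

    public static void main(String[] args) {
        File cwd = new File(System.getProperty("user.dir"));
        File home = dirFromProperty("heritrix.home", cwd);
        File jobs = dirFromProperty("heritrix.jobsdir", new File(cwd, "jobs"));
        System.out.println("home=" + home + " jobs=" + jobs);
    }
}
```

Running with, say, -Dheritrix.jobsdir=/some/path on the java command line would switch the second lookup away from its default.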

getSubDir

protected static java.io.File getSubDir(java.lang.String subdirName)
                                 throws java.io.IOException
Get and check for existence of expected subdir. If development flag set, then look for dir under src dir.

Parameters:
subdirName - Dir to look for.
Returns:
The extant subdir. Otherwise null if we're running in a webapp context where there is no conf directory available.
Throws:
java.io.IOException - if unable to find expected subdir.

getSubDir

protected static java.io.File getSubDir(java.lang.String subdirName,
                                        boolean fail)
                                 throws java.io.IOException
Get and optionally check for existence of subdir. If development flag set, then look for dir under src dir.

Parameters:
subdirName - Dir to look for.
fail - True if we are to fail if the directory does not exist; false if we are to return null when the directory does not exist.
Returns:
The extant subdir. Otherwise null if we're running in a webapp context where there is no subdir directory available.
Throws:
java.io.IOException - if unable to find expected subdir.
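
The effect of the fail flag can be sketched as below: an existing directory is returned, a missing one either raises IOException or yields null. This is a simplified illustration under stated assumptions. It takes the base directory as a parameter and leaves out the development-flag and webapp-context handling mentioned above; the class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not part of Heritrix.
final class SubDirSketch {

    /** Returns base/subdirName if it is a directory; otherwise throws or returns null depending on fail. */
    static File subDir(File base, String subdirName, boolean fail) throws IOException {
        File dir = new File(base, subdirName);
        if (dir.isDirectory()) {
            return dir;
        }
        if (fail) {
            throw new IOException("Cannot find subdir: " + dir.getAbsolutePath());
        }
        return null; // caller asked for a lenient lookup
    }

    public static void main(String[] args) throws IOException {
        File base = new File(System.getProperty("java.io.tmpdir"));
        // Lenient lookup of a directory name that is unlikely to exist.
        System.out.println(subDir(base, "no-such-subdir-sketch", false));
    }
}
```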

isValidLoginPasswordString

protected static boolean isValidLoginPasswordString(java.lang.String str)
Test string is valid login/password string. A valid login/password string has the login and password compounded w/ a ':' delimiter.

Parameters:
str - String to test.
Returns:
True if valid password/login string.
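
A check of this shape might look like the sketch below. The only rule taken from the description is the ':' delimiter; requiring a non-empty login and a non-empty password on either side of it is an assumption, and the actual method may apply different or additional rules.

```java
// Hypothetical helper, not part of Heritrix.
final class LoginPasswordSketch {

    /** True if str looks like "login:password" with both parts non-empty. */
    static boolean isValidLoginPassword(String str) {
        if (str == null) {
            return false;
        }
        int delimiter = str.indexOf(':');
        // Assumption: at least one character before and after the first ':'.
        return delimiter > 0 && delimiter < str.length() - 1;
    }

    public static void main(String[] args) {
        System.out.println(isValidLoginPassword("admin:letmein"));
        System.out.println(isValidLoginPassword("admin"));
    }
}
```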

isDevelopment

protected static boolean isDevelopment()

loadProperties

protected static java.util.Properties loadProperties()
                                              throws java.io.IOException
Load the heritrix.properties file. Adds any property that starts with HERITRIX_PROPERTIES_PREFIX or ARCHIVE_PACKAGE into system properties (except logging '.level' directives).

Returns:
Loaded properties.
Throws:
java.io.IOException
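
The filtering rule described for loadProperties can be illustrated roughly as follows: copy entries whose keys start with one of the given prefixes, but leave out logging '.level' directives. The prefix strings and property keys in the example are placeholders (the real constant values are not shown on this page), reading '.level' directives as "keys ending in .level" is an assumption, and the helper writes to a caller-supplied target instead of the real system properties.

```java
import java.util.Properties;

// Hypothetical helper, not part of Heritrix.
final class PromotePropertiesSketch {

    /** Copies matching, non-logging entries from source to target; returns the count copied. */
    static int promote(Properties source, Properties target, String... prefixes) {
        int copied = 0;
        for (String key : source.stringPropertyNames()) {
            if (key.endsWith(".level")) {
                continue; // assumed shape of a logging level directive
            }
            for (String prefix : prefixes) {
                if (key.startsWith(prefix)) {
                    target.setProperty(key, source.getProperty(key));
                    copied++;
                    break; // count each key once even if several prefixes match
                }
            }
        }
        return copied;
    }

    public static void main(String[] args) {
        Properties loaded = new Properties();
        loaded.setProperty("heritrix.example", "1");            // placeholder key
        loaded.setProperty("org.archive.example.level", "INFO"); // skipped as a logging directive
        Properties target = new Properties();
        System.out.println(promote(loaded, target, "heritrix.", "org.archive."));
    }
}
```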

copyToSystemProperty

protected static void copyToSystemProperty(java.lang.String key,
                                           java.lang.String value)
Copy the given key-value into System properties, as long as there is no existing value.

Parameters:
key - property key
value - property value
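
The "no existing value" rule amounts to a set-if-absent on the system properties. A minimal sketch follows, assuming "no existing value" means System.getProperty returns null; the class and method names are hypothetical.

```java
// Hypothetical helper, not part of Heritrix.
final class CopyToSystemPropertySketch {

    /** Sets the system property only if it has no value yet. */
    static void copyIfAbsent(String key, String value) {
        if (System.getProperty(key) == null) {
            System.setProperty(key, value);
        }
    }

    public static void main(String[] args) {
        copyIfAbsent("sketch.example.key", "first");
        copyIfAbsent("sketch.example.key", "second"); // ignored: a value already exists
        System.out.println(System.getProperty("sketch.example.key"));
    }
}
```

A consequence of this design is that values given with -D on the command line take precedence over values loaded from a properties file.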

getPropertiesInputStream

protected static java.io.InputStream getPropertiesInputStream()
                                                       throws java.io.IOException
Throws:
java.io.IOException

patchLogging

protected static void patchLogging()
                            throws java.lang.SecurityException,
                                   java.io.IOException
If the user hasn't altered the default logging parameters, tighten them up somewhat: some of our libraries are way too verbose at the INFO or WARNING levels. This might be a problem when running inside someone else's container. Containers seem to prefer commons logging, so we are not interfering with them by doing the below.

Throws:
java.io.IOException
java.lang.SecurityException

configureTrustStore

protected static void configureTrustStore()
Configure our trust store. If system property is defined, then use it for our truststore. Otherwise use the heritrix truststore under conf directory if it exists.

If we're not launched from the command line, we will not be able to find our truststore. The truststore is not normally used, so this should rarely be a problem (in the case where we don't find our trust store, we'll use the 'default' -- either the JVM's or the container's).
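
The order of preference described above (explicit system property first, then a truststore file under conf if it exists, else the default) can be sketched as a small selection function. The description does not name the system property; the example assumes the standard JSSE property javax.net.ssl.trustStore. The file name "truststore-placeholder" and the helper itself are hypothetical.

```java
import java.io.File;

// Hypothetical helper, not part of Heritrix.
final class TrustStoreSketch {

    /**
     * Returns the truststore path to use, or null to signal that the
     * JVM's or container's default should be left in place.
     */
    static String chooseTrustStore(String configured, File confDir, String fileName) {
        if (configured != null && !configured.isEmpty()) {
            return configured; // an explicit setting wins
        }
        if (confDir != null) {
            File candidate = new File(confDir, fileName);
            if (candidate.isFile()) {
                return candidate.getAbsolutePath();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed property name and placeholder file name.
        String chosen = chooseTrustStore(
                System.getProperty("javax.net.ssl.trustStore"),
                new File("conf"),
                "truststore-placeholder");
        System.out.println(chosen == null ? "using default trust store" : chosen);
    }
}
```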


selftest

protected static java.lang.String selftest(java.lang.String oneSelfTestName,
                                           int port)
                                    throws java.lang.Exception
Run the selftest

Parameters:
oneSelfTestName - Name of a test if we are to run one only rather than the default running all tests.
port - Port number to use for web UI.
Returns:
Status of how selftest startup went.
Throws:
java.lang.Exception

doOneCrawl

protected java.lang.String doOneCrawl(java.lang.String crawlOrderFile)
                               throws InitializationException,
                                      javax.management.InvalidAttributeValueException
Launch the crawler without a web UI and run the passed crawl only. Specialized version of launch().

Parameters:
crawlOrderFile - The crawl order to crawl.
Returns:
Status string.
Throws:
InitializationException
javax.management.InvalidAttributeValueException

doOneCrawl

protected java.lang.String doOneCrawl(java.lang.String crawlOrderFile,
                                      CrawlStatusListener listener)
                               throws InitializationException,
                                      javax.management.InvalidAttributeValueException
Launch the crawler without a web UI and run passed crawl only. Specialized version of launch().

Parameters:
crawlOrderFile - The crawl order to crawl.
listener - Register this crawl status listener before starting crawl (You can use this listener to notice end-of-crawl).
Returns:
Status string.
Throws:
InitializationException
javax.management.InvalidAttributeValueException

launch

public java.lang.String launch()
                        throws java.lang.Exception
Launch the crawler for a web UI. Crawler hangs around waiting on jobs.

Returns:
A status string describing how the launch went.
Throws:
java.lang.Exception

launch

public java.lang.String launch(java.lang.String crawlOrderFile,
                               boolean runMode)
                        throws java.lang.Exception
Launch the crawler for a web UI. Crawler hangs around waiting on jobs.

Parameters:
crawlOrderFile - File to crawl. May be null.
runMode - Whether crawler should be set to run mode.
Returns:
A status string describing how the launch went.
Throws:
java.lang.Exception

startEmbeddedWebserver

protected static java.lang.String startEmbeddedWebserver(int port,
                                                         boolean lho,
                                                         java.lang.String adminLoginPassword)
                                                  throws java.lang.Exception
Deprecated. Use startEmbeddedWebserver(hosts, port, adminLoginPassword)

Start up the embedded Jetty webserver instance. This is done when we're run from the command-line.

Parameters:
port - Port number to use for web UI.
adminLoginPassword - Compound of login and password.
Returns:
Status on webserver startup.
Throws:
java.lang.Exception

startEmbeddedWebserver

protected static java.lang.String startEmbeddedWebserver(java.util.Collection<java.lang.String> hosts,
                                                         int port,
                                                         java.lang.String adminLoginPassword)
                                                  throws java.lang.Exception
Start up the embedded Jetty webserver instance. This is done when we're run from the command-line.

Parameters:
hosts - a list of IP addresses or hostnames to bind to, or an empty collection to bind to all available network interfaces
port - Port number to use for web UI.
adminLoginPassword - Compound of login and password.
Returns:
Status on webserver startup.
Throws:
java.lang.Exception

resetAuthentication

public static void resetAuthentication(java.lang.String newUsername,
                                       java.lang.String newPassword)
Replace existing administrator login info with new info.

Parameters:
newUsername - new administrator login username
newPassword - new administrator login password

createCrawlJob

protected static CrawlJob createCrawlJob(CrawlJobHandler handler,
                                         java.io.File crawlOrderFile,
                                         java.lang.String name)
                                  throws javax.management.InvalidAttributeValueException
Throws:
javax.management.InvalidAttributeValueException

addCrawlJob

public java.lang.String addCrawlJob(java.lang.String orderPathOrUrl,
                                    java.lang.String name,
                                    java.lang.String description,
                                    java.lang.String seeds)
                             throws java.io.IOException,
                                    FatalConfigurationException
This method is called when we have an order file to hand that we want to base a job on. It leaves the order file in place and just starts up a job that uses whatever locations the order points to for logs, etc.

Parameters:
orderPathOrUrl - Path to an order file or to a seeds file.
name - Name to use for this job.
description -
seeds -
Returns:
A status string.
Throws:
java.io.IOException
FatalConfigurationException

addCrawlJob

protected java.lang.String addCrawlJob(java.net.URL url,
                                       java.net.HttpURLConnection connection,
                                       java.lang.String name,
                                       java.lang.String description,
                                       java.lang.String seeds)
                                throws java.io.IOException,
                                       FatalConfigurationException
Throws:
java.io.IOException
FatalConfigurationException

addCrawlJob

protected java.lang.String addCrawlJob(java.io.File order,
                                       java.lang.String name,
                                       java.lang.String description,
                                       java.lang.String seeds)
                                throws FatalConfigurationException,
                                       java.io.IOException
Throws:
FatalConfigurationException
java.io.IOException

addCrawlJobBasedonJar

protected java.lang.String addCrawlJobBasedonJar(java.io.File jarFile,
                                                 java.lang.String name,
                                                 java.lang.String description,
                                                 java.lang.String seeds)
                                          throws java.io.IOException,
                                                 FatalConfigurationException
Unpack the jar file and use its contents as the basis for a new job.

Parameters:
jarFile - Pointer to file that holds jar.
name - Name to use for new job.
description -
seeds -
Returns:
Message.
Throws:
java.io.IOException
FatalConfigurationException

addCrawlJobBasedOn

public java.lang.String addCrawlJobBasedOn(java.lang.String jobUidOrProfile,
                                           java.lang.String name,
                                           java.lang.String description,
                                           java.lang.String seeds)

addCrawlJobBasedOn

protected CrawlJob addCrawlJobBasedOn(java.io.File orderFile,
                                      java.lang.String name,
                                      java.lang.String description,
                                      java.lang.String seeds)
                               throws FatalConfigurationException
Throws:
FatalConfigurationException

createCrawlJobBasedOn

protected CrawlJob createCrawlJobBasedOn(java.io.File orderFile,
                                         java.lang.String name,
                                         java.lang.String description,
                                         java.lang.String seeds)
                                  throws FatalConfigurationException
Throws:
FatalConfigurationException

addCrawlJob

protected CrawlJob addCrawlJob(CrawlJob job)

startCrawling

public void startCrawling()

stopCrawling

public void stopCrawling()

getVersion

public static java.lang.String getVersion()
Get the heritrix version.

Returns:
The heritrix version. May be null.

getJobHandler

public CrawlJobHandler getJobHandler()
Get the job handler

Returns:
The CrawlJobHandler being used.

getConfdir

public static java.io.File getConfdir()
                               throws java.io.IOException
Get the configuration directory.

Returns:
The conf directory under HERITRIX_HOME or null if none can be found.
Throws:
java.io.IOException

getConfdir

public static java.io.File getConfdir(boolean fail)
                               throws java.io.IOException
Get the configuration directory.

Parameters:
fail - If true, throw an IOException when the directory can't be found; otherwise just return null.
Returns:
The conf directory under HERITRIX_HOME, or null if it can't be found (an IOException is thrown instead if fail is true).
Throws:
java.io.IOException
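
The fail flag described above can be sketched as follows. This is a minimal illustration, not the Heritrix implementation: the class name ConfDirLookup and the explicit home parameter are assumptions made so the example is self-contained, and how HERITRIX_HOME is actually resolved is not shown here.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper; not part of Heritrix.
final class ConfDirLookup {
    /**
     * Look for a "conf" directory under the given home directory.
     * If it is missing, either throw (fail == true) or return null.
     */
    static File confDir(File home, boolean fail) throws IOException {
        File conf = new File(home, "conf");
        if (conf.isDirectory()) {
            return conf;
        }
        if (fail) {
            throw new IOException("Cannot find conf directory under " + home);
        }
        return null;
    }
}
```

Callers that can continue without a configuration directory would pass false and check for null; callers that cannot would pass true and let the exception propagate.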

getHttpServer

public static SimpleHttpServer getHttpServer()
Returns:
Returns the httpServer. May be null if one was not started.

getWarsdir

public static java.io.File getWarsdir()
                               throws java.io.IOException
Returns:
Returns the directory under which reside the WAR files we're to load into the servlet container.
Throws:
java.io.IOException

prepareHeritrixShutDown

public static void prepareHeritrixShutDown()
Prepares for program shutdown. This method does its best to prepare the program so that it can exit normally. It will kill the httpServer and terminate any running job.
It is advisable to wait briefly (about 1000 milliseconds) after calling this method and before calling performHeritrixShutDown(), to allow as many threads as possible to finish what they are doing.
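
The prepare, wait, then exit sequence can be sketched as below. The class TwoPhaseShutdown and its parameters are hypothetical; the two steps are passed in as Runnables so the example runs without Heritrix on the classpath and without exiting the JVM.

```java
// Hypothetical helper; not part of Heritrix.
final class TwoPhaseShutdown {
    /**
     * Run the prepare step, give worker threads a grace period to finish,
     * then run the final step.
     */
    static void shutdown(Runnable prepare, long graceMillis, Runnable perform)
            throws InterruptedException {
        prepare.run();             // e.g. stop the HTTP server and running jobs
        Thread.sleep(graceMillis); // let threads finish what they are doing
        perform.run();             // e.g. exit the program
    }
}
```

Assuming the static no-argument methods documented here, a caller might write TwoPhaseShutdown.shutdown(Heritrix::prepareHeritrixShutDown, 1000L, Heritrix::performHeritrixShutDown).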


performHeritrixShutDown

public static void performHeritrixShutDown()
Exit the program. It is recommended that prepareHeritrixShutDown() be invoked before this method.


performHeritrixShutDown

public static void performHeritrixShutDown(int exitCode)
Exit the program. It is recommended that prepareHeritrixShutDown() be invoked before this method.

Parameters:
exitCode - Code to pass to System.exit.

shutdown

public static void shutdown(int exitCode)
Shutdown all running heritrix instances and the JVM. Assumes stop has already been called.

Parameters:
exitCode - Exit code to pass to System.exit.

getShutdownThread

protected static java.lang.Thread getShutdownThread(boolean sysexit,
                                                    int exitCode,
                                                    java.lang.String name)

shutdown

public static void shutdown()

registerHeritrix

protected static void registerHeritrix(Heritrix h,
                                       java.lang.String name,
                                       boolean jmxregister)
                                throws javax.management.MalformedObjectNameException,
                                       javax.management.InstanceAlreadyExistsException,
                                       javax.management.MBeanRegistrationException,
                                       javax.management.NotCompliantMBeanException
Register Heritrix with JNDI, JMX, and with the static hashtable of all Heritrix instances known to this JVM. If launched from the command line, register the Heritrix MBean if there is an agent to register with. Usually this method only has an effect if we're running in a 1.5.0 JDK and command line options such as '-Dcom.sun.management.jmxremote.port=8082 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false' are supplied. See Monitoring and Management Using JMX for more on the command line options and how to connect to the Heritrix bean using the JDK 1.5.0 jconsole tool. We currently register with the first server we find (TODO: Make configurable).

If we register successfully with a JMX agent, then part of the registration will include our registering ourselves with JNDI.

Finally, add the Heritrix instance to the hashtable of all the Heritrix instances in the current VM. This latter registration happens whether or not there is a JMX agent to register with. We keep this list for convenience, so that it is easy to iterate over all instances and call stop when the main application is going down.

Parameters:
h - Instance of heritrix to register.
name - Name to use for this Heritrix instance.
jmxregister - True if we are to register this instance with JMX.
Throws:
java.lang.NullPointerException
javax.management.MalformedObjectNameException
javax.management.NotCompliantMBeanException
javax.management.MBeanRegistrationException
javax.management.InstanceAlreadyExistsException
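
The pattern described above (optional JMX registration, plus a JVM-wide table that is always updated) might look roughly like the following sketch. InstanceRegistry, its map, and the org.example.crawler domain are assumptions for illustration only; the JNDI step is omitted, and unlike the description above, this sketch skips the map update if JMX registration throws.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical registry; not the Heritrix implementation.
final class InstanceRegistry {
    /** All instances known to this JVM, keyed by name. */
    static final Map<String, Object> INSTANCES = new ConcurrentHashMap<>();

    /**
     * Register with JMX when asked to and a server is available, then
     * record the instance in the JVM-wide map.
     * Returns true if the instance was registered with JMX.
     */
    static boolean register(Object instance, String name,
                            MBeanServer server, boolean jmxRegister)
            throws JMException {
        boolean inJmx = false;
        if (jmxRegister && server != null) {
            // Assumed domain and key; the real ObjectName layout may differ.
            ObjectName objName = new ObjectName("org.example.crawler", "name", name);
            server.registerMBean(instance, objName);
            inJmx = true;
        }
        INSTANCES.put(name, instance);
        return inJmx;
    }
}
```

Keeping the map separate from JMX means shutdown code can iterate over every instance even when no agent was ever found.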

unregisterHeritrix

protected static void unregisterHeritrix(Heritrix h)
                                  throws javax.management.InstanceNotFoundException,
                                         javax.management.MBeanRegistrationException,
                                         java.lang.NullPointerException
Throws:
javax.management.InstanceNotFoundException
javax.management.MBeanRegistrationException
java.lang.NullPointerException

getMBeanServer

public static javax.management.MBeanServer getMBeanServer()
Get MBeanServer. Currently uses the first MBeanServer found. This will definitely not always be what is wanted. TODO: Make which server settable. Also, if none, put up our own MBeanServer.

Returns:
An MBeanServer to register with or null.
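
A "first server found, else null" lookup can be written with the standard MBeanServerFactory API, as in this sketch. The class name FirstMBeanServer is an assumption, and whether Heritrix performs the lookup exactly this way is not confirmed by this page.

```java
import java.util.List;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;

// Hypothetical helper; not part of Heritrix.
final class FirstMBeanServer {
    /** Return the first MBeanServer known to this JVM, or null if there is none. */
    static MBeanServer firstOrNull() {
        // Passing null lists every server created through the factory in this JVM.
        List<MBeanServer> servers = MBeanServerFactory.findMBeanServer(null);
        return servers.isEmpty() ? null : servers.get(0);
    }
}
```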

registerMBean

public static javax.management.MBeanServer registerMBean(java.lang.Object objToRegister,
                                                         java.lang.String name,
                                                         java.lang.String type)
                                                  throws javax.management.InstanceAlreadyExistsException,
                                                         javax.management.MBeanRegistrationException,
                                                         javax.management.NotCompliantMBeanException
Throws:
javax.management.InstanceAlreadyExistsException
javax.management.MBeanRegistrationException
javax.management.NotCompliantMBeanException

registerMBean

public static javax.management.MBeanServer registerMBean(javax.management.MBeanServer server,
                                                         java.lang.Object objToRegister,
                                                         java.lang.String name,
                                                         java.lang.String type)
                                                  throws javax.management.InstanceAlreadyExistsException,
                                                         javax.management.MBeanRegistrationException,
                                                         javax.management.NotCompliantMBeanException
Throws:
javax.management.InstanceAlreadyExistsException
javax.management.MBeanRegistrationException
javax.management.NotCompliantMBeanException

registerMBean

public static javax.management.MBeanServer registerMBean(javax.management.MBeanServer server,
                                                         java.lang.Object objToRegister,
                                                         javax.management.ObjectName objName)
                                                  throws javax.management.InstanceAlreadyExistsException,
                                                         javax.management.MBeanRegistrationException,
                                                         javax.management.NotCompliantMBeanException
Throws:
javax.management.InstanceAlreadyExistsException
javax.management.MBeanRegistrationException
javax.management.NotCompliantMBeanException

unregisterMBean

public static void unregisterMBean(javax.management.MBeanServer server,
                                   java.lang.String name,
                                   java.lang.String type)

unregisterMBean

public static void unregisterMBean(javax.management.MBeanServer server,
                                   javax.management.ObjectName name)

getNoJmxName

protected java.lang.String getNoJmxName()
Returns:
Name to use when no JMX agent available.

getJmxObjectName

public static javax.management.ObjectName getJmxObjectName()
                                                    throws javax.management.MalformedObjectNameException,
                                                           java.lang.NullPointerException
Throws:
javax.management.MalformedObjectNameException
java.lang.NullPointerException

getJmxObjectName

public static javax.management.ObjectName getJmxObjectName(java.lang.String name)
                                                    throws javax.management.MalformedObjectNameException,
                                                           java.lang.NullPointerException
Throws:
javax.management.MalformedObjectNameException
java.lang.NullPointerException

getJmxObjectName

public static javax.management.ObjectName getJmxObjectName(java.lang.String name,
                                                           java.lang.String type)
                                                    throws javax.management.MalformedObjectNameException,
                                                           java.lang.NullPointerException
Throws:
javax.management.MalformedObjectNameException
java.lang.NullPointerException

isCommandLine

public static boolean isCommandLine()
Returns:
Returns true if Heritrix was launched from the command line. (When launched from the command line, we do things like put up a web server to manage our web interface, and we register ourselves with the first available JMX agent.)

isStarted

public boolean isStarted()
Returns:
True if heritrix has been started.

getStatus

public java.lang.String getStatus()

getAlertsCount

public int getAlertsCount()

getNewAlertsCount

public int getNewAlertsCount()

getAlerts

public java.util.Vector getAlerts()

getNewAlerts

public java.util.Vector getNewAlerts()

getAlert

public SinkHandlerLogRecord getAlert(java.lang.String id)

readAlert

public void readAlert(java.lang.String id)

removeAlert

public void removeAlert(java.lang.String id)

start

public void start()
Start Heritrix. Used by JMX and webapp initialization for starting Heritrix; not used by the Heritrix launched from the command line. Idempotent. If start is called by JMX, then the new instance of Heritrix is automatically registered with the JMX agent. If started by a webapp, the new Heritrix instance needs to be registered.
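
One common way to make a start method idempotent is a compare-and-set guard, sketched below. IdempotentStart and its counter are hypothetical and only illustrate the property; the actual start logic of Heritrix is not shown on this page.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration of an idempotent start/stop pair.
final class IdempotentStart {
    private final AtomicBoolean started = new AtomicBoolean(false);
    private int startupRuns;

    /** Run the startup work only on the first call; later calls are no-ops. */
    void start() {
        if (!started.compareAndSet(false, true)) {
            return; // already started
        }
        startupRuns++; // stands in for the real startup work
    }

    void stop() {
        started.set(false);
    }

    boolean isStarted() {
        return started.get();
    }

    int startupRuns() {
        return startupRuns;
    }
}
```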


stop

public void stop()
Stop Heritrix. Used by JMX and webapp initialization for stopping Heritrix.


interrupt

public java.lang.String interrupt(java.lang.String threadName)

buildMBeanInfo

protected javax.management.openmbean.OpenMBeanInfoSupport buildMBeanInfo()
Build up the MBean info for Heritrix main.

Returns:
Return created mbean info instance.

getAttribute

public java.lang.Object getAttribute(java.lang.String attribute_name)
                              throws javax.management.AttributeNotFoundException
Specified by:
getAttribute in interface javax.management.DynamicMBean
Throws:
javax.management.AttributeNotFoundException

setAttribute

public void setAttribute(javax.management.Attribute attribute)
                  throws javax.management.AttributeNotFoundException
Specified by:
setAttribute in interface javax.management.DynamicMBean
Throws:
javax.management.AttributeNotFoundException

getAttributes

public javax.management.AttributeList getAttributes(java.lang.String[] attributeNames)
Specified by:
getAttributes in interface javax.management.DynamicMBean

setAttributes

public javax.management.AttributeList setAttributes(javax.management.AttributeList attributes)
Specified by:
setAttributes in interface javax.management.DynamicMBean

invoke

public java.lang.Object invoke(java.lang.String operationName,
                               java.lang.Object[] params,
                               java.lang.String[] signature)
                        throws javax.management.ReflectionException
Specified by:
invoke in interface javax.management.DynamicMBean
Throws:
javax.management.ReflectionException
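
For readers unfamiliar with the DynamicMBean contract implemented by the five methods above, here is a minimal standalone bean with one read-only attribute and two operations. StatusBean, its Status attribute, and its start and stop operations are invented for the example and are not the attribute or operation names Heritrix exposes.

```java
import javax.management.Attribute;
import javax.management.AttributeList;
import javax.management.AttributeNotFoundException;
import javax.management.DynamicMBean;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanInfo;
import javax.management.MBeanOperationInfo;
import javax.management.ReflectionException;

// Hypothetical bean; names are illustrative only.
final class StatusBean implements DynamicMBean {
    private boolean started;

    @Override
    public Object getAttribute(String name) throws AttributeNotFoundException {
        if ("Status".equals(name)) {
            return started ? "started" : "stopped";
        }
        throw new AttributeNotFoundException(name);
    }

    @Override
    public void setAttribute(Attribute attribute) throws AttributeNotFoundException {
        // No writable attributes in this sketch.
        throw new AttributeNotFoundException(attribute.getName());
    }

    @Override
    public AttributeList getAttributes(String[] names) {
        AttributeList result = new AttributeList();
        for (String name : names) {
            try {
                result.add(new Attribute(name, getAttribute(name)));
            } catch (AttributeNotFoundException e) {
                // Unknown attributes are left out of the result.
            }
        }
        return result;
    }

    @Override
    public AttributeList setAttributes(AttributeList attributes) {
        return new AttributeList(); // nothing was set
    }

    @Override
    public Object invoke(String operationName, Object[] params, String[] signature)
            throws ReflectionException {
        if ("start".equals(operationName)) {
            started = true;
            return null;
        }
        if ("stop".equals(operationName)) {
            started = false;
            return null;
        }
        throw new ReflectionException(new NoSuchMethodException(operationName));
    }

    @Override
    public MBeanInfo getMBeanInfo() {
        MBeanAttributeInfo[] attrs = {
            new MBeanAttributeInfo("Status", "java.lang.String",
                "Current status", true, false, false)
        };
        MBeanOperationInfo[] ops = {
            new MBeanOperationInfo("start", "Start", null, "void",
                MBeanOperationInfo.ACTION),
            new MBeanOperationInfo("stop", "Stop", null, "void",
                MBeanOperationInfo.ACTION)
        };
        return new MBeanInfo(StatusBean.class.getName(), "Status sketch",
            attrs, null, ops, null);
    }
}
```

A JMX client sees only what getMBeanInfo advertises, so attributes and operations handled in getAttribute and invoke should also be listed there.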

getCrawlendReport

protected java.lang.String getCrawlendReport(java.lang.String jobUid,
                                             java.lang.String reportName)
                                      throws java.io.IOException
Return the named crawl end report for the job with the passed uid. The crawler makes reports when it has finished its crawl. Use this method to get a String version of one of these files.

Parameters:
jobUid - The unique ID for the job whose reports you want to see (Must be a completed job).
reportName - Name of report minus '.txt' (e.g. crawl-report).
Returns:
String version of the on-disk report.
Throws:
java.io.IOException
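
Reading a report by its name minus the '.txt' suffix can be sketched as follows. ReportReader and the jobDir parameter are assumptions; how Heritrix maps a job uid to a directory on disk is not described here, and UTF-8 is an assumed encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of Heritrix.
final class ReportReader {
    /** Read jobDir/reportName.txt into a String. */
    static String readReport(Path jobDir, String reportName) throws IOException {
        Path report = jobDir.resolve(reportName + ".txt");
        return new String(Files.readAllBytes(report), StandardCharsets.UTF_8);
    }
}
```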

makeJobsTabularData

protected javax.management.openmbean.TabularData makeJobsTabularData(java.util.List jobs)
                                                              throws javax.management.openmbean.OpenDataException
Throws:
javax.management.openmbean.OpenDataException

checkForEmptyPlaceHolder

protected java.lang.String checkForEmptyPlaceHolder(java.lang.String str)
If the passed str holds a placeholder for the empty string, return the empty string; otherwise return the original. Dumb JMX clients can't pass an empty string, so they pass a representation of the empty string such as ' ' or '-'. Convert such strings to the empty string.

Parameters:
str - String to check.
Returns:
Original str or empty string if str contains a placeholder for the empty-string (e.g. '-', or ' ').
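
A sketch of this normalization is shown below. EmptyPlaceholder is a hypothetical class, and it assumes the placeholder must match the whole string exactly ('-' or a single space); the real method may accept other representations.

```java
// Hypothetical helper; not the Heritrix implementation.
final class EmptyPlaceholder {
    /** Map the placeholders "-" and " " to the empty string; pass anything else through. */
    static String normalize(String str) {
        if (str == null) {
            return null;
        }
        boolean placeholder = str.equals("-") || str.equals(" ");
        return placeholder ? "" : str;
    }
}
```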

getMBeanInfo

public javax.management.MBeanInfo getMBeanInfo()
Specified by:
getMBeanInfo in interface javax.management.DynamicMBean

getMBeanName

public javax.management.ObjectName getMBeanName()
Returns:
Name under which this instance is registered in JMX (only available after JMX registration).

preRegister

public javax.management.ObjectName preRegister(javax.management.MBeanServer server,
                                               javax.management.ObjectName name)
                                        throws java.lang.Exception
Specified by:
preRegister in interface javax.management.MBeanRegistration
Throws:
java.lang.Exception

addVitals

protected static javax.management.ObjectName addVitals(javax.management.ObjectName name)
                                                throws java.net.UnknownHostException,
                                                       javax.management.MalformedObjectNameException,
                                                       java.lang.NullPointerException
Add vital stats to passed in ObjectName.

Parameters:
name - ObjectName to add to.
Returns:
name with host, guiport, and jmxport added.
Throws:
java.net.UnknownHostException
javax.management.MalformedObjectNameException
java.lang.NullPointerException
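
Adding the host, guiport, and jmxport keys to an existing ObjectName can be done with the standard javax.management API, as in this sketch. ObjectNameVitals is a hypothetical class, and the values are passed in explicitly because this page does not say how Heritrix discovers them.

```java
import java.util.Hashtable;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper; not part of Heritrix.
final class ObjectNameVitals {
    /** Return a copy of name with host, guiport, and jmxport key properties added. */
    static ObjectName withVitals(ObjectName name, String host, int guiPort, int jmxPort)
            throws MalformedObjectNameException {
        // ObjectName is immutable, so copy its keys and build a new name.
        Hashtable<String, String> keys = new Hashtable<>(name.getKeyPropertyList());
        keys.put("host", host);
        keys.put("guiport", Integer.toString(guiPort));
        keys.put("jmxport", Integer.toString(jmxPort));
        return new ObjectName(name.getDomain(), keys);
    }
}
```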

addGuiPort

protected static javax.management.ObjectName addGuiPort(javax.management.ObjectName name)
                                                 throws javax.management.MalformedObjectNameException,
                                                        java.lang.NullPointerException
Throws:
javax.management.MalformedObjectNameException
java.lang.NullPointerException

postRegister

public void postRegister(java.lang.Boolean registrationDone)
Specified by:
postRegister in interface javax.management.MBeanRegistration

preDeregister

public void preDeregister()
                   throws java.lang.Exception
Specified by:
preDeregister in interface javax.management.MBeanRegistration
Throws:
java.lang.Exception

postDeregister

public void postDeregister()
Specified by:
postDeregister in interface javax.management.MBeanRegistration
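
The four MBeanRegistration callbacks above are invoked by the MBean server around registration and unregistration. The sketch below records the order in which they fire; LifecycleBean and the org.example.sketch domain are invented for the example, and the DynamicMBean methods are stubbed out because only the lifecycle is of interest.

```java
import java.util.ArrayList;
import java.util.List;
import javax.management.Attribute;
import javax.management.AttributeList;
import javax.management.AttributeNotFoundException;
import javax.management.DynamicMBean;
import javax.management.MBeanInfo;
import javax.management.MBeanRegistration;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;
import javax.management.ReflectionException;

// Hypothetical bean that only records its registration lifecycle.
final class LifecycleBean implements DynamicMBean, MBeanRegistration {
    final List<String> events = new ArrayList<>();

    @Override
    public ObjectName preRegister(MBeanServer server, ObjectName name) {
        events.add("preRegister");
        return name; // keep the name the caller asked for
    }

    @Override
    public void postRegister(Boolean registrationDone) {
        events.add("postRegister:" + registrationDone);
    }

    @Override
    public void preDeregister() {
        events.add("preDeregister");
    }

    @Override
    public void postDeregister() {
        events.add("postDeregister");
    }

    // Stubbed DynamicMBean methods: no attributes, no operations.
    @Override
    public Object getAttribute(String name) throws AttributeNotFoundException {
        throw new AttributeNotFoundException(name);
    }

    @Override
    public void setAttribute(Attribute attribute) throws AttributeNotFoundException {
        throw new AttributeNotFoundException(attribute.getName());
    }

    @Override
    public AttributeList getAttributes(String[] names) {
        return new AttributeList();
    }

    @Override
    public AttributeList setAttributes(AttributeList attributes) {
        return new AttributeList();
    }

    @Override
    public Object invoke(String op, Object[] params, String[] signature)
            throws ReflectionException {
        throw new ReflectionException(new NoSuchMethodException(op));
    }

    @Override
    public MBeanInfo getMBeanInfo() {
        return new MBeanInfo(LifecycleBean.class.getName(), "Lifecycle sketch",
            null, null, null, null);
    }

    /** Register and unregister a bean on a private server; return the recorded events. */
    static List<String> registerThenUnregister() throws Exception {
        MBeanServer server = MBeanServerFactory.newMBeanServer();
        ObjectName name = new ObjectName("org.example.sketch:type=Lifecycle");
        LifecycleBean bean = new LifecycleBean();
        server.registerMBean(bean, name);
        server.unregisterMBean(name);
        return bean.events;
    }
}
```

preRegister is the natural place to capture the server and the final ObjectName, which would fit the note on getMBeanName that the name is only available after JMX registration.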

registerContainerJndi

protected static void registerContainerJndi()
                                     throws javax.management.MalformedObjectNameException,
                                            java.lang.NullPointerException,
                                            java.net.UnknownHostException,
                                            javax.naming.NamingException
Throws:
javax.management.MalformedObjectNameException
java.lang.NullPointerException
java.net.UnknownHostException
javax.naming.NamingException

registerJndi

protected static void registerJndi(javax.management.ObjectName name)
                            throws java.lang.NullPointerException,
                                   javax.naming.NamingException
Throws:
java.lang.NullPointerException
javax.naming.NamingException

deregisterJndi

protected static void deregisterJndi(javax.management.ObjectName name)
                              throws java.lang.NullPointerException,
                                     javax.naming.NamingException
Throws:
java.lang.NullPointerException
javax.naming.NamingException

getJndiContext

protected static javax.naming.Context getJndiContext()
                                              throws javax.naming.NamingException
Returns:
Jndi context for the crawler or null if none found.
Throws:
javax.naming.NamingException

getJndiContainerName

protected static javax.management.ObjectName getJndiContainerName()
                                                           throws javax.management.MalformedObjectNameException,
                                                                  java.lang.NullPointerException,
                                                                  java.net.UnknownHostException
Returns:
Jndi container name -- the name to use for the 'container' that can host zero or more heritrix instances (Return a JMX ObjectName. We use ObjectName because then we're sync'd with JMX naming and ObjectName has nice parsing).
Throws:
java.lang.NullPointerException
javax.management.MalformedObjectNameException
java.net.UnknownHostException

getInstances

public static java.util.Map getInstances()
Returns:
All registered instances of Heritrix (there is rarely more than one).

isSingleInstance

public static boolean isSingleInstance()
Returns:
True if only one instance of Heritrix.

getSingleInstance

public static Heritrix getSingleInstance()
Returns:
Returns single instance or null if no instance or multiple.
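
The relationship between getInstances, isSingleInstance, and getSingleInstance can be sketched over a plain Map, as below. SingleInstance is a hypothetical helper working on any map; it does not touch the real Heritrix instance table.

```java
import java.util.Map;

// Hypothetical helper; not part of Heritrix.
final class SingleInstance {
    /** True if exactly one instance is registered. */
    static boolean isSingle(Map<?, ?> instances) {
        return instances.size() == 1;
    }

    /** Return the only registered instance, or null if there are none or several. */
    static <V> V singleOrNull(Map<?, V> instances) {
        return isSingle(instances) ? instances.values().iterator().next() : null;
    }
}
```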


Copyright © 2003-2011 Internet Archive. All Rights Reserved.