org.archive.crawler.admin
Class CrawlJob

java.lang.Object
  extended by javax.management.NotificationBroadcasterSupport
      extended by org.archive.crawler.admin.CrawlJob
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, javax.management.MBeanRegistration, javax.management.NotificationBroadcaster, javax.management.NotificationEmitter, CrawlStatusListener

public class CrawlJob
extends javax.management.NotificationBroadcasterSupport
implements javax.management.DynamicMBean, javax.management.MBeanRegistration, CrawlStatusListener, java.io.Serializable

A CrawlJob encapsulates a 'crawl order' with any and all information and methods needed by a CrawlJobHandler to accept and execute it.

A given crawl job may also be a 'profile' for a crawl. In that case it should not be executed as a crawl but can be edited and used as a template for creating new CrawlJobs.

All of its constructors are protected since only a CrawlJobHandler should construct new CrawlJobs.
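
Because CrawlJob implements javax.management.DynamicMBean, a JMX client addresses it by attribute and operation names rather than by Java method calls. The following is a minimal sketch of that interaction using a stand-in MBean, not CrawlJob itself. The attribute name "Status", the operation name "pause", the status strings, and the ObjectName domain are all assumptions made for illustration; the real names are the values of constants such as STATUS_ATTR and PAUSE_OPER, which this page does not list.

```java
import java.lang.management.ManagementFactory;
import javax.management.Attribute;
import javax.management.AttributeList;
import javax.management.AttributeNotFoundException;
import javax.management.DynamicMBean;
import javax.management.JMException;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanInfo;
import javax.management.MBeanOperationInfo;
import javax.management.MBeanParameterInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.ReflectionException;

// Stand-in for a job MBean; NOT the real CrawlJob. All names are hypothetical.
class JobMBeanSketch implements DynamicMBean {
    private String status = "RUNNING";

    public Object getAttribute(String name) throws AttributeNotFoundException {
        if ("Status".equals(name)) {
            return status;
        }
        throw new AttributeNotFoundException(name);
    }

    public void setAttribute(Attribute attribute) throws AttributeNotFoundException {
        // The sketch exposes no writable attributes.
        throw new AttributeNotFoundException(attribute.getName());
    }

    public AttributeList getAttributes(String[] names) {
        AttributeList result = new AttributeList();
        for (String name : names) {
            try {
                result.add(new Attribute(name, getAttribute(name)));
            } catch (AttributeNotFoundException e) {
                // Unknown attributes are simply left out of the result.
            }
        }
        return result;
    }

    public AttributeList setAttributes(AttributeList attributes) {
        return new AttributeList(); // nothing is writable
    }

    public Object invoke(String operation, Object[] params, String[] signature)
            throws ReflectionException {
        if ("pause".equals(operation)) {
            status = "PAUSED";
            return null;
        }
        throw new ReflectionException(new NoSuchMethodException(operation));
    }

    public MBeanInfo getMBeanInfo() {
        MBeanAttributeInfo[] attrs = {
            new MBeanAttributeInfo("Status", "java.lang.String", "Job status",
                    true, false, false)
        };
        MBeanOperationInfo[] ops = {
            new MBeanOperationInfo("pause", "Pause the job",
                    new MBeanParameterInfo[0], "void", MBeanOperationInfo.ACTION)
        };
        return new MBeanInfo(getClass().getName(), "Sketch of a job MBean",
                attrs, null, ops, null);
    }

    // Registers the stand-in, pauses it through JMX, and returns the resulting status.
    static String pauseViaJmx(String jobName) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        try {
            ObjectName on = new ObjectName("sketch.crawler:type=CrawlJob,name=" + jobName);
            server.registerMBean(new JobMBeanSketch(), on);
            try {
                server.invoke(on, "pause", new Object[0], new String[0]);
                return (String) server.getAttribute(on, "Status");
            } finally {
                server.unregisterMBean(on);
            }
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(pauseViaJmx("demo"));
    }
}
```

A client of a real crawler would obtain an MBeanServerConnection to the remote JVM and make the same getAttribute and invoke calls against the job's registered ObjectName.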

Author:
Kristinn Sigurdsson
See Also:
CrawlJobHandler.newJob(CrawlJob, String, String, String, String, int), CrawlJobHandler.newProfile(CrawlJob, String, String, String), Serialized Form

Nested Class Summary
 class CrawlJob.MBeanCrawlController
          Subclass of CrawlController that unregisters beans when stopped.
 
Field Summary
static java.lang.String[] ATTRIBUTE_ARRAY
           
static java.util.List ATTRIBUTE_LIST
           
static java.lang.String CHECKPOINT_OPER
           
static java.lang.String CRAWL_LOG_STYLE
           
static java.lang.String CRAWL_TIME_ATTR
           
static java.lang.String CRAWLJOB_JMXMBEAN_TYPE
           
static java.lang.String CURRENT_DOC_RATE_ATTR
           
static java.lang.String CURRENT_KB_RATE_ATTR
           
static java.lang.String DISCOVERED_COUNT_ATTR
           
static java.lang.String DOC_RATE_ATTR
           
static java.lang.String DOWNLOAD_COUNT_ATTR
           
static java.lang.String DUMP_URIS_OPER
           
static java.lang.String FRONTIER_REPORT_OPER
           
static java.lang.String FRONTIER_SHORT_REPORT_ATTR
           
static java.lang.String IMPORT_URI_OPER
           
static java.lang.String IMPORT_URIS_OPER
           
static java.lang.String KB_RATE_ATTR
           
static java.lang.String NAME_ATTR
           
static java.lang.String OP_DB_STAT
           
static java.util.List ORDER_EXCLUDE
          Don't add the following crawl-order items.
static java.lang.String PAUSE_OPER
           
static int PRIORITY_AVERAGE
          average
static int PRIORITY_CRITICAL
          highest
static int PRIORITY_HIGH
          high
static int PRIORITY_LOW
          low
static int PRIORITY_MINIMAL
          lowest
static java.lang.String PROG_STATS
           
static java.lang.String PROGRESS_STATISTICS_LEGEND_OPER
           
static java.lang.String PROGRESS_STATISTICS_OPER
           
static java.lang.String RECOVERY_JOURNAL_STYLE
           
static java.lang.String RESUME_OPER
           
static java.lang.String SEEDS_REPORT_OPER
           
protected  XMLSettingsHandler settingsHandler
           
static java.lang.String STATUS_ABORTED
          Job was terminated by user input while crawling
static java.lang.String STATUS_ATTR
           
static java.lang.String STATUS_CHECKPOINTING
          Job is being checkpointed.
static java.lang.String STATUS_CREATED
          Initial value.
static java.lang.String STATUS_DELETED
          Job was deleted by user, will not be displayed in UI.
static java.lang.String STATUS_FINISHED
          Job finished normally having completed its crawl.
static java.lang.String STATUS_FINISHED_ABNORMAL
          Something went very wrong
static java.lang.String STATUS_FINISHED_DATA_LIMIT
          Job finished normally when the specified amount of data (MB) had been downloaded
static java.lang.String STATUS_FINISHED_DOCUMENT_LIMIT
          Job finished normally when the specified number of documents had been fetched.
static java.lang.String STATUS_FINISHED_TIME_LIMIT
          Job finished normally when the specified time limit was hit.
static java.lang.String STATUS_MISCONFIGURED
          Job could not be launched due to an InitializationException
static java.lang.String STATUS_PAUSED
          Job was temporarily stopped.
static java.lang.String STATUS_PENDING
          Job has been successfully submitted to a CrawlJobHandler
static java.lang.String STATUS_PREPARING
           
static java.lang.String STATUS_PROFILE
          Job is actually a profile
static java.lang.String STATUS_RUNNING
          Job is being crawled
static java.lang.String STATUS_WAITING_FOR_PAUSE
          Job is going to be temporarily stopped after active threads are finished.
static java.lang.String THREAD_COUNT_ATTR
           
static java.lang.String THREADS_REPORT_OPER
           
static java.lang.String THREADS_SHORT_REPORT_ATTR
           
static java.lang.String TOTAL_DATA_ATTR
           
static java.lang.String UID_ATTR
           
 
Constructor Summary
protected CrawlJob()
          A shutdown Constructor.
protected CrawlJob(java.io.File jobFile, CrawlJobErrorHandler errorHandler)
          A constructor for reloading jobs from disk.
  CrawlJob(java.lang.String UID, java.lang.String name, XMLSettingsHandler settingsHandler, CrawlJobErrorHandler errorHandler, int priority, java.io.File dir)
          A constructor for jobs.
  CrawlJob(java.lang.String UID, java.lang.String name, XMLSettingsHandler settingsHandler, CrawlJobErrorHandler errorHandler, int priority, java.io.File dir, java.lang.String status, boolean isProfile, boolean isNew)
           
protected CrawlJob(java.lang.String UIDandName, XMLSettingsHandler settingsHandler, CrawlJobErrorHandler errorHandler)
          A constructor for profiles.
 
Method Summary
protected  void addBdbjeAttributes(java.util.List<javax.management.openmbean.OpenMBeanAttributeInfo> attributes, java.util.List<javax.management.MBeanAttributeInfo> bdbjeAttributes, java.util.List<java.lang.String> bdbjeNamesToAdd)
           
protected  void addBdbjeOperations(java.util.List<javax.management.openmbean.OpenMBeanOperationInfo> operations, java.util.List<javax.management.MBeanOperationInfo> bdbjeOperations, java.util.List<java.lang.String> bdbjeNamesToAdd)
           
protected  void addCrawlOrderAttributes(ComplexType type, java.util.List<javax.management.openmbean.OpenMBeanAttributeInfo> attributes)
           
protected  javax.management.openmbean.OpenMBeanInfoSupport buildMBeanInfo()
          Build up the MBean info for Heritrix main.
protected  void checkpoint()
           
 void crawlCheckpoint(java.io.File checkpointDir)
          Called by CrawlController when checkpointing.
 void crawlEnded(java.lang.String sExitMessage)
          Called when a CrawlController has ended a crawl and is about to exit.
 void crawlEnding(java.lang.String sExitMessage)
          Called when a CrawlController is ending a crawl (for any reason)
 void crawlPaused(java.lang.String statusMessage)
          Called when a CrawlController is actually paused (all threads are idle).
 void crawlPausing(java.lang.String statusMessage)
          Called when a CrawlController is going to be paused.
 void crawlResuming(java.lang.String statusMessage)
          Called when a CrawlController is resuming a crawl that had been paused.
 void crawlStarted(java.lang.String message)
          Called on crawl start.
protected  CrawlController createCrawlController()
           
 long deleteURIsFromPending(java.lang.String regexpr)
          Delete any URI from the frontier of the current (paused) job that matches the specified regular expression.
 long deleteURIsFromPending(java.lang.String uriPattern, java.lang.String queuePattern)
          Delete any URI from the frontier of the current (paused) job that matches the specified regular expression.
 void dumpUris(java.lang.String filename, java.lang.String regexp, int numberOfMatches, boolean verbose)
           
protected  void flush()
          If the frontier is a HostQueuesFrontier, it needs to be flushed for the queued.
 java.lang.Object getAttribute(java.lang.String attribute_name)
           
 javax.management.AttributeList getAttributes(java.lang.String[] attributeNames)
           
 CrawlController getController()
           
protected  java.lang.Object getCrawlOrderAttribute(java.lang.String attribute_name)
           
protected  java.lang.Object getCrawlOrderAttribute(java.lang.String attribute_name, ComplexType ct)
           
 java.lang.String getCrawlStatus()
           
 java.io.File getDirectory()
          Returns the path of the job's base directory.
 java.lang.String getDisplayName()
          Return the combination of given name and UID most commonly used in administrative interface.
 CrawlJobErrorHandler getErrorHandler()
           
 java.lang.String getErrorMessage()
          Get the error message associated with this job.
 java.lang.String getFrontierOneLine()
           
 java.lang.String getFrontierReport(java.lang.String reportName)
           
protected  Heritrix getHostingHeritrix()
           
 java.lang.String getIgnoredSeeds()
          Utility method to get the stored list of ignored seed items (if any), from the last time the seeds were imported to the frontier.
 FrontierMarker getInitialMarker(java.lang.String regexpr, boolean inCacheOnly)
          Returns a URIFrontierMarker for the current, paused, job.
 java.lang.String getJmxJobName()
           
 java.lang.String getJobName()
          Returns this job's 'name'.
 int getJobPriority()
          Get this job's level of priority.
 java.lang.String getLogPath(java.lang.String log)
          Returns the absolute path of the specified log.
 javax.management.MBeanInfo getMBeanInfo()
           
protected  javax.management.ObjectName getMbeanName()
           
protected static int getNotificationsSequenceNumber()
           
 int getNumberOfJournalEntries()
           
 java.util.ArrayList<java.lang.String> getPendingURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose)
          Returns the frontier's URI list based on the provided marker.
 java.lang.String getProcessorsReport()
          Get the Processors report for the running crawl.
 java.lang.String getSettingsDirectory()
          Returns the directory where the configuration files for this job are located.
 XMLSettingsHandler getSettingsHandler()
          Returns the settings handler for this job.
 StatisticsTracking getStatisticsTracking()
           
 java.lang.String getStatus()
          Get the current status of this CrawlJob
 java.lang.String getThreadOneLine()
           
 java.lang.String getThreadsReport()
          Get the CrawlController's ToeThreads report for the running crawl.
 java.lang.String getUID()
          Returns this job's unique ID (UID), which was issued by the CrawlJobHandler when this job was first created.
 void importUri(java.lang.String uri, boolean forceFetch, boolean isSeed)
          Schedule a uri.
 void importUri(java.lang.String str, boolean forceFetch, boolean isSeed, boolean isFlush)
          Schedule a uri.
protected  int importUris(java.io.InputStream is, java.lang.String style, boolean forceRevisit)
           
protected  int importUris(java.io.InputStream is, java.lang.String style, boolean forceRevisit, boolean areSeeds)
          Import URIs.
 java.lang.String importUris(java.lang.String fileOrUrl, java.lang.String style, boolean forceRevisit)
           
 java.lang.String importUris(java.lang.String fileOrUrl, java.lang.String style, boolean forceRevisit, boolean areSeeds)
           
 java.lang.String importUris(java.lang.String file, java.lang.String style, java.lang.String force)
           
 java.lang.Object invoke(java.lang.String operationName, java.lang.Object[] params, java.lang.String[] signature)
           
 boolean isCheckpointing()
           
 boolean isCrawling()
           
 boolean isNew()
          Is this a new job?
 boolean isProfile()
          Is this job considered to be a profile?
 boolean isReadOnly()
          Is job read only?
 boolean isRunning()
          Returns true if the job is being crawled.
 void kickUpdate()
          Forward a 'kick' update to current controller if any.
 void killThread(int threadNumber, boolean replace)
          Kills a thread.
 void mustBeCrawling()
           
protected  void pause()
           
 void postDeregister()
           
 void postRegister(java.lang.Boolean registrationDone)
           
 void preDeregister()
           
 javax.management.ObjectName preRegister(javax.management.MBeanServer server, javax.management.ObjectName on)
           
protected  void resume()
           
 java.util.Collection scanCheckpoints()
          Read all the checkpoints found in the job's checkpoints directory into Checkpoint instances
 void setAttribute(javax.management.Attribute attribute)
           
protected  void setAttributeInternal(javax.management.Attribute attribute)
           
 javax.management.AttributeList setAttributes(javax.management.AttributeList attributes)
           
protected  void setCrawlOrderAttribute(java.lang.String attribute_name, ComplexType ct, javax.management.Attribute attribute)
           
 void setErrorMessage(java.lang.String string)
          Set an error message for this job.
 void setJobPriority(int priority)
          Set this job's level of priority.
 void setNew(boolean b)
          Set if the job is considered a new job or not.
 void setNumberOfJournalEntries(int numberOfJournalEntries)
           
 void setReadOnly()
          Once called no changes can be made to the settings for this job.
protected  void setRunning(boolean b)
          Set if job is being crawled.
 void setStatus(java.lang.String status)
          Set the status of this CrawlJob.
protected  CrawlController setupCrawlController()
           
 void setupForCrawlStart()
           
 void stopCrawling()
           
protected  void unregisterMBean()
           
 void writeFrontierReport(java.lang.String reportName, java.io.PrintWriter writer)
          Write the requested frontier report to the given PrintWriter
 void writeThreadsReport(java.lang.String reportName, java.io.PrintWriter writer)
          Write the requested threads report to the given PrintWriter
 
Methods inherited from class javax.management.NotificationBroadcasterSupport
addNotificationListener, getNotificationInfo, handleNotification, removeNotificationListener, removeNotificationListener, sendNotification
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

PRIORITY_MINIMAL

public static final int PRIORITY_MINIMAL
lowest

See Also:
Constant Field Values

PRIORITY_LOW

public static final int PRIORITY_LOW
low

See Also:
Constant Field Values

PRIORITY_AVERAGE

public static final int PRIORITY_AVERAGE
average

See Also:
Constant Field Values

PRIORITY_HIGH

public static final int PRIORITY_HIGH
high

See Also:
Constant Field Values

PRIORITY_CRITICAL

public static final int PRIORITY_CRITICAL
highest

See Also:
Constant Field Values

STATUS_CREATED

public static final java.lang.String STATUS_CREATED
Initial value. May not be ready to run/incomplete.

See Also:
Constant Field Values

STATUS_PENDING

public static final java.lang.String STATUS_PENDING
Job has been successfully submitted to a CrawlJobHandler

See Also:
Constant Field Values

STATUS_RUNNING

public static final java.lang.String STATUS_RUNNING
Job is being crawled

See Also:
Constant Field Values

STATUS_DELETED

public static final java.lang.String STATUS_DELETED
Job was deleted by user, will not be displayed in UI.

See Also:
Constant Field Values

STATUS_ABORTED

public static final java.lang.String STATUS_ABORTED
Job was terminated by user input while crawling

See Also:
Constant Field Values

STATUS_FINISHED_ABNORMAL

public static final java.lang.String STATUS_FINISHED_ABNORMAL
Something went very wrong

See Also:
Constant Field Values

STATUS_FINISHED

public static final java.lang.String STATUS_FINISHED
Job finished normally having completed its crawl.

See Also:
Constant Field Values

STATUS_FINISHED_TIME_LIMIT

public static final java.lang.String STATUS_FINISHED_TIME_LIMIT
Job finished normally when the specified time limit was hit.

See Also:
Constant Field Values

STATUS_FINISHED_DATA_LIMIT

public static final java.lang.String STATUS_FINISHED_DATA_LIMIT
Job finished normally when the specified amount of data (MB) had been downloaded

See Also:
Constant Field Values

STATUS_FINISHED_DOCUMENT_LIMIT

public static final java.lang.String STATUS_FINISHED_DOCUMENT_LIMIT
Job finished normally when the specified number of documents had been fetched.

See Also:
Constant Field Values

STATUS_WAITING_FOR_PAUSE

public static final java.lang.String STATUS_WAITING_FOR_PAUSE
Job is going to be temporarily stopped after active threads are finished.

See Also:
Constant Field Values

STATUS_PAUSED

public static final java.lang.String STATUS_PAUSED
Job was temporarily stopped. State is kept so it can be resumed.

See Also:
Constant Field Values

STATUS_CHECKPOINTING

public static final java.lang.String STATUS_CHECKPOINTING
Job is being checkpointed. When checkpointing has finished, the job is set back to STATUS_PAUSED (the job must first be paused before checkpointing will run).

See Also:
Constant Field Values

STATUS_MISCONFIGURED

public static final java.lang.String STATUS_MISCONFIGURED
Job could not be launched due to an InitializationException

See Also:
Constant Field Values

STATUS_PROFILE

public static final java.lang.String STATUS_PROFILE
Job is actually a profile

See Also:
Constant Field Values

STATUS_PREPARING

public static final java.lang.String STATUS_PREPARING
See Also:
Constant Field Values

settingsHandler

protected transient XMLSettingsHandler settingsHandler

RECOVERY_JOURNAL_STYLE

public static final java.lang.String RECOVERY_JOURNAL_STYLE
See Also:
Constant Field Values

CRAWL_LOG_STYLE

public static final java.lang.String CRAWL_LOG_STYLE
See Also:
Constant Field Values

CRAWLJOB_JMXMBEAN_TYPE

public static final java.lang.String CRAWLJOB_JMXMBEAN_TYPE
See Also:
Constant Field Values

NAME_ATTR

public static final java.lang.String NAME_ATTR
See Also:
Constant Field Values

UID_ATTR

public static final java.lang.String UID_ATTR
See Also:
Constant Field Values

STATUS_ATTR

public static final java.lang.String STATUS_ATTR
See Also:
Constant Field Values

FRONTIER_SHORT_REPORT_ATTR

public static final java.lang.String FRONTIER_SHORT_REPORT_ATTR
See Also:
Constant Field Values

THREADS_SHORT_REPORT_ATTR

public static final java.lang.String THREADS_SHORT_REPORT_ATTR
See Also:
Constant Field Values

TOTAL_DATA_ATTR

public static final java.lang.String TOTAL_DATA_ATTR
See Also:
Constant Field Values

CRAWL_TIME_ATTR

public static final java.lang.String CRAWL_TIME_ATTR
See Also:
Constant Field Values

DOC_RATE_ATTR

public static final java.lang.String DOC_RATE_ATTR
See Also:
Constant Field Values

CURRENT_DOC_RATE_ATTR

public static final java.lang.String CURRENT_DOC_RATE_ATTR
See Also:
Constant Field Values

KB_RATE_ATTR

public static final java.lang.String KB_RATE_ATTR
See Also:
Constant Field Values

CURRENT_KB_RATE_ATTR

public static final java.lang.String CURRENT_KB_RATE_ATTR
See Also:
Constant Field Values

THREAD_COUNT_ATTR

public static final java.lang.String THREAD_COUNT_ATTR
See Also:
Constant Field Values

DOWNLOAD_COUNT_ATTR

public static final java.lang.String DOWNLOAD_COUNT_ATTR
See Also:
Constant Field Values

DISCOVERED_COUNT_ATTR

public static final java.lang.String DISCOVERED_COUNT_ATTR
See Also:
Constant Field Values

ATTRIBUTE_ARRAY

public static final java.lang.String[] ATTRIBUTE_ARRAY

ATTRIBUTE_LIST

public static final java.util.List ATTRIBUTE_LIST

IMPORT_URI_OPER

public static final java.lang.String IMPORT_URI_OPER
See Also:
Constant Field Values

IMPORT_URIS_OPER

public static final java.lang.String IMPORT_URIS_OPER
See Also:
Constant Field Values

DUMP_URIS_OPER

public static final java.lang.String DUMP_URIS_OPER
See Also:
Constant Field Values

PAUSE_OPER

public static final java.lang.String PAUSE_OPER
See Also:
Constant Field Values

RESUME_OPER

public static final java.lang.String RESUME_OPER
See Also:
Constant Field Values

FRONTIER_REPORT_OPER

public static final java.lang.String FRONTIER_REPORT_OPER
See Also:
Constant Field Values

THREADS_REPORT_OPER

public static final java.lang.String THREADS_REPORT_OPER
See Also:
Constant Field Values

SEEDS_REPORT_OPER

public static final java.lang.String SEEDS_REPORT_OPER
See Also:
Constant Field Values

CHECKPOINT_OPER

public static final java.lang.String CHECKPOINT_OPER
See Also:
Constant Field Values

PROGRESS_STATISTICS_OPER

public static final java.lang.String PROGRESS_STATISTICS_OPER
See Also:
Constant Field Values

PROGRESS_STATISTICS_LEGEND_OPER

public static final java.lang.String PROGRESS_STATISTICS_LEGEND_OPER
See Also:
Constant Field Values

PROG_STATS

public static final java.lang.String PROG_STATS
See Also:
Constant Field Values

OP_DB_STAT

public static final java.lang.String OP_DB_STAT
See Also:
Constant Field Values

ORDER_EXCLUDE

public static final java.util.List ORDER_EXCLUDE
Don't add the following crawl-order items.

Constructor Detail

CrawlJob

protected CrawlJob()
A shutdown Constructor.


CrawlJob

public CrawlJob(java.lang.String UID,
                java.lang.String name,
                XMLSettingsHandler settingsHandler,
                CrawlJobErrorHandler errorHandler,
                int priority,
                java.io.File dir)
A constructor for jobs.

Creates jobs that are ready to crawl.

Parameters:
UID - A unique ID for this job. Typically emitted by the CrawlJobHandler.
name - The name of the job
settingsHandler - The associated settings
errorHandler - The crawl job's settings error handler. null means none is set
priority - job priority.
dir - The directory that is considered this job's working directory.

CrawlJob

protected CrawlJob(java.lang.String UIDandName,
                   XMLSettingsHandler settingsHandler,
                   CrawlJobErrorHandler errorHandler)
A constructor for profiles.

Any job created with this constructor will be considered a profile. Profiles are not stored on disk (only their settings files are stored on disk). This is because their data is predictable given any settings files.

Parameters:
UIDandName - A unique ID for this job. For profiles this is the same as name
settingsHandler - The associated settings
errorHandler - The crawl job's settings error handler. null means none is set

CrawlJob

public CrawlJob(java.lang.String UID,
                java.lang.String name,
                XMLSettingsHandler settingsHandler,
                CrawlJobErrorHandler errorHandler,
                int priority,
                java.io.File dir,
                java.lang.String status,
                boolean isProfile,
                boolean isNew)

CrawlJob

protected CrawlJob(java.io.File jobFile,
                   CrawlJobErrorHandler errorHandler)
            throws InvalidJobFileException,
                   java.io.IOException
A constructor for reloading jobs from disk. Jobs (not profiles) have their data written to persistent storage in the file system. This method is used to load the job from such storage. This is done by the CrawlJobHandler.

Proper structure of a job file (TODO: Maybe one day make this an XML file):
Line 1. UID
Line 2. Job name (string)
Line 3. Job status (string)
Line 4. is job read only (true/false)
Line 5. is job running (true/false)
Line 6. job priority (int)
Line 7. number of journal entries
Line 8. setting file (with path)
Line 9. statistics tracker file (with path)
Line 10-?. error message (String, empty for null), can be many lines
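
The line layout above can be read positionally. The following is a minimal sketch of such a reader, not the actual CrawlJob constructor: the class name JobFileSketch, its field names, and the sample values in main are hypothetical, and it signals a short file with IllegalArgumentException instead of InvalidJobFileException.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical reader for the job file layout described above; not Heritrix code.
class JobFileSketch {
    final String uid;
    final String name;
    final String status;
    final boolean readOnly;
    final boolean running;
    final int priority;
    final int journalEntries;
    final String settingsFile;
    final String statisticsFile;
    final String errorMessage; // null when the file carries no message

    JobFileSketch(List<String> lines) {
        if (lines.size() < 9) {
            throw new IllegalArgumentException(
                    "expected at least 9 lines, got " + lines.size());
        }
        uid = lines.get(0);
        name = lines.get(1);
        status = lines.get(2);
        readOnly = Boolean.parseBoolean(lines.get(3));
        running = Boolean.parseBoolean(lines.get(4));
        priority = Integer.parseInt(lines.get(5).trim());
        journalEntries = Integer.parseInt(lines.get(6).trim());
        settingsFile = lines.get(7);
        statisticsFile = lines.get(8);
        // Lines 10 onward form the (possibly multi-line) error message.
        String message = String.join("\n", lines.subList(9, lines.size()));
        errorMessage = message.isEmpty() ? null : message;
    }

    public static void main(String[] args) {
        // Sample values are made up for the demonstration.
        JobFileSketch job = new JobFileSketch(Arrays.asList(
                "20240101120000", "demo", "Pending", "false", "false", "3", "0",
                "/jobs/demo/order.xml", "/jobs/demo/stats.ser",
                "first line", "second line"));
        System.out.println(job.uid + " " + job.priority + " " + job.errorMessage);
    }
}
```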

Parameters:
jobFile - a file containing information about the job to load.
errorHandler - The crawl job's settings error handler. null means none is set
Throws:
InvalidJobFileException - if the specified file does not refer to a valid job file.
java.io.IOException - if io operations fail
Method Detail

getUID

public java.lang.String getUID()
Returns this job's unique ID (UID), which was issued by the CrawlJobHandler when this job was first created.

Returns:
This job's UID.
See Also:
CrawlJobHandler.getNextJobUID()

getJobName

public java.lang.String getJobName()
Returns this job's 'name'. The name comes from the settings for this job, need not be unique and may change. For a unique identifier use getUID().

The name corresponds to the value of the 'name' tag in the 'meta' section of the settings file.

Returns:
This job's 'name'

getDisplayName

public java.lang.String getDisplayName()
Return the combination of given name and UID most commonly used in administrative interface.

Returns:
Job's name with UID notation

setJobPriority

public void setJobPriority(int priority)
Set this job's level of priority.

Parameters:
priority - The level of priority
See Also:
getJobPriority(), PRIORITY_MINIMAL, PRIORITY_LOW, PRIORITY_AVERAGE, PRIORITY_HIGH, PRIORITY_CRITICAL

getJobPriority

public int getJobPriority()
Get this job's level of priority.

Returns:
this job's priority
See Also:
setJobPriority(int), PRIORITY_MINIMAL, PRIORITY_LOW, PRIORITY_AVERAGE, PRIORITY_HIGH, PRIORITY_CRITICAL

setReadOnly

public void setReadOnly()
Once called, no changes can be made to the settings for this job. Typically this is done once a crawl is completed and further changes to the crawl order are therefore meaningless.


isReadOnly

public boolean isReadOnly()
Is job read only?

Returns:
false until setReadOnly has been invoked, after that it returns true.

setStatus

public void setStatus(java.lang.String status)
Set the status of this CrawlJob.

Parameters:
status - Current status of CrawlJob (see constants defined here beginning with STATUS)

getCrawlStatus

public java.lang.String getCrawlStatus()
Returns:
Status of the crawler (Used by JMX).

getStatus

public java.lang.String getStatus()
Get the current status of this CrawlJob

Returns:
The current status of this CrawlJob (see constants defined here beginning with STATUS)

getSettingsHandler

public XMLSettingsHandler getSettingsHandler()
Returns the settings handler for this job. It will have been initialized.

Returns:
the settings handler for this job.

isNew

public boolean isNew()
Is this a new job?

Returns:
True if the job is new.

isProfile

public boolean isProfile()
Is this job considered to be a profile?

Returns:
True if the job is a profile.

setNew

public void setNew(boolean b)
Set if the job is considered a new job or not.

Parameters:
b - Is the job considered to be new.

isRunning

public boolean isRunning()
Returns true if the job is being crawled.

Returns:
true if the job is being crawled

setRunning

protected void setRunning(boolean b)
Set if job is being crawled.

Parameters:
b - Is job being crawled.

unregisterMBean

protected void unregisterMBean()

setupCrawlController

protected CrawlController setupCrawlController()
                                        throws InitializationException
Throws:
InitializationException

createCrawlController

protected CrawlController createCrawlController()

setupForCrawlStart

public void setupForCrawlStart()
                        throws InitializationException
Throws:
InitializationException

stopCrawling

public void stopCrawling()

getFrontierOneLine

public java.lang.String getFrontierOneLine()
Returns:
One-line Frontier report.

getFrontierReport

public java.lang.String getFrontierReport(java.lang.String reportName)
Parameters:
reportName - Name of report to write.
Returns:
A report of the frontier's status.

writeFrontierReport

public void writeFrontierReport(java.lang.String reportName,
                                java.io.PrintWriter writer)
Write the requested frontier report to the given PrintWriter

Parameters:
reportName - Name of report to write.
writer - Where to write to.

getThreadOneLine

public java.lang.String getThreadOneLine()
Returns:
One-line threads report.

getThreadsReport

public java.lang.String getThreadsReport()
Get the CrawlController's ToeThreads report for the running crawl.

Returns:
The CrawlController's ToeThreads report

writeThreadsReport

public void writeThreadsReport(java.lang.String reportName,
                               java.io.PrintWriter writer)
Write the requested threads report to the given PrintWriter

Parameters:
reportName - Name of report to write.
writer - Where to write to.

killThread

public void killThread(int threadNumber,
                       boolean replace)
Kills a thread. For details see ToePool.killThread(int, boolean).

Parameters:
threadNumber - Thread to kill.
replace - Should thread be replaced.
See Also:
ToePool.killThread(int, boolean)

getProcessorsReport

public java.lang.String getProcessorsReport()
Get the Processors report for the running crawl.

Returns:
The Processors report for the running crawl.

getSettingsDirectory

public java.lang.String getSettingsDirectory()
Returns the directory where the configuration files for this job are located.

Returns:
the directory where the configuration files for this job are located

getDirectory

public java.io.File getDirectory()
Returns the path of the job's base directory. For profiles this is always equal to new File(getSettingsDirectory()).

Returns:
the path of the job's base directory.

getErrorMessage

public java.lang.String getErrorMessage()
Get the error message associated with this job. Will return null if there is no error message.

Returns:
the error message associated with this job

setErrorMessage

public void setErrorMessage(java.lang.String string)
Set an error message for this job. Generally this only occurs if the job is misconfigured.

Parameters:
string - the error message associated with this job

getNumberOfJournalEntries

public int getNumberOfJournalEntries()
Returns:
The number of journal entries.

setNumberOfJournalEntries

public void setNumberOfJournalEntries(int numberOfJournalEntries)
Parameters:
numberOfJournalEntries - The number of journal entries to set.

getErrorHandler

public CrawlJobErrorHandler getErrorHandler()
Returns:
The error handler for this crawl job.

scanCheckpoints

public java.util.Collection scanCheckpoints()
Read all the checkpoints found in the job's checkpoints directory into Checkpoint instances

Returns:
Collection containing list of all checkpoints.

getLogPath

public java.lang.String getLogPath(java.lang.String log)
                            throws javax.management.AttributeNotFoundException,
                                   javax.management.MBeanException,
                                   javax.management.ReflectionException
Returns the absolute path of the specified log. Note: If crawl has not begun, this file may not exist.

Parameters:
log -
Returns:
the absolute path for the specified log.
Throws:
javax.management.AttributeNotFoundException
javax.management.ReflectionException
javax.management.MBeanException

pause

protected void pause()

resume

protected void resume()

checkpoint

protected void checkpoint()
                   throws java.lang.IllegalStateException
Throws:
java.lang.IllegalStateException - Thrown if crawl is not paused.
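
The rule stated here, together with the STATUS_CHECKPOINTING description (the job must be paused first and returns to paused afterwards), can be modelled as a small guard. This is a hypothetical sketch under those two statements only: the class CheckpointGuardSketch, its enum, and its method names are assumptions and do not reflect how CrawlJob or CrawlController implement checkpointing.

```java
// Hypothetical model of the pause/checkpoint rule; not the Heritrix implementation.
class CheckpointGuardSketch {
    enum Status { RUNNING, WAITING_FOR_PAUSE, PAUSED, CHECKPOINTING }

    private Status status = Status.RUNNING;

    Status getStatus() {
        return status;
    }

    // A pause was requested; active threads are still finishing.
    void requestPause() {
        status = Status.WAITING_FOR_PAUSE;
    }

    // All threads are idle, so the job is actually paused.
    void paused() {
        status = Status.PAUSED;
    }

    void resume() {
        status = Status.RUNNING;
    }

    // Refuses to checkpoint unless the job is paused, then returns to PAUSED.
    void checkpoint(Runnable work) {
        if (status != Status.PAUSED) {
            throw new IllegalStateException("crawl is not paused: " + status);
        }
        status = Status.CHECKPOINTING;
        try {
            work.run();
        } finally {
            status = Status.PAUSED;
        }
    }

    public static void main(String[] args) {
        CheckpointGuardSketch job = new CheckpointGuardSketch();
        job.requestPause();
        job.paused();
        job.checkpoint(() -> System.out.println("status during work: " + job.getStatus()));
        System.out.println("status afterwards: " + job.getStatus());
    }
}
```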

isCheckpointing

public boolean isCheckpointing()
Returns:
True if checkpointing.

flush

protected void flush()
If the frontier is a HostQueuesFrontier, it needs to be flushed for the queued.


deleteURIsFromPending

public long deleteURIsFromPending(java.lang.String regexpr)
Delete any URI from the frontier of the current (paused) job that matches the specified regular expression. If the current job is not paused (or there is no current job), nothing will be done.

Parameters:
regexpr - Regular expression to delete URIs by.
Returns:
the number of URIs deleted
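
The effect of a regular-expression delete can be illustrated on an ordinary list. This is a sketch only: the class PendingDeleteSketch is hypothetical, a plain List stands in for the frontier, and it assumes the expression must match the whole URI (Matcher.matches()), which this page does not confirm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for deleting pending URIs by pattern; not Heritrix code.
class PendingDeleteSketch {
    // Removes every URI that the pattern matches in full and returns the count.
    static long deleteMatching(List<String> pending, String regexpr) {
        Pattern pattern = Pattern.compile(regexpr);
        long deleted = 0;
        for (Iterator<String> it = pending.iterator(); it.hasNext();) {
            if (pattern.matcher(it.next()).matches()) {
                it.remove();
                deleted++;
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<String> pending = new ArrayList<>(Arrays.asList(
                "http://a.example/x.jpg",
                "http://a.example/y.html",
                "http://b.example/z.jpg"));
        long deleted = deleteMatching(pending, ".*\\.jpg");
        System.out.println(deleted + " deleted, remaining: " + pending);
    }
}
```

Under the full-match assumption, a pattern such as `\.jpg` alone would delete nothing; it needs the leading `.*` to cover the rest of the URI.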

deleteURIsFromPending

public long deleteURIsFromPending(java.lang.String uriPattern,
                                  java.lang.String queuePattern)
Delete any URI from the frontier of the current (paused) job that matches the specified regular expression. If the current job is not paused (or there is no current job) nothing will be done.

Parameters:
uriPattern - Regular expression to delete URIs by.
Returns:
the number of URIs deleted

importUris

public java.lang.String importUris(java.lang.String file,
                                   java.lang.String style,
                                   java.lang.String force)

importUris

public java.lang.String importUris(java.lang.String fileOrUrl,
                                   java.lang.String style,
                                   boolean forceRevisit)

importUris

public java.lang.String importUris(java.lang.String fileOrUrl,
                                   java.lang.String style,
                                   boolean forceRevisit,
                                   boolean areSeeds)
Parameters:
fileOrUrl - Name of file with seeds.
style - What style of seeds -- crawl log, recovery journal, or seeds file.
forceRevisit - Should we revisit even if seen before?
areSeeds - Is the file exclusively seeds?
Returns:
A display string that has a count of all added.

importUris

protected int importUris(java.io.InputStream is,
                         java.lang.String style,
                         boolean forceRevisit)

importUris

protected int importUris(java.io.InputStream is,
                         java.lang.String style,
                         boolean forceRevisit,
                         boolean areSeeds)
Import URIs.

Parameters:
is - Stream to use as URI source.
style - Style in which URIs are rendered. Currently supports recoveryJournal, crawlLog, and seeds file format (i.e. default), where the default style is a UURI per line (comments allowed).
forceRevisit - Whether we should revisit this URI even if we've visited it previously.
areSeeds - Are the imported URIs seeds?
Returns:
Count of added URIs.
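
To make the default style concrete, here is a minimal sketch of reading "a UURI per line (comments allowed)" from a stream. It assumes that comment lines start with '#' and that blank lines are skipped; neither detail is stated on this page. SeedLineReader and readUris are hypothetical names, and the sketch only collects strings rather than scheduling anything with a frontier.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the default "one URI per line" style; not Heritrix code.
final class SeedLineReader {
    /** Returns the non-blank, non-comment lines of the stream, trimmed. */
    static List<String> readUris(InputStream is) {
        List<String> uris = new ArrayList<>();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                // Assumption: '#' starts a comment line.
                if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                    continue;
                }
                uris.add(trimmed);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return uris;
    }

    public static void main(String[] args) {
        String seeds = "# my seeds\nhttp://example.com/\n\nhttp://example.org/a\n";
        InputStream is = new ByteArrayInputStream(seeds.getBytes(StandardCharsets.UTF_8));
        System.out.println("Count of URIs read: " + readUris(is).size());
    }
}
```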

importUri

public void importUri(java.lang.String uri,
                      boolean forceFetch,
                      boolean isSeed)
               throws org.apache.commons.httpclient.URIException
Schedule a uri.

Parameters:
uri - Uri to schedule.
forceFetch - Should it be force-fetched?
isSeed - True if seed.
Throws:
org.apache.commons.httpclient.URIException

importUri

public void importUri(java.lang.String str,
                      boolean forceFetch,
                      boolean isSeed,
                      boolean isFlush)
               throws org.apache.commons.httpclient.URIException
Schedule a uri.

Parameters:
str - String that can be: 1. a UURI, 2. a snippet of a crawl.log line, or 3. a snippet from the recover log. See importUris(InputStream, String, boolean) for how it subparses the lines from crawl.log and recover.log.
forceFetch - Should it be force-fetched?
isSeed - True if seed.
isFlush - If true, flush the frontier IF it implements flushing.
Throws:
org.apache.commons.httpclient.URIException

getMBeanInfo

public javax.management.MBeanInfo getMBeanInfo()
Specified by:
getMBeanInfo in interface javax.management.DynamicMBean
Returns:
Our mbean info (Needed for CrawlJob to qualify as a DynamicMBean).

buildMBeanInfo

protected javax.management.openmbean.OpenMBeanInfoSupport buildMBeanInfo()
                                                                  throws InitializationException
Build up the MBean info for Heritrix main.

Returns:
Return created mbean info instance.
Throws:
InitializationException

addBdbjeAttributes

protected void addBdbjeAttributes(java.util.List<javax.management.openmbean.OpenMBeanAttributeInfo> attributes,
                                  java.util.List<javax.management.MBeanAttributeInfo> bdbjeAttributes,
                                  java.util.List<java.lang.String> bdbjeNamesToAdd)

addBdbjeOperations

protected void addBdbjeOperations(java.util.List<javax.management.openmbean.OpenMBeanOperationInfo> operations,
                                  java.util.List<javax.management.MBeanOperationInfo> bdbjeOperations,
                                  java.util.List<java.lang.String> bdbjeNamesToAdd)

addCrawlOrderAttributes

protected void addCrawlOrderAttributes(ComplexType type,
                                       java.util.List<javax.management.openmbean.OpenMBeanAttributeInfo> attributes)

getAttribute

public java.lang.Object getAttribute(java.lang.String attribute_name)
                              throws javax.management.AttributeNotFoundException
Specified by:
getAttribute in interface javax.management.DynamicMBean
Throws:
javax.management.AttributeNotFoundException

getCrawlOrderAttribute

protected java.lang.Object getCrawlOrderAttribute(java.lang.String attribute_name)

getCrawlOrderAttribute

protected java.lang.Object getCrawlOrderAttribute(java.lang.String attribute_name,
                                                  ComplexType ct)
                                           throws javax.management.AttributeNotFoundException,
                                                  javax.management.MBeanException,
                                                  javax.management.ReflectionException
Throws:
javax.management.AttributeNotFoundException
javax.management.MBeanException
javax.management.ReflectionException

getAttributes

public javax.management.AttributeList getAttributes(java.lang.String[] attributeNames)
Specified by:
getAttributes in interface javax.management.DynamicMBean

setAttribute

public void setAttribute(javax.management.Attribute attribute)
                  throws javax.management.AttributeNotFoundException
Specified by:
setAttribute in interface javax.management.DynamicMBean
Throws:
javax.management.AttributeNotFoundException

setAttributeInternal

protected void setAttributeInternal(javax.management.Attribute attribute)
                             throws javax.management.AttributeNotFoundException
Throws:
javax.management.AttributeNotFoundException

setCrawlOrderAttribute

protected void setCrawlOrderAttribute(java.lang.String attribute_name,
                                      ComplexType ct,
                                      javax.management.Attribute attribute)
                               throws javax.management.AttributeNotFoundException,
                                      javax.management.InvalidAttributeValueException,
                                      javax.management.MBeanException,
                                      javax.management.ReflectionException
Throws:
javax.management.AttributeNotFoundException
javax.management.InvalidAttributeValueException
javax.management.MBeanException
javax.management.ReflectionException

setAttributes

public javax.management.AttributeList setAttributes(javax.management.AttributeList attributes)
Specified by:
setAttributes in interface javax.management.DynamicMBean

invoke

public java.lang.Object invoke(java.lang.String operationName,
                               java.lang.Object[] params,
                               java.lang.String[] signature)
                        throws javax.management.ReflectionException
Specified by:
invoke in interface javax.management.DynamicMBean
Throws:
javax.management.ReflectionException
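
The DynamicMBean methods listed above (getAttribute, getAttributes, setAttribute, setAttributes, invoke) are normally reached through an MBeanServer rather than called directly. The sketch below shows that round trip with a toy bean using only the standard javax.management API. ToyJobBean, its "Status" attribute, its "pause" and "resume" operations, and the ObjectName used are all hypothetical; they are not the attributes or operations that CrawlJob exposes.

```java
import javax.management.Attribute;
import javax.management.AttributeList;
import javax.management.AttributeNotFoundException;
import javax.management.DynamicMBean;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanInfo;
import javax.management.MBeanOperationInfo;
import javax.management.MBeanParameterInfo;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;
import javax.management.ReflectionException;

// Hypothetical toy bean, not Heritrix code: one read-only attribute, two operations.
final class ToyJobBean implements DynamicMBean {
    private boolean paused = false;

    @Override
    public Object getAttribute(String name) throws AttributeNotFoundException {
        if ("Status".equals(name)) {
            return paused ? "PAUSED" : "RUNNING";
        }
        throw new AttributeNotFoundException(name);
    }

    @Override
    public AttributeList getAttributes(String[] names) {
        AttributeList found = new AttributeList();
        for (String name : names) {
            try {
                found.add(new Attribute(name, getAttribute(name)));
            } catch (AttributeNotFoundException e) {
                // Unknown names are left out of the result.
            }
        }
        return found;
    }

    @Override
    public void setAttribute(Attribute attribute) throws AttributeNotFoundException {
        // Nothing is writable in this toy bean.
        throw new AttributeNotFoundException(attribute.getName());
    }

    @Override
    public AttributeList setAttributes(AttributeList attributes) {
        return new AttributeList(); // no attribute was set
    }

    @Override
    public Object invoke(String operationName, Object[] params, String[] signature)
            throws ReflectionException {
        if ("pause".equals(operationName)) {
            paused = true;
        } else if ("resume".equals(operationName)) {
            paused = false;
        } else {
            throw new ReflectionException(new NoSuchMethodException(operationName));
        }
        return null;
    }

    @Override
    public MBeanInfo getMBeanInfo() {
        MBeanAttributeInfo[] attrs = {
            new MBeanAttributeInfo("Status", "java.lang.String", "Job status", true, false, false)
        };
        MBeanOperationInfo[] ops = {
            new MBeanOperationInfo("pause", "Pause the job",
                    new MBeanParameterInfo[0], "void", MBeanOperationInfo.ACTION),
            new MBeanOperationInfo("resume", "Resume the job",
                    new MBeanParameterInfo[0], "void", MBeanOperationInfo.ACTION)
        };
        return new MBeanInfo(ToyJobBean.class.getName(), "Toy job bean", attrs, null, ops, null);
    }

    /** Registers the bean on a private MBeanServer and drives it the way a JMX client would. */
    static String demo() {
        try {
            MBeanServer server = MBeanServerFactory.newMBeanServer();
            ObjectName name = new ObjectName("example:type=ToyJob,name=toy");
            server.registerMBean(new ToyJobBean(), name);
            String before = (String) server.getAttribute(name, "Status");
            server.invoke(name, "pause", new Object[0], new String[0]);
            String after = (String) server.getAttribute(name, "Status");
            return before + " -> " + after;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The sketch uses MBeanServerFactory.newMBeanServer() so that it does not touch the platform MBean server; a real client would look up the server that hosts the job instead.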

mustBeCrawling

public void mustBeCrawling()

isCrawling

public boolean isCrawling()

getIgnoredSeeds

public java.lang.String getIgnoredSeeds()
Utility method to get the stored list of ignored seed items (if any), from the last time the seeds were imported to the frontier.

Returns:
String of all ignored seed items, or null if none

kickUpdate

public void kickUpdate()
Forward a 'kick' update to current controller if any.

See Also:
CrawlController.kickUpdate()

getInitialMarker

public FrontierMarker getInitialMarker(java.lang.String regexpr,
                                       boolean inCacheOnly)
Returns a URIFrontierMarker for the current, paused job. If there is no current job, or it is not paused, null will be returned.

Parameters:
regexpr - A regular expression that each URI must match in order to be considered 'within' the marker.
inCacheOnly - Limit marker scope to 'cached' URIs.
Returns:
a URIFrontierMarker for the current job.
See Also:
getPendingURIsList(FrontierMarker, int, boolean), Frontier.getInitialMarker(String, boolean), FrontierMarker

getPendingURIsList

public java.util.ArrayList<java.lang.String> getPendingURIsList(FrontierMarker marker,
                                                                int numberOfMatches,
                                                                boolean verbose)
                                                         throws InvalidFrontierMarkerException
Returns the frontier's URI list based on the provided marker. This method will return null if there is no current job or if the current job is not paused. Only when there is a paused current job will this method return a URI list.

Parameters:
marker - URIFrontier marker
numberOfMatches - Maximum number of matches to return
verbose - Should detailed info be provided on each URI?
Returns:
the frontier's URI list based on the provided marker
Throws:
InvalidFrontierMarkerException - When marker is inconsistent with the current state of the frontier.
See Also:
getInitialMarker(String, boolean), FrontierMarker
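
getInitialMarker and getPendingURIsList together form a paging pattern: the marker carries the filter and the position, and each call returns at most numberOfMatches entries. The sketch below imitates that pattern over an in-memory list. MarkerPagingSketch, its nested Marker class and nextPage are hypothetical names; the real FrontierMarker is opaque to callers, and how it tracks position inside the frontier is not described on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of marker-based paging; not Heritrix code.
final class MarkerPagingSketch {
    /** Remembers the filter and where the next page starts. */
    static final class Marker {
        final Pattern pattern;
        int nextIndex = 0;

        Marker(String regexpr) {
            this.pattern = Pattern.compile(regexpr);
        }
    }

    /** Returns up to numberOfMatches matching URIs and advances the marker past them. */
    static ArrayList<String> nextPage(List<String> pending, Marker marker, int numberOfMatches) {
        ArrayList<String> page = new ArrayList<>();
        while (marker.nextIndex < pending.size() && page.size() < numberOfMatches) {
            String uri = pending.get(marker.nextIndex++);
            if (marker.pattern.matcher(uri).matches()) {
                page.add(uri);
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<String> pending = Arrays.asList(
                "http://example.com/a.html",
                "http://example.com/b.jpg",
                "http://example.com/c.html",
                "http://example.com/d.html");
        Marker marker = new Marker(".*\\.html");
        List<String> page;
        // Keep asking for pages of two until an empty page signals the end.
        while (!(page = nextPage(pending, marker, 2)).isEmpty()) {
            System.out.println(page);
        }
    }
}
```

Because the marker holds a position, it becomes stale if the underlying list changes between calls; that is the situation the InvalidFrontierMarkerException above reports for a real frontier.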

dumpUris

public void dumpUris(java.lang.String filename,
                     java.lang.String regexp,
                     int numberOfMatches,
                     boolean verbose)

crawlStarted

public void crawlStarted(java.lang.String message)
Description copied from interface: CrawlStatusListener
Called on crawl start.

Specified by:
crawlStarted in interface CrawlStatusListener
Parameters:
message - Start message.

crawlEnding

public void crawlEnding(java.lang.String sExitMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is ending a crawl (for any reason)

Specified by:
crawlEnding in interface CrawlStatusListener
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlJob

crawlEnded

public void crawlEnded(java.lang.String sExitMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController has ended a crawl and is about to exit.

Specified by:
crawlEnded in interface CrawlStatusListener
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlJob

crawlPausing

public void crawlPausing(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is going to be paused.

Specified by:
crawlPausing in interface CrawlStatusListener
Parameters:
statusMessage - Should be STATUS_WAITING_FOR_PAUSE. Passed for convenience

crawlPaused

public void crawlPaused(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is actually paused (all threads are idle).

Specified by:
crawlPaused in interface CrawlStatusListener
Parameters:
statusMessage - Should be STATUS_PAUSED. Passed for convenience

crawlResuming

public void crawlResuming(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is resuming a crawl that had been paused.

Specified by:
crawlResuming in interface CrawlStatusListener
Parameters:
statusMessage - Should be STATUS_RUNNING. Passed for convenience

crawlCheckpoint

public void crawlCheckpoint(java.io.File checkpointDir)
                     throws java.lang.Exception
Description copied from interface: CrawlStatusListener
Called by CrawlController when checkpointing.

Specified by:
crawlCheckpoint in interface CrawlStatusListener
Parameters:
checkpointDir - Checkpoint dir. Write checkpoint state here.
Throws:
java.lang.Exception - A fatal exception. Any exceptions that are let out of this checkpoint are assumed fatal and terminate further checkpoint processing.

getController

public CrawlController getController()

preRegister

public javax.management.ObjectName preRegister(javax.management.MBeanServer server,
                                               javax.management.ObjectName on)
                                        throws java.lang.Exception
Specified by:
preRegister in interface javax.management.MBeanRegistration
Throws:
java.lang.Exception

postRegister

public void postRegister(java.lang.Boolean registrationDone)
Specified by:
postRegister in interface javax.management.MBeanRegistration

preDeregister

public void preDeregister()
                   throws java.lang.Exception
Specified by:
preDeregister in interface javax.management.MBeanRegistration
Throws:
java.lang.Exception

postDeregister

public void postDeregister()
Specified by:
postDeregister in interface javax.management.MBeanRegistration

getHostingHeritrix

protected Heritrix getHostingHeritrix()
Returns:
Heritrix that is hosting this job.

getJmxJobName

public java.lang.String getJmxJobName()
Returns:
Unique name for job that is safe to use in jmx (Like display name but without spaces).

getNotificationsSequenceNumber

protected static int getNotificationsSequenceNumber()
Returns:
Notification sequence number (Does increment after each access).
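
A sequence number that increments on each access is typically passed to the javax.management.Notification constructor so that listeners can order events. The sketch below shows that use with NotificationBroadcasterSupport. SequencedNotifier, announce and the notification type strings are hypothetical, and the AtomicInteger counter is a choice made for this sketch; this page does not say how CrawlJob stores its counter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import javax.management.Notification;
import javax.management.NotificationBroadcasterSupport;

// Hypothetical sketch of sequence-numbered JMX notifications; not Heritrix code.
final class SequencedNotifier extends NotificationBroadcasterSupport {
    // Shared by all instances; AtomicInteger is this sketch's choice.
    private static final AtomicInteger SEQUENCE = new AtomicInteger();

    /** Returns the current sequence number, then increments it. */
    static int nextSequenceNumber() {
        return SEQUENCE.getAndIncrement();
    }

    void announce(String type, String message) {
        sendNotification(new Notification(type, this, nextSequenceNumber(), message));
    }

    /** Sends two notifications and returns the sequence numbers a listener saw. */
    static List<Long> demo() {
        List<Long> seen = new ArrayList<>();
        SequencedNotifier notifier = new SequencedNotifier();
        notifier.addNotificationListener(
                (notification, handback) -> seen.add(notification.getSequenceNumber()),
                null, null);
        notifier.announce("example.crawlStarted", "started");
        notifier.announce("example.crawlPaused", "paused");
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("sequence numbers seen: " + demo());
    }
}
```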

getMbeanName

protected javax.management.ObjectName getMbeanName()

getStatisticsTracking

public StatisticsTracking getStatisticsTracking()
Returns:
the statistics tracking instance (or null if none yet available).


Copyright © 2003-2011 Internet Archive. All Rights Reserved.