java.lang.Object
    org.archive.crawler.admin.CrawlJobHandler

public class CrawlJobHandler
This class manages CrawlJobs. Submitted crawl jobs are queued up and run in order when the crawler is running.
Basically this provides a layer between any potential user interface and the CrawlJobs. It keeps the lists of completed jobs, pending jobs, etc.
The jobs managed by the handler can be divided into the following categories:

Pending - Jobs that are ready to run and are waiting their turn. These can be edited, viewed, deleted, etc.

Running - Only one job can be running at a time; there may be no job running. The running job can be viewed and edited to some extent. It can also be terminated. This job should have a StatisticsTracking module attached to it for more details on the crawl.

Completed - Jobs that have finished crawling, or that have been deleted from the pending queue or terminated while running. They cannot be edited but can be viewed. They retain the StatisticsTracking module from their run.

New job - At any given time there can be one 'new job'. The new job is not considered ready to run. It can be edited or discarded (in which case it will be totally destroyed, including any files on disk). Once an operator deems the job ready to run, it can be moved to the pending queue.

Profiles - Jobs under profiles are not actual jobs. They can be edited normally but cannot be submitted to the pending queue. New jobs can be created using a profile as their template.

See Also:
    CrawlJob
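The lifecycle above (new job, pending, running, completed) can be sketched as a minimal state model. The class and method names below are illustrative stand-ins chosen to mirror the handler's documented behavior; they are not the actual CrawlJobHandler API:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Illustrative sketch of the handler's job categories; not the actual
// CrawlJobHandler API. Jobs move: new job -> pending -> running -> completed.
class JobLifecycleSketch {
    final Queue<String> pending = new ArrayDeque<>();
    final List<String> completed = new ArrayList<>();
    String running;   // at most one job runs at a time (may be none)
    String newJob;    // at most one 'new job' being configured

    void createNewJob(String name) { newJob = name; }

    // Operator deems the new job ready: move it to the pending queue.
    void submitNewJob() { pending.add(newJob); newJob = null; }

    // Discarding the new job destroys it entirely.
    void discardNewJob() { newJob = null; }

    // Start the next pending job if nothing is currently running.
    void startNextJob() {
        if (running == null && !pending.isEmpty()) running = pending.poll();
    }

    // A finished (or terminated) job is retained under 'completed'.
    void jobEnded() { completed.add(running); running = null; }
}
```

For example, `createNewJob("crawl-1")` followed by `submitNewJob()` and `startNextJob()` leaves "crawl-1" as the running job; `jobEnded()` then moves it to the completed list, where it can still be viewed but no longer edited.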
Field Summary

static java.lang.String DEFAULT_PROFILE
    Default profile name.
static java.lang.String DEFAULT_PROFILE_NAME
    Name of the system property whose specification overrides the default profile used.
static java.lang.String ORDER_FILE_NAME
static java.lang.String PROFILES_DIR_NAME
    Name of the profiles directory.
static java.lang.String RECOVER_LOG
    String to indicate recovery should be based on the recovery log, not on checkpointing.
Constructor Summary

CrawlJobHandler(java.io.File jobsDir)
    Constructor.
CrawlJobHandler(java.io.File jobsDir, boolean loadJobs, boolean loadProfiles)
    Constructor allowing for optional loading of profiles and jobs.
Method Summary

CrawlJob addJob(CrawlJob job)
    Submit a job to the handler.
void addProfile(CrawlJob profile)
    Add a new profile.
protected void checkDirectory(java.io.File dir)
void checkpointJob()
    Cause the current job to write a checkpoint to disk.
void crawlCheckpoint(java.io.File checkpointDir)
    Called by CrawlController when checkpointing.
void crawlEnded(java.lang.String sExitMessage)
    Called when a CrawlController has ended a crawl and is about to exit.
void crawlEnding(java.lang.String sExitMessage)
    Called when a CrawlController is ending a crawl (for any reason).
void crawlPaused(java.lang.String statusMessage)
    Called when a CrawlController is actually paused (all threads are idle).
void crawlPausing(java.lang.String statusMessage)
    Called when a CrawlController is going to be paused.
void crawlResuming(java.lang.String statusMessage)
    Called when a CrawlController is resuming a crawl that had been paused.
void crawlStarted(java.lang.String message)
    Called on crawl start.
protected CrawlJob createNewJob(java.io.File orderFile, java.lang.String name, java.lang.String description, java.lang.String seeds, int priority)
protected XMLSettingsHandler createSettingsHandler(java.io.File orderFile, java.lang.String name, java.lang.String description, java.lang.String seeds, java.io.File newSettingsDir, CrawlJobErrorHandler errorHandler, java.lang.String filename, java.lang.String seedfile)
    Creates a new settings handler based on an existing job.
void deleteJob(java.lang.String jobUID)
    The specified job will be removed from the pending queue, or aborted if currently running.
void deleteProfile(CrawlJob cj)
long deleteURIsFromPending(java.lang.String regexpr)
    Delete any URIs from the frontier of the current (paused) job that match the specified regular expression.
long deleteURIsFromPending(java.lang.String uriPattern, java.lang.String queuePattern)
    Delete any URIs from the frontier of the current (paused) job that match the specified regular expression.
void discardNewJob()
    Discard the handler's 'new job'.
protected void doFlush()
    If it is a HostQueuesFrontier, it needs to be flushed for the queued URIs.
static CrawlJob ensureNewJobWritten(CrawlJob newJob, java.lang.String metaname, java.lang.String description)
    Ensure the order file with the new name/description is written.
java.util.List<CrawlJob> getCompletedJobs()
CrawlJob getCurrentJob()
CrawlJob getDefaultProfile()
    Returns the default profile.
FrontierMarker getInitialMarker(java.lang.String regexpr, boolean inCacheOnly)
    Returns a URIFrontierMarker for the current, paused, job.
CrawlJob getJob(java.lang.String jobUID)
    Return a job with the given UID.
CrawlJob getNewJob()
    Get the handler's 'new job'.
java.lang.String getNextJobUID()
    Returns a unique job ID.
java.util.List<CrawlJob> getPendingJobs()
    A List of all pending jobs.
java.util.ArrayList getPendingURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose)
    Returns the frontier's URI list based on the provided marker.
java.util.List<CrawlJob> getProfiles()
    Returns a List of all known profiles.
protected java.io.File getStateJobFile(java.io.File jobDir)
    Find the state.job file in the job directory.
void importUri(java.lang.String uri, boolean forceFetch, boolean isSeed)
    Schedule a URI.
void importUri(java.lang.String str, boolean forceFetch, boolean isSeed, boolean isFlush)
    Schedule a URI.
protected int importUris(java.io.InputStream is, java.lang.String style, boolean forceRevisit)
java.lang.String importUris(java.lang.String fileOrUrl, java.lang.String style, boolean forceRevisit)
java.lang.String importUris(java.lang.String file, java.lang.String style, java.lang.String force)
boolean isCrawling()
    Is a crawl job being crawled?
boolean isRunning()
    Is the crawler accepting crawl jobs to run?
void kickUpdate()
    Forward a 'kick' update to the current job, if any.
protected void loadJob(java.io.File job)
    Loads a job given a specific job file.
static java.util.ArrayList<java.lang.String> loadOptions(java.lang.String file)
    Loads options from a file.
protected boolean loadProfile(java.io.File profile)
    Load one profile.
CrawlJob newJob(CrawlJob baseOn, java.lang.String recovery, java.lang.String name, java.lang.String description, java.lang.String seeds, int priority)
    Creates a new job.
CrawlJob newJob(java.io.File orderFile, java.lang.String name, java.lang.String description, java.lang.String seeds)
    Creates a new job.
CrawlJob newProfile(CrawlJob baseOn, java.lang.String name, java.lang.String description, java.lang.String seeds)
    Creates a new profile.
void pauseJob()
    Cause the current job to pause.
void requestCrawlStop()
void resumeJob()
    Cause the current job to resume crawling if it was paused.
void setDefaultProfile(CrawlJob profile)
    Set the default profile.
void startCrawler()
    Allow jobs to be crawled.
protected void startNextJob()
    Start the next crawl job.
protected void startNextJobInternal()
void stop()
void stopCrawler()
    Stop future jobs from being crawled.
boolean terminateCurrentJob()
protected void updateRecoveryPaths(java.io.File recover, SettingsHandler sh, java.lang.String jobName)
Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail

public static final java.lang.String DEFAULT_PROFILE_NAME
public static final java.lang.String DEFAULT_PROFILE
public static final java.lang.String PROFILES_DIR_NAME
public static final java.lang.String ORDER_FILE_NAME
public static final java.lang.String RECOVER_LOG
Constructor Detail

public CrawlJobHandler(java.io.File jobsDir)
    Parameters:
        jobsDir - Jobs directory.

public CrawlJobHandler(java.io.File jobsDir, boolean loadJobs, boolean loadProfiles)
    Parameters:
        jobsDir - Jobs directory.
        loadJobs - If true then any applicable jobs will be loaded.
        loadProfiles - If true then any applicable profiles will be loaded.

Method Detail
protected java.io.File getStateJobFile(java.io.File jobDir)
    Parameters:
        jobDir - Directory to look in.

protected void loadJob(java.io.File job)
    Parameters:
        job - The job file of the job to load.

protected boolean loadProfile(java.io.File profile)
    Parameters:
        profile - Profile to load.
public void addProfile(CrawlJob profile)
    Parameters:
        profile - The new profile.

public void deleteProfile(CrawlJob cj) throws java.io.IOException
    Throws:
        java.io.IOException

public java.util.List<CrawlJob> getProfiles()

public CrawlJob addJob(CrawlJob job)
    Parameters:
        job - A new job for the handler.

public CrawlJob getDefaultProfile()

public void setDefaultProfile(CrawlJob profile)
    Parameters:
        profile - The new default profile. The following must apply to it: profile.isProfile() should return true, and this.getProfiles() should contain it.

public java.util.List<CrawlJob> getPendingJobs()
public CrawlJob getCurrentJob()

public java.util.List<CrawlJob> getCompletedJobs()

public CrawlJob getJob(java.lang.String jobUID)
    Parameters:
        jobUID - The unique ID of the job.

public boolean terminateCurrentJob()

public void deleteJob(java.lang.String jobUID)
    Parameters:
        jobUID - The UID (unique ID) of the job that is to be deleted.

public void pauseJob()

public void resumeJob()

public void checkpointJob() throws java.lang.IllegalStateException
    Throws:
        java.lang.IllegalStateException - Thrown if the crawl is not paused.

public java.lang.String getNextJobUID()
    No two calls to this method (on the same instance of this class) can ever return the same value. Currently implemented to return a time stamp; that is subject to change, though.
    See Also:
        ArchiveUtils.TIMESTAMP17
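The getNextJobUID contract (timestamp-derived, never repeating per handler instance) can be sketched as follows. The 17-digit yyyyMMddHHmmssSSS layout is an assumption made for illustration, inferred from the TIMESTAMP17 name; this is not the Heritrix implementation:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch of a never-repeating, timestamp-derived job UID in the spirit of
// ArchiveUtils.TIMESTAMP17. Illustrative only; not the Heritrix code.
class JobUidSketch {
    private String last = "";

    // 17 digits: yyyyMMddHHmmssSSS (assumed format).
    synchronized String getNextJobUID() {
        String uid = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date());
        // Two calls within the same millisecond would collide, so bump
        // numerically past the previous value to keep UIDs unique.
        if (uid.compareTo(last) <= 0) {
            uid = String.format("%017d", Long.parseLong(last) + 1);
        }
        last = uid;
        return uid;
    }
}
```

The bump-on-collision step is what makes the "no two calls ever return the same value" guarantee hold even when the clock has not advanced between calls.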
public CrawlJob newJob(CrawlJob baseOn, java.lang.String recovery, java.lang.String name, java.lang.String description, java.lang.String seeds, int priority) throws FatalConfigurationException
    Parameters:
        baseOn - A CrawlJob (with a valid settings handler) to use as the template for the new job.
        recovery - Whether to preinitialize the new job as a recovery of the baseOn job. The string holds RECOVER_LOG if the recovery is to be based on the recover.gz log (see RecoveryJournal in the frontier package), or it holds the name of the checkpoint to use for recovery.
        name - The name of the new job.
        description - Description of the job.
        seeds - The contents of the new settings' seed file.
        priority - The priority of the new job.
    Throws:
        FatalConfigurationException - If a problem occurs creating the settings.

public CrawlJob newJob(java.io.File orderFile, java.lang.String name, java.lang.String description, java.lang.String seeds) throws FatalConfigurationException
    Parameters:
        orderFile - Order file to use as the template for the new job.
        name - The name of the new job.
        description - Description of the job.
        seeds - The contents of the new settings' seed file.
    Throws:
        FatalConfigurationException - If a problem occurs creating the settings.

protected void checkDirectory(java.io.File dir) throws FatalConfigurationException
    Throws:
        FatalConfigurationException

protected CrawlJob createNewJob(java.io.File orderFile, java.lang.String name, java.lang.String description, java.lang.String seeds, int priority) throws FatalConfigurationException
    Throws:
        FatalConfigurationException
public CrawlJob newProfile(CrawlJob baseOn, java.lang.String name, java.lang.String description, java.lang.String seeds) throws FatalConfigurationException, java.io.IOException
    Parameters:
        baseOn - A CrawlJob (with a valid settings handler) to use as the template for the new profile.
        name - The name of the new profile.
        description - Description of the new profile.
        seeds - The contents of the new profile's seed file.
    Throws:
        FatalConfigurationException
        java.io.IOException

protected XMLSettingsHandler createSettingsHandler(java.io.File orderFile, java.lang.String name, java.lang.String description, java.lang.String seeds, java.io.File newSettingsDir, CrawlJobErrorHandler errorHandler, java.lang.String filename, java.lang.String seedfile) throws FatalConfigurationException
    Creates a new settings handler based on an existing job.
    Parameters:
        orderFile - Order file to base the new order file on. Cannot be null.
        name - Name for the new settings.
        description - Description of the new settings.
        seeds - The contents of the new settings' seed file.
        newSettingsDir -
        errorHandler -
        filename - Name of the new order file.
        seedfile - Name of the new seeds file.
    Throws:
        FatalConfigurationException - If there are problems reading the 'base on' configuration, or writing the new configuration or its seed file.

protected void updateRecoveryPaths(java.io.File recover, SettingsHandler sh, java.lang.String jobName) throws FatalConfigurationException
    Parameters:
        recover - Source to use when recovering. Can be the full path to a recovery log or the full path to a checkpoint source directory.
        sh - Settings handler to update.
        jobName - Name of this job.
    Throws:
        FatalConfigurationException
public void discardNewJob()
public CrawlJob getNewJob()
public boolean isRunning()
public boolean isCrawling()
public void startCrawler()
public void stopCrawler()
protected final void startNextJob()
protected void startNextJobInternal()
public void kickUpdate()
public static java.util.ArrayList<java.lang.String> loadOptions(java.lang.String file) throws java.io.IOException
    Options are loaded from the CLASSPATH.
    Parameters:
        file - the name of the option file (without path!)
    Throws:
        java.io.IOException - when there is trouble reading the file.

public FrontierMarker getInitialMarker(java.lang.String regexpr, boolean inCacheOnly)
    Parameters:
        regexpr - A regular expression that each URI must match in order to be considered 'within' the marker.
        inCacheOnly - Limit marker scope to 'cached' URIs.
    See Also:
        getPendingURIsList(FrontierMarker, int, boolean), Frontier.getInitialMarker(String, boolean), FrontierMarker
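The marker pattern behind getInitialMarker and getPendingURIsList (obtain a marker once, then repeatedly fetch up to numberOfMatches URIs from where the previous page left off) can be modeled with a simple cursor. The class names below are illustrative stand-ins, not the actual FrontierMarker API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative cursor-based paging over a frontier's pending URIs,
// mirroring the getInitialMarker/getPendingURIsList pattern.
// Not the actual FrontierMarker API.
class MarkerSketch {
    final Pattern pattern;  // URIs must match to be 'within' the marker
    int position = 0;       // where the previous page ended

    MarkerSketch(String regexpr) { this.pattern = Pattern.compile(regexpr); }
}

class FrontierSketch {
    final List<String> pendingUris;

    FrontierSketch(List<String> uris) { this.pendingUris = uris; }

    MarkerSketch getInitialMarker(String regexpr) { return new MarkerSketch(regexpr); }

    // Return up to numberOfMatches matching URIs, advancing the marker so the
    // next call resumes where this one stopped.
    List<String> getPendingURIsList(MarkerSketch marker, int numberOfMatches) {
        List<String> page = new ArrayList<>();
        while (marker.position < pendingUris.size() && page.size() < numberOfMatches) {
            String uri = pendingUris.get(marker.position++);
            if (marker.pattern.matcher(uri).matches()) page.add(uri);
        }
        return page;
    }
}
```

A marker that has fallen out of sync with the frontier is what the real API reports via InvalidFrontierMarkerException; in this sketch the cursor simply runs off the end of the list.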
public java.util.ArrayList getPendingURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose) throws InvalidFrontierMarkerException
    Parameters:
        marker - URIFrontier marker
        numberOfMatches - maximum number of matches to return
        verbose - should detailed info be provided on each URI?
    Throws:
        InvalidFrontierMarkerException - When the marker is inconsistent with the current state of the frontier.
    See Also:
        getInitialMarker(String, boolean), FrontierMarker

public long deleteURIsFromPending(java.lang.String regexpr)
    Parameters:
        regexpr - Regular expression to delete URIs by.

public long deleteURIsFromPending(java.lang.String uriPattern, java.lang.String queuePattern)
    Parameters:
        uriPattern - Regular expression to delete URIs by.
        queuePattern - Regular expression of target queues (or null for all).
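The two-pattern overload deletes pending URIs whose URI matches uriPattern, restricted to queues matching queuePattern (null meaning all queues). A self-contained sketch of that filtering over an assumed queue-to-URIs map (the real frontier's internal structures differ):

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Sketch of deleteURIsFromPending(uriPattern, queuePattern) semantics over a
// simple queue -> pending-URIs map; illustrative, not the Heritrix frontier.
class PendingDeleteSketch {
    final Map<String, List<String>> queues = new HashMap<>();

    // Returns how many URIs were deleted.
    long deleteURIsFromPending(String uriPattern, String queuePattern) {
        Pattern uriRe = Pattern.compile(uriPattern);
        Pattern queueRe = queuePattern == null ? null : Pattern.compile(queuePattern);
        long deleted = 0;
        for (Map.Entry<String, List<String>> e : queues.entrySet()) {
            // null queuePattern means: apply to every queue.
            if (queueRe != null && !queueRe.matcher(e.getKey()).matches()) continue;
            Iterator<String> it = e.getValue().iterator();
            while (it.hasNext()) {
                if (uriRe.matcher(it.next()).matches()) { it.remove(); deleted++; }
            }
        }
        return deleted;
    }
}
```

Note that, as documented, this operation only applies to the current job while it is paused.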
public java.lang.String importUris(java.lang.String file, java.lang.String style, java.lang.String force)

public java.lang.String importUris(java.lang.String fileOrUrl, java.lang.String style, boolean forceRevisit)
    Parameters:
        fileOrUrl - Name of file w/ seeds.
        style - What style of seeds -- crawl log (crawlLog style), recovery journal (recoveryJournal style), or seeds file style (pass default style).
        forceRevisit - Should we revisit even if seen before?

protected int importUris(java.io.InputStream is, java.lang.String style, boolean forceRevisit)

public void importUri(java.lang.String uri, boolean forceFetch, boolean isSeed) throws org.apache.commons.httpclient.URIException
    Parameters:
        uri - URI to schedule.
        forceFetch - Should it be force-fetched?
        isSeed - True if seed.
    Throws:
        org.apache.commons.httpclient.URIException

public void importUri(java.lang.String str, boolean forceFetch, boolean isSeed, boolean isFlush) throws org.apache.commons.httpclient.URIException
    Parameters:
        str - String that can be: 1. a UURI, 2. a snippet of a crawl.log line, or 3. a snippet from a recover log. See importUris(InputStream, String, boolean) for how it subparses the lines from crawl.log and recover.log.
        forceFetch - Should it be force-fetched?
        isSeed - True if seed.
        isFlush - If true, flush the frontier IF it implements flushing.
    Throws:
        org.apache.commons.httpclient.URIException
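Since importUri accepts either a bare URI or a snippet of a crawl.log or recover-log line, some subparsing is needed to find the schedulable URI. A hedged sketch of the idea: scan whitespace-separated tokens and take the first that looks like an absolute URI. The real parsing in importUris(InputStream, String, boolean) is style-aware and more involved; this heuristic is an assumption for illustration only:

```java
// Hedged sketch of pulling a schedulable URI out of a bare URI or a
// log-line snippet; the actual crawl.log/recover.log parsing differs.
class UriSnippetSketch {
    static String extractUri(String str) {
        for (String token : str.trim().split("\\s+")) {
            // Take the first token that looks like an absolute URI
            // (scheme://rest, per the generic URI syntax).
            if (token.matches("[a-zA-Z][a-zA-Z0-9+.-]*://\\S+")) return token;
        }
        throw new IllegalArgumentException("No URI found in: " + str);
    }
}
```

A bare URI passes through unchanged, while a log-style line ("timestamp status size http://… path") yields just the URI token.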
protected void doFlush()

public void stop()

public void requestCrawlStop()

public static CrawlJob ensureNewJobWritten(CrawlJob newJob, java.lang.String metaname, java.lang.String description)
    Ensure the order file with the new name/description is written.
    Parameters:
        newJob - Newly created job.
        metaname - Metaname for the new job.
        description - Description for the new job.
    Returns:
        newJob
public void crawlStarted(java.lang.String message)
    Specified by:
        crawlStarted in interface CrawlStatusListener
    Parameters:
        message - Start message.

public void crawlEnding(java.lang.String sExitMessage)
    Specified by:
        crawlEnding in interface CrawlStatusListener
    Parameters:
        sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
    See Also:
        CrawlJob

public void crawlEnded(java.lang.String sExitMessage)
    Specified by:
        crawlEnded in interface CrawlStatusListener
    Parameters:
        sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
    See Also:
        CrawlJob

public void crawlPausing(java.lang.String statusMessage)
    Specified by:
        crawlPausing in interface CrawlStatusListener
    Parameters:
        statusMessage - Should be STATUS_WAITING_FOR_PAUSE. Passed for convenience.

public void crawlPaused(java.lang.String statusMessage)
    Specified by:
        crawlPaused in interface CrawlStatusListener
    Parameters:
        statusMessage - Should be CrawlJob.STATUS_PAUSED. Passed for convenience.

public void crawlResuming(java.lang.String statusMessage)
    Specified by:
        crawlResuming in interface CrawlStatusListener
    Parameters:
        statusMessage - Should be CrawlJob.STATUS_RUNNING. Passed for convenience.

public void crawlCheckpoint(java.io.File checkpointDir) throws java.lang.Exception
    Called by CrawlController when checkpointing.
    Specified by:
        crawlCheckpoint in interface CrawlStatusListener
    Parameters:
        checkpointDir - Checkpoint dir. Write checkpoint state here.
    Throws:
        java.lang.Exception - A fatal exception. Any exceptions that are let out of this checkpoint are assumed fatal and terminate further checkpoint processing.
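The crawlCheckpoint contract (write state into the supplied directory; let any exception escape, since the caller treats it as fatal) can be sketched as follows. The state file name and contents here are assumptions for illustration; this is not the Heritrix implementation:

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

// Sketch of a crawlCheckpoint-style callback: persist some state into the
// supplied checkpoint directory. Any exception is simply propagated, since
// the caller treats escaped exceptions as fatal to the checkpoint.
// Illustrative only; not the Heritrix code.
class CheckpointSketch {
    void crawlCheckpoint(File checkpointDir) throws IOException {
        if (!checkpointDir.isDirectory() && !checkpointDir.mkdirs()) {
            throw new IOException("Cannot create " + checkpointDir);
        }
        // Assumed file name and contents, chosen for the sketch.
        try (FileWriter w = new FileWriter(new File(checkpointDir, "handler-state.txt"))) {
            w.write("status=CHECKPOINTED\n");
        }
    }
}
```

Because nothing is caught here, an I/O failure surfaces to the checkpointing CrawlController, matching the documented "assumed fatal" behavior.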