java.lang.Object
    org.archive.crawler.admin.CrawlJobHandler

public class CrawlJobHandler
This class manages CrawlJobs. Submitted crawl jobs are queued up and run in order when the crawler is running.
Basically this provides a layer between any potential user interface and the CrawlJobs. It keeps the lists of completed jobs, pending jobs, etc.
The jobs managed by the handler can be divided into the following categories:

Pending - Jobs that are ready to run and are waiting their turn. These can be edited, viewed, deleted, etc.

Running - Only one job can be running at a time; there may be no job running. The running job can be viewed and edited to some extent. It can also be terminated. This job should have a StatisticsTracking module attached to it for more details on the crawl.

Completed - Jobs that have finished crawling, or that have been deleted from the pending queue or terminated while running. They cannot be edited but can be viewed. They retain the StatisticsTracking module from their run.

New job - At any given time there can be one 'new job'. The new job is not considered ready to run. It can be edited or discarded (in which case it will be totally destroyed, including any files on disk). Once an operator deems the job ready to run, it can be moved to the pending queue.

Profiles - Jobs under profiles are not actual jobs. They can be edited normally but cannot be submitted to the pending queue. New jobs can be created using a profile as their template.

See Also:
    CrawlJob
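The lifecycle above (new job, pending, running, completed) can be sketched as a minimal state model. The class and method names below are illustrative stand-ins chosen to mirror the handler's documented behavior; they are not the actual CrawlJobHandler API:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Illustrative sketch of the handler's job categories; not the actual
// CrawlJobHandler API. Jobs move: new job -> pending -> running -> completed.
class JobLifecycleSketch {
    final Queue<String> pending = new ArrayDeque<>();
    final List<String> completed = new ArrayList<>();
    String running;   // at most one job runs at a time (may be none)
    String newJob;    // at most one 'new job' being configured

    void createNewJob(String name) { newJob = name; }

    // Operator deems the new job ready: move it to the pending queue.
    void submitNewJob() { pending.add(newJob); newJob = null; }

    // Discarding the new job destroys it entirely.
    void discardNewJob() { newJob = null; }

    // Start the next pending job if nothing is currently running.
    void startNextJob() {
        if (running == null && !pending.isEmpty()) running = pending.poll();
    }

    // A finished (or terminated) job is retained under 'completed'.
    void jobEnded() { completed.add(running); running = null; }
}
```

For example, `createNewJob("crawl-1")` followed by `submitNewJob()` and `startNextJob()` leaves "crawl-1" as the running job; `jobEnded()` then moves it to the completed list, where it can still be viewed but no longer edited.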
Field Summary

static java.lang.String DEFAULT_PROFILE
    Default profile name.
static java.lang.String DEFAULT_PROFILE_NAME
    Name of the system property whose specification overrides the default profile used.
static java.lang.String ORDER_FILE_NAME
static java.lang.String PROFILES_DIR_NAME
    Name of the profiles directory.
static java.lang.String RECOVER_LOG
    String to indicate recovery should be based on the recovery log, not on checkpointing.
Constructor Summary

CrawlJobHandler(java.io.File jobsDir)
    Constructor.
CrawlJobHandler(java.io.File jobsDir, boolean loadJobs, boolean loadProfiles)
    Constructor allowing for optional loading of profiles and jobs.
Method Summary

CrawlJob addJob(CrawlJob job)
    Submit a job to the handler.
void addProfile(CrawlJob profile)
    Add a new profile.
protected void checkDirectory(java.io.File dir)
void checkpointJob()
    Cause the current job to write a checkpoint to disk.
void crawlCheckpoint(java.io.File checkpointDir)
    Called by CrawlController when checkpointing.
void crawlEnded(java.lang.String sExitMessage)
    Called when a CrawlController has ended a crawl and is about to exit.
void crawlEnding(java.lang.String sExitMessage)
    Called when a CrawlController is ending a crawl (for any reason).
void crawlPaused(java.lang.String statusMessage)
    Called when a CrawlController is actually paused (all threads are idle).
void crawlPausing(java.lang.String statusMessage)
    Called when a CrawlController is going to be paused.
void crawlResuming(java.lang.String statusMessage)
    Called when a CrawlController is resuming a crawl that had been paused.
void crawlStarted(java.lang.String message)
    Called on crawl start.
protected CrawlJob createNewJob(java.io.File orderFile, java.lang.String name, java.lang.String description, java.lang.String seeds, int priority)
protected XMLSettingsHandler createSettingsHandler(java.io.File orderFile, java.lang.String name, java.lang.String description, java.lang.String seeds, java.io.File newSettingsDir, CrawlJobErrorHandler errorHandler, java.lang.String filename, java.lang.String seedfile)
    Creates a new settings handler based on an existing job.
void deleteJob(java.lang.String jobUID)
    The specified job will be removed from the pending queue, or aborted if currently running.
void deleteProfile(CrawlJob cj)
long deleteURIsFromPending(java.lang.String regexpr)
    Delete any URIs from the frontier of the current (paused) job that match the specified regular expression.
long deleteURIsFromPending(java.lang.String uriPattern, java.lang.String queuePattern)
    Delete any URIs from the frontier of the current (paused) job that match the specified regular expression.
void discardNewJob()
    Discard the handler's 'new job'.
protected void doFlush()
    If it is a HostQueuesFrontier, it needs to be flushed for the queued URIs.
static CrawlJob ensureNewJobWritten(CrawlJob newJob, java.lang.String metaname, java.lang.String description)
    Ensure the order file with the new name/description is written.
java.util.List<CrawlJob> getCompletedJobs()
CrawlJob getCurrentJob()
CrawlJob getDefaultProfile()
    Returns the default profile.
FrontierMarker getInitialMarker(java.lang.String regexpr, boolean inCacheOnly)
    Returns a URIFrontierMarker for the current, paused, job.
CrawlJob getJob(java.lang.String jobUID)
    Return a job with the given UID.
CrawlJob getNewJob()
    Get the handler's 'new job'.
java.lang.String getNextJobUID()
    Returns a unique job ID.
java.util.List<CrawlJob> getPendingJobs()
    A List of all pending jobs.
java.util.ArrayList getPendingURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose)
    Returns the frontier's URI list based on the provided marker.
java.util.List<CrawlJob> getProfiles()
    Returns a List of all known profiles.
protected java.io.File getStateJobFile(java.io.File jobDir)
    Find the state.job file in the job directory.
void importUri(java.lang.String uri, boolean forceFetch, boolean isSeed)
    Schedule a URI.
void importUri(java.lang.String str, boolean forceFetch, boolean isSeed, boolean isFlush)
    Schedule a URI.
protected int importUris(java.io.InputStream is, java.lang.String style, boolean forceRevisit)
java.lang.String importUris(java.lang.String fileOrUrl, java.lang.String style, boolean forceRevisit)
java.lang.String importUris(java.lang.String file, java.lang.String style, java.lang.String force)
boolean isCrawling()
    Is a crawl job being crawled?
boolean isRunning()
    Is the crawler accepting crawl jobs to run?
void kickUpdate()
    Forward a 'kick' update to the current job, if any.
protected void loadJob(java.io.File job)
    Loads a job given a specific job file.
static java.util.ArrayList<java.lang.String> loadOptions(java.lang.String file)
    Loads options from a file.
protected boolean loadProfile(java.io.File profile)
    Load one profile.
CrawlJob newJob(CrawlJob baseOn, java.lang.String recovery, java.lang.String name, java.lang.String description, java.lang.String seeds, int priority)
    Creates a new job.
CrawlJob newJob(java.io.File orderFile, java.lang.String name, java.lang.String description, java.lang.String seeds)
    Creates a new job.
CrawlJob newProfile(CrawlJob baseOn, java.lang.String name, java.lang.String description, java.lang.String seeds)
    Creates a new profile.
void pauseJob()
    Cause the current job to pause.
void requestCrawlStop()
void resumeJob()
    Cause the current job to resume crawling if it was paused.
void setDefaultProfile(CrawlJob profile)
    Set the default profile.
void startCrawler()
    Allow jobs to be crawled.
protected void startNextJob()
    Start the next crawl job.
protected void startNextJobInternal()
void stop()
void stopCrawler()
    Stop future jobs from being crawled.
boolean terminateCurrentJob()
protected void updateRecoveryPaths(java.io.File recover, SettingsHandler sh, java.lang.String jobName)
Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail

public static final java.lang.String DEFAULT_PROFILE_NAME
public static final java.lang.String DEFAULT_PROFILE
public static final java.lang.String PROFILES_DIR_NAME
public static final java.lang.String ORDER_FILE_NAME
public static final java.lang.String RECOVER_LOG
Constructor Detail

public CrawlJobHandler(java.io.File jobsDir)
    Parameters:
        jobsDir - Jobs directory.

public CrawlJobHandler(java.io.File jobsDir, boolean loadJobs, boolean loadProfiles)
    Parameters:
        jobsDir - Jobs directory.
        loadJobs - If true then any applicable jobs will be loaded.
        loadProfiles - If true then any applicable profiles will be loaded.

Method Detail
protected java.io.File getStateJobFile(java.io.File jobDir)
    Parameters:
        jobDir - Directory to look in.

protected void loadJob(java.io.File job)
    Parameters:
        job - The job file of the job to load.

protected boolean loadProfile(java.io.File profile)
    Parameters:
        profile - Profile to load.
public void addProfile(CrawlJob profile)
    Parameters:
        profile - The new profile.

public void deleteProfile(CrawlJob cj) throws java.io.IOException
    Throws:
        java.io.IOException

public java.util.List<CrawlJob> getProfiles()

public CrawlJob addJob(CrawlJob job)
    Parameters:
        job - A new job for the handler.

public CrawlJob getDefaultProfile()

public void setDefaultProfile(CrawlJob profile)
    Parameters:
        profile - The new default profile. The following must apply to it: profile.isProfile() should return true, and this.getProfiles() should contain it.

public java.util.List<CrawlJob> getPendingJobs()
public CrawlJob getCurrentJob()

public java.util.List<CrawlJob> getCompletedJobs()

public CrawlJob getJob(java.lang.String jobUID)
    Parameters:
        jobUID - The unique ID of the job.

public boolean terminateCurrentJob()

public void deleteJob(java.lang.String jobUID)
    Parameters:
        jobUID - The UID (unique ID) of the job that is to be deleted.

public void pauseJob()

public void resumeJob()

public void checkpointJob() throws java.lang.IllegalStateException
    Throws:
        java.lang.IllegalStateException - Thrown if the crawl is not paused.

public java.lang.String getNextJobUID()
    No two calls to this method (on the same instance of this class) can ever return the same value. Currently implemented to return a time stamp; that is subject to change, though.
    See Also:
        ArchiveUtils.TIMESTAMP17
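The getNextJobUID contract (timestamp-derived, never repeating per handler instance) can be sketched as follows. The 17-digit yyyyMMddHHmmssSSS layout is an assumption made for illustration, inferred from the TIMESTAMP17 name; this is not the Heritrix implementation:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch of a never-repeating, timestamp-derived job UID in the spirit of
// ArchiveUtils.TIMESTAMP17. Illustrative only; not the Heritrix code.
class JobUidSketch {
    private String last = "";

    // 17 digits: yyyyMMddHHmmssSSS (assumed format).
    synchronized String getNextJobUID() {
        String uid = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date());
        // Two calls within the same millisecond would collide, so bump
        // numerically past the previous value to keep UIDs unique.
        if (uid.compareTo(last) <= 0) {
            uid = String.format("%017d", Long.parseLong(last) + 1);
        }
        last = uid;
        return uid;
    }
}
```

The bump-on-collision step is what makes the "no two calls ever return the same value" guarantee hold even when the clock has not advanced between calls.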
public CrawlJob newJob(CrawlJob baseOn, java.lang.String recovery, java.lang.String name, java.lang.String description, java.lang.String seeds, int priority) throws FatalConfigurationException
    Parameters:
        baseOn - A CrawlJob (with a valid settings handler) to use as the template for the new job.
        recovery - Whether to preinitialize the new job as a recovery of the baseOn job. The string holds RECOVER_LOG if the recovery is to be based on the recover.gz log (see RecoveryJournal in the frontier package), or it holds the name of the checkpoint to use for recovery.
        name - The name of the new job.
        description - Description of the job.
        seeds - The contents of the new settings' seed file.
        priority - The priority of the new job.
    Throws:
        FatalConfigurationException - If a problem occurs creating the settings.

public CrawlJob newJob(java.io.File orderFile, java.lang.String name, java.lang.String description, java.lang.String seeds) throws FatalConfigurationException
    Parameters:
        orderFile - Order file to use as the template for the new job.
        name - The name of the new job.
        description - Description of the job.
        seeds - The contents of the new settings' seed file.
    Throws:
        FatalConfigurationException - If a problem occurs creating the settings.

protected void checkDirectory(java.io.File dir) throws FatalConfigurationException
    Throws:
        FatalConfigurationException

protected CrawlJob createNewJob(java.io.File orderFile, java.lang.String name, java.lang.String description, java.lang.String seeds, int priority) throws FatalConfigurationException
    Throws:
        FatalConfigurationException
public CrawlJob newProfile(CrawlJob baseOn, java.lang.String name, java.lang.String description, java.lang.String seeds) throws FatalConfigurationException, java.io.IOException
    Parameters:
        baseOn - A CrawlJob (with a valid settings handler) to use as the template for the new profile.
        name - The name of the new profile.
        description - Description of the new profile.
        seeds - The contents of the new profile's seed file.
    Throws:
        FatalConfigurationException
        java.io.IOException

protected XMLSettingsHandler createSettingsHandler(java.io.File orderFile, java.lang.String name, java.lang.String description, java.lang.String seeds, java.io.File newSettingsDir, CrawlJobErrorHandler errorHandler, java.lang.String filename, java.lang.String seedfile) throws FatalConfigurationException
    Creates a new settings handler based on an existing job.
    Parameters:
        orderFile - Order file to base the new order file on. Cannot be null.
        name - Name for the new settings.
        description - Description of the new settings.
        seeds - The contents of the new settings' seed file.
        newSettingsDir -
        errorHandler -
        filename - Name of the new order file.
        seedfile - Name of the new seeds file.
    Throws:
        FatalConfigurationException - If there are problems reading the 'base on' configuration, or writing the new configuration or its seed file.

protected void updateRecoveryPaths(java.io.File recover, SettingsHandler sh, java.lang.String jobName) throws FatalConfigurationException
    Parameters:
        recover - Source to use when recovering. Can be the full path to a recovery log or the full path to a checkpoint source directory.
        sh - Settings handler to update.
        jobName - Name of this job.
    Throws:
        FatalConfigurationException
public void discardNewJob()
public CrawlJob getNewJob()
public boolean isRunning()
public boolean isCrawling()
public void startCrawler()
public void stopCrawler()
protected final void startNextJob()
protected void startNextJobInternal()
public void kickUpdate()
public static java.util.ArrayList<java.lang.String> loadOptions(java.lang.String file) throws java.io.IOException
    Options are loaded from the CLASSPATH.
    Parameters:
        file - the name of the option file (without path!)
    Throws:
        java.io.IOException - when there is trouble reading the file.

public FrontierMarker getInitialMarker(java.lang.String regexpr, boolean inCacheOnly)
    Parameters:
        regexpr - A regular expression that each URI must match in order to be considered 'within' the marker.
        inCacheOnly - Limit marker scope to 'cached' URIs.
    See Also:
        getPendingURIsList(FrontierMarker, int, boolean), Frontier.getInitialMarker(String, boolean), FrontierMarker
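The marker pattern behind getInitialMarker and getPendingURIsList (obtain a marker once, then repeatedly fetch up to numberOfMatches URIs from where the previous page left off) can be modeled with a simple cursor. The class names below are illustrative stand-ins, not the actual FrontierMarker API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative cursor-based paging over a frontier's pending URIs,
// mirroring the getInitialMarker/getPendingURIsList pattern.
// Not the actual FrontierMarker API.
class MarkerSketch {
    final Pattern pattern;  // URIs must match to be 'within' the marker
    int position = 0;       // where the previous page ended

    MarkerSketch(String regexpr) { this.pattern = Pattern.compile(regexpr); }
}

class FrontierSketch {
    final List<String> pendingUris;

    FrontierSketch(List<String> uris) { this.pendingUris = uris; }

    MarkerSketch getInitialMarker(String regexpr) { return new MarkerSketch(regexpr); }

    // Return up to numberOfMatches matching URIs, advancing the marker so the
    // next call resumes where this one stopped.
    List<String> getPendingURIsList(MarkerSketch marker, int numberOfMatches) {
        List<String> page = new ArrayList<>();
        while (marker.position < pendingUris.size() && page.size() < numberOfMatches) {
            String uri = pendingUris.get(marker.position++);
            if (marker.pattern.matcher(uri).matches()) page.add(uri);
        }
        return page;
    }
}
```

A marker that has fallen out of sync with the frontier is what the real API reports via InvalidFrontierMarkerException; in this sketch the cursor simply runs off the end of the list.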
public java.util.ArrayList getPendingURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose) throws InvalidFrontierMarkerException
    Parameters:
        marker - URIFrontier marker
        numberOfMatches - maximum number of matches to return
        verbose - should detailed info be provided on each URI?
    Throws:
        InvalidFrontierMarkerException - When the marker is inconsistent with the current state of the frontier.
    See Also:
        getInitialMarker(String, boolean), FrontierMarker

public long deleteURIsFromPending(java.lang.String regexpr)
    Parameters:
        regexpr - Regular expression to delete URIs by.

public long deleteURIsFromPending(java.lang.String uriPattern, java.lang.String queuePattern)
    Parameters:
        uriPattern - Regular expression to delete URIs by.
        queuePattern - Regular expression of target queues (or null for all).
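The two-pattern overload deletes pending URIs whose URI matches uriPattern, restricted to queues matching queuePattern (null meaning all queues). A self-contained sketch of that filtering over an assumed queue-to-URIs map (the real frontier's internal structures differ):

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Sketch of deleteURIsFromPending(uriPattern, queuePattern) semantics over a
// simple queue -> pending-URIs map; illustrative, not the Heritrix frontier.
class PendingDeleteSketch {
    final Map<String, List<String>> queues = new HashMap<>();

    // Returns how many URIs were deleted.
    long deleteURIsFromPending(String uriPattern, String queuePattern) {
        Pattern uriRe = Pattern.compile(uriPattern);
        Pattern queueRe = queuePattern == null ? null : Pattern.compile(queuePattern);
        long deleted = 0;
        for (Map.Entry<String, List<String>> e : queues.entrySet()) {
            // null queuePattern means: apply to every queue.
            if (queueRe != null && !queueRe.matcher(e.getKey()).matches()) continue;
            Iterator<String> it = e.getValue().iterator();
            while (it.hasNext()) {
                if (uriRe.matcher(it.next()).matches()) { it.remove(); deleted++; }
            }
        }
        return deleted;
    }
}
```

Note that, as documented, this operation only applies to the current job while it is paused.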
public java.lang.String importUris(java.lang.String file, java.lang.String style, java.lang.String force)

public java.lang.String importUris(java.lang.String fileOrUrl, java.lang.String style, boolean forceRevisit)
    Parameters:
        fileOrUrl - Name of file w/ seeds.
        style - What style of seeds -- crawl log (crawlLog style), recovery journal (recoveryJournal style), or seeds file style (pass default style).
        forceRevisit - Should we revisit even if seen before?

protected int importUris(java.io.InputStream is, java.lang.String style, boolean forceRevisit)

public void importUri(java.lang.String uri, boolean forceFetch, boolean isSeed) throws org.apache.commons.httpclient.URIException
    Parameters:
        uri - URI to schedule.
        forceFetch - Should it be force-fetched?
        isSeed - True if seed.
    Throws:
        org.apache.commons.httpclient.URIException

public void importUri(java.lang.String str, boolean forceFetch, boolean isSeed, boolean isFlush) throws org.apache.commons.httpclient.URIException
    Parameters:
        str - String that can be: 1. a UURI, 2. a snippet of a crawl.log line, or 3. a snippet from a recover log. See importUris(InputStream, String, boolean) for how it subparses the lines from crawl.log and recover.log.
        forceFetch - Should it be force-fetched?
        isSeed - True if seed.
        isFlush - If true, flush the frontier IF it implements flushing.
    Throws:
        org.apache.commons.httpclient.URIException
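Since importUri accepts either a bare URI or a snippet of a crawl.log or recover-log line, some subparsing is needed to find the schedulable URI. A hedged sketch of the idea: scan whitespace-separated tokens and take the first that looks like an absolute URI. The real parsing in importUris(InputStream, String, boolean) is style-aware and more involved; this heuristic is an assumption for illustration only:

```java
// Hedged sketch of pulling a schedulable URI out of a bare URI or a
// log-line snippet; the actual crawl.log/recover.log parsing differs.
class UriSnippetSketch {
    static String extractUri(String str) {
        for (String token : str.trim().split("\\s+")) {
            // Take the first token that looks like an absolute URI
            // (scheme://rest, per the generic URI syntax).
            if (token.matches("[a-zA-Z][a-zA-Z0-9+.-]*://\\S+")) return token;
        }
        throw new IllegalArgumentException("No URI found in: " + str);
    }
}
```

A bare URI passes through unchanged, while a log-style line ("timestamp status size http://… path") yields just the URI token.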
protected void doFlush()

public void stop()

public void requestCrawlStop()

public static CrawlJob ensureNewJobWritten(CrawlJob newJob, java.lang.String metaname, java.lang.String description)
    Ensure the order file with the new name/description is written.
    Parameters:
        newJob - Newly created job.
        metaname - Metaname for the new job.
        description - Description for the new job.
    Returns:
        newJob
public void crawlStarted(java.lang.String message)
    Specified by:
        crawlStarted in interface CrawlStatusListener
    Parameters:
        message - Start message.

public void crawlEnding(java.lang.String sExitMessage)
    Specified by:
        crawlEnding in interface CrawlStatusListener
    Parameters:
        sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
    See Also:
        CrawlJob

public void crawlEnded(java.lang.String sExitMessage)
    Specified by:
        crawlEnded in interface CrawlStatusListener
    Parameters:
        sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
    See Also:
        CrawlJob

public void crawlPausing(java.lang.String statusMessage)
    Specified by:
        crawlPausing in interface CrawlStatusListener
    Parameters:
        statusMessage - Should be STATUS_WAITING_FOR_PAUSE. Passed for convenience.

public void crawlPaused(java.lang.String statusMessage)
    Specified by:
        crawlPaused in interface CrawlStatusListener
    Parameters:
        statusMessage - Should be CrawlJob.STATUS_PAUSED. Passed for convenience.

public void crawlResuming(java.lang.String statusMessage)
    Specified by:
        crawlResuming in interface CrawlStatusListener
    Parameters:
        statusMessage - Should be CrawlJob.STATUS_RUNNING. Passed for convenience.

public void crawlCheckpoint(java.io.File checkpointDir) throws java.lang.Exception
    Called by CrawlController when checkpointing.
    Specified by:
        crawlCheckpoint in interface CrawlStatusListener
    Parameters:
        checkpointDir - Checkpoint dir. Write checkpoint state here.
    Throws:
        java.lang.Exception - A fatal exception. Any exceptions that are let out of this checkpoint are assumed fatal and terminate further checkpoint processing.
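The crawlCheckpoint contract (write state into the supplied directory; let any exception escape, since the caller treats it as fatal) can be sketched as follows. The state file name and contents here are assumptions for illustration; this is not the Heritrix implementation:

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

// Sketch of a crawlCheckpoint-style callback: persist some state into the
// supplied checkpoint directory. Any exception is simply propagated, since
// the caller treats escaped exceptions as fatal to the checkpoint.
// Illustrative only; not the Heritrix code.
class CheckpointSketch {
    void crawlCheckpoint(File checkpointDir) throws IOException {
        if (!checkpointDir.isDirectory() && !checkpointDir.mkdirs()) {
            throw new IOException("Cannot create " + checkpointDir);
        }
        // Assumed file name and contents, chosen for the sketch.
        try (FileWriter w = new FileWriter(new File(checkpointDir, "handler-state.txt"))) {
            w.write("status=CHECKPOINTED\n");
        }
    }
}
```

Because nothing is caught here, an I/O failure surfaces to the checkpointing CrawlController, matching the documented "assumed fatal" behavior.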