java.lang.Object org.archive.crawler.framework.CrawlController
public class CrawlController
CrawlController collects all the classes which cooperate to perform a crawl and provides a high-level interface to the running crawl. As the "global context" for a crawl, subcomponents will often reach each other through the CrawlController.
Field Summary | |
---|---|
static java.lang.Object |
CHECKPOINTING
|
static java.lang.String |
CURRENT_LOG_SUFFIX
suffix to use on active logs |
static java.lang.Object |
FINISHED
|
java.util.logging.Logger |
localErrors
This logger is for job-scoped logging, specifically errors which happen and are handled within a particular processor. |
static java.lang.String |
LOGNAME_CRAWL
|
static java.lang.String |
LOGNAME_LOCAL_ERRORS
|
static java.lang.String |
LOGNAME_PROGRESS_STATISTICS
|
static java.lang.String |
LOGNAME_RUNTIME_ERRORS
|
static java.lang.String |
LOGNAME_URI_ERRORS
|
static char |
MANIFEST_CONFIG_FILE
Abbreviation label for config files in manifest. |
static char |
MANIFEST_LOG_FILE
Abbreviation label for log files in manifest. |
static java.lang.String |
MANIFEST_REPORT
|
static char |
MANIFEST_REPORT_FILE
Abbreviation label for report files in manifest. |
static java.lang.Object |
NASCENT
|
static java.lang.Object |
PAUSED
|
static java.lang.Object |
PAUSING
|
static java.lang.Object |
PREPARING
|
static java.lang.String |
PROCESSORS_REPORT
|
protected java.util.ArrayList<CrawlURIDispositionListener> |
registeredCrawlURIDispositionListeners
|
java.util.logging.Logger |
reports
Logger to hold job summary report. |
protected static java.lang.String[] |
REPORTS
|
static java.lang.Object |
RUNNING
|
java.util.logging.Logger |
runtimeErrors
This logger contains unexpected runtime errors. |
static java.lang.Object |
STARTED
|
protected StatisticsTracking |
statistics
|
static java.lang.Object |
STOPPING
|
java.util.logging.Logger |
uriErrors
Special log for URI format problems, wherever they may occur. |
java.util.logging.Logger |
uriProcessing
Crawl progress logger. |
Constructor Summary | |
---|---|
CrawlController()
Default constructor |
Method Summary | ||
---|---|---|
void |
acquireContinuePermission()
Proceed only if allowed, giving CrawlController a chance to enforce single-thread mode. |
|
void |
addCrawlStatusListener(CrawlStatusListener cl)
Register for CrawlStatus events. |
|
void |
addCrawlURIDispositionListener(CrawlURIDispositionListener cl)
Register for CrawlURIDisposition events. |
|
void |
addOrderToManifest()
Add order file contents to manifest. |
|
void |
addToManifest(java.lang.String file,
char type,
boolean bundle)
Add a file to the manifest of files used/generated by the current crawl. |
|
boolean |
atFinish()
Evaluate if the crawl should stop because it is finished, without actually stopping the crawl. |
|
void |
beginCrawlStop()
Start the process of stopping the crawl. |
|
void |
checkFinish()
Evaluate if the crawl should stop because it is finished. |
|
(package private) void |
checkpoint()
Run checkpointing. |
|
protected void |
checkpointBdb(java.io.File checkpointDir)
Checkpoint bdb. |
|
protected void |
checkpointBigMaps(java.io.File cpDir)
|
|
void |
closeLogFiles()
Close all log files and remove handlers from loggers. |
|
(package private) void |
completePause()
|
|
protected void |
completeStop()
Called when the last ToeThread exits. |
|
protected FatalConfigurationException |
convertToFatalConfigurationException(java.lang.Exception e)
|
|
protected void |
copySettings(java.io.File checkpointDir)
Copy off the settings. |
|
void |
fireCrawledURIDisregardEvent(CrawlURI curi)
Allows an external class to raise a CrawlURIDisposition crawledURIDisregard event that will be broadcast to all listeners that have registered with the CrawlController. |
|
void |
fireCrawledURIFailureEvent(CrawlURI curi)
Allows an external class to raise a CrawlURIDisposition crawledURIFailure event that will be broadcast to all listeners that have registered with the CrawlController. |
|
void |
fireCrawledURINeedRetryEvent(CrawlURI curi)
Allows an external class to raise a CrawlURIDisposition crawledURINeedRetry event that will be broadcast to all listeners that have registered with the CrawlController. |
|
void |
fireCrawledURISuccessfulEvent(CrawlURI curi)
Allows an external class to raise a CrawlURIDisposition crawledURISuccessful event that will be broadcast to all listeners that have registered with the CrawlController. |
|
void |
freeReserveMemory()
|
|
int |
getActiveToeCount()
|
|
EnhancedEnvironment |
getBdbEnvironment()
|
|
protected java.lang.String |
getBdbLogFileName(long index)
|
|
<V> ObjectIdentityCache<java.lang.String,V> |
getBigMap(java.lang.String dbName,
java.lang.Class<? super V> valueClass)
Call this method to get instance of the crawler BigMap implementation. |
|
protected <V> CachedBdbMap<java.lang.String,V> |
getCBM(java.lang.String dbName,
java.lang.Class<? super V> valueClass)
Deprecated. |
|
protected boolean |
getCheckpointCopyBdbjeLogs()
|
|
Checkpoint |
getCheckpointRecover()
Get recover checkpoint. |
|
static Checkpoint |
getCheckpointRecover(CrawlOrder order)
|
|
java.io.File |
getCheckpointsDisk()
|
|
com.sleepycat.bind.serial.StoredClassCatalog |
getClassCatalog()
Deprecated. use EnhancedEnvironment's getClassCatalog() instead |
|
java.io.File |
getDisk()
Get the 'working' directory of the current crawl. |
|
ProcessorChain |
getFirstProcessorChain()
Get the first processor chain. |
|
Frontier |
getFrontier()
|
|
java.io.File |
getLogsDir()
|
|
java.util.concurrent.atomic.AtomicInteger |
getLoopingToes()
|
|
protected <K,V> ObjectIdentityBdbCache<V> |
getOIBC(java.lang.String dbName,
java.lang.Class<? super V> valueClass)
Implement 'big map' with ObjectIdentityBdbCache. |
|
CrawlOrder |
getOrder()
|
|
ProcessorChain |
getPostprocessorChain()
Get the postprocessor chain. |
|
ProcessorChainList |
getProcessorChainList()
Get the list of processor chains. |
|
java.lang.String[] |
getReports()
Get an array of report names offered by this Reporter. |
|
CrawlScope |
getScope()
|
|
java.io.File |
getScratchDisk()
|
|
ServerCache |
getServerCache()
|
|
java.io.File |
getSettingsDir(java.lang.String key)
Return fullpath to the directory named by key
in settings. |
|
SettingsHandler |
getSettingsHandler()
|
|
java.lang.Object |
getState()
|
|
java.io.File |
getStateDisk()
|
|
StatisticsTracking |
getStatistics()
|
|
int |
getToeCount()
|
|
ToePool |
getToePool()
|
|
void |
initialize(SettingsHandler sH)
Starting from nothing, set up CrawlController and associated classes to be ready for a first crawl. |
|
void |
installThreadContextSettingsHandler()
Utility method to install this crawl's SettingsHandler into the 'global' (for this thread) holder, so that any subsequent deserialization operations in this thread can find it. |
|
boolean |
isCheckpointing()
|
|
boolean |
isCheckpointRecover()
|
|
static boolean |
isCheckpointRecover(CrawlOrder order)
|
|
boolean |
isPaused()
Tell if the controller is paused |
|
boolean |
isPausing()
|
|
boolean |
isRunning()
|
|
void |
kickUpdate()
While many settings will update automatically when the SettingsHandler is modified, some settings need to be explicitly changed to reflect new settings. |
|
void |
killThread(int threadNumber,
boolean replace)
Kills a thread. |
|
void |
logProgressStatistics(java.lang.String msg)
Log to the progress statistics log. |
|
void |
logUriError(org.apache.commons.httpclient.URIException e,
UURI u,
java.lang.CharSequence l)
Log a URIException from deep inside other components to the crawl's shared log. |
|
void |
multiThreadMode()
Go back to regular multi-thread mode, where all ToeThreads may proceed at once. |
|
java.lang.String |
oneLineReportThreads()
|
|
protected void |
processBdbLogs(java.io.File checkpointDir,
java.lang.String lastBdbCheckpointLog)
|
|
void |
progressStatisticsEvent(java.util.EventObject e)
Called whenever a progress statistics logging event occurs. |
|
void |
releaseContinuePermission()
Relinquish continue permission at end of processing (allowing another thread to proceed if in single-thread mode). |
|
protected void |
reportManifestTo(java.io.PrintWriter writer)
|
|
protected void |
reportProcessorsTo(java.io.PrintWriter writer)
Compiles and returns a human readable report on the active processors. |
|
void |
reportTo(java.io.PrintWriter writer)
Make a default report to the passed-in Writer. |
|
void |
reportTo(java.lang.String name,
java.io.PrintWriter writer)
Make a report of the given name to the passed-in Writer. If the name is null, give the default report. |
|
void |
requestCrawlCheckpoint()
Request a checkpoint. |
|
void |
requestCrawlPause()
Stop the crawl temporarily. |
|
void |
requestCrawlResume()
Resume crawl from paused state |
|
void |
requestCrawlStart()
Operator requested crawl begin |
|
void |
requestCrawlStop()
Operator requested that the crawl stop. |
|
void |
requestCrawlStop(java.lang.String message)
Operator requested that the crawl stop. |
|
protected void |
restoreStatisticsTracker(MapType loggers,
java.lang.String replaceName)
|
|
protected void |
rotateLogFiles(java.lang.String generationSuffix)
|
|
protected void |
runFrontierRecover(java.lang.String recoverPath)
|
|
protected void |
sendCheckpointEvent(java.io.File checkpointDir)
Send the checkpoint event. |
|
protected void |
sendCrawlStateChangeEvent(java.lang.Object newState,
java.lang.String message)
Send crawl change event to all listeners. |
|
protected void |
setBdbjeBkgrdThreads(com.sleepycat.je.EnvironmentConfig config,
java.util.List threads,
java.lang.String setting)
|
|
void |
setOrder(CrawlOrder o)
|
|
protected void |
setupCheckpointRecover()
Does setup of checkpoint recover. |
|
java.lang.String |
singleLineLegend()
Return a legend for the single-line summary report as a String. |
|
java.lang.String |
singleLineReport()
Return a short single-line summary report as a String. |
|
void |
singleLineReportTo(java.io.PrintWriter writer)
Make a single-line summary report to the passed-in writer |
|
void |
singleThreadMode()
Go to single thread mode, where only one ToeThread may proceed at a time. |
|
void |
toeEnded()
Note that a ToeThread ended, possibly completing the crawl-stop. |
|
void |
toePaused()
Note that a ToeThread reached paused condition, possibly completing the crawl-pause. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final char MANIFEST_CONFIG_FILE
public static final char MANIFEST_REPORT_FILE
public static final char MANIFEST_LOG_FILE
public static final java.lang.String LOGNAME_PROGRESS_STATISTICS
public static final java.lang.String LOGNAME_URI_ERRORS
public static final java.lang.String LOGNAME_RUNTIME_ERRORS
public static final java.lang.String LOGNAME_LOCAL_ERRORS
public static final java.lang.String LOGNAME_CRAWL
public static final java.lang.Object NASCENT
public static final java.lang.Object RUNNING
public static final java.lang.Object PAUSED
public static final java.lang.Object PAUSING
public static final java.lang.Object CHECKPOINTING
public static final java.lang.Object STOPPING
public static final java.lang.Object FINISHED
public static final java.lang.Object STARTED
public static final java.lang.Object PREPARING
public static final java.lang.String CURRENT_LOG_SUFFIX
public transient java.util.logging.Logger uriProcessing
public transient java.util.logging.Logger runtimeErrors
public transient java.util.logging.Logger localErrors
public transient java.util.logging.Logger uriErrors
public transient java.util.logging.Logger reports
protected StatisticsTracking statistics
protected transient java.util.ArrayList<CrawlURIDispositionListener> registeredCrawlURIDispositionListeners
public static final java.lang.String PROCESSORS_REPORT
public static final java.lang.String MANIFEST_REPORT
protected static final java.lang.String[] REPORTS
Constructor Detail |
---|
public CrawlController()
Method Detail |
---|
public void initialize(SettingsHandler sH) throws InitializationException
sH
- Settings handler.
InitializationException
public void installThreadContextSettingsHandler()
sH
-
protected void setupCheckpointRecover() throws java.io.IOException
java.io.IOException
protected boolean getCheckpointCopyBdbjeLogs()
public EnhancedEnvironment getBdbEnvironment()
public com.sleepycat.bind.serial.StoredClassCatalog getClassCatalog()
public void addCrawlStatusListener(CrawlStatusListener cl)
cl
- a class implementing the CrawlStatusListener interface
See Also: CrawlStatusListener
public void addCrawlURIDispositionListener(CrawlURIDispositionListener cl)
cl
- a class implementing the CrawlURIDispositionListener interface
See Also: CrawlURIDispositionListener
public void fireCrawledURISuccessfulEvent(CrawlURI curi)
curi
- The CrawlURI that will be sent with the event notification.
See Also: CrawlURIDispositionListener.crawledURISuccessful(CrawlURI)
public void fireCrawledURINeedRetryEvent(CrawlURI curi)
curi
- The CrawlURI that will be sent with the event notification.
See Also: CrawlURIDispositionListener.crawledURINeedRetry(CrawlURI)
public void fireCrawledURIDisregardEvent(CrawlURI curi)
curi
- The CrawlURI that will be sent with the event notification.
See Also: CrawlURIDispositionListener.crawledURIDisregard(CrawlURI)
public void fireCrawledURIFailureEvent(CrawlURI curi)
curi
- The CrawlURI that will be sent with the event notification.
See Also: CrawlURIDispositionListener.crawledURIFailure(CrawlURI)
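The register-then-broadcast pattern described for the four `fireCrawledURI...Event` methods can be sketched as follows. This is a minimal illustration, not the CrawlController implementation: `SketchUri`, `SketchDispositionListener`, `SketchDispatcher`, and `CountingListener` are hypothetical stand-ins for `CrawlURI`, `CrawlURIDispositionListener`, and the controller, and only the callback names are taken from the documentation above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for CrawlURI; the real class carries far more state.
class SketchUri {
    final String uri;
    SketchUri(String uri) { this.uri = uri; }
}

// Stand-in listener exposing the four callbacks named in the documentation.
interface SketchDispositionListener {
    void crawledURISuccessful(SketchUri curi);
    void crawledURINeedRetry(SketchUri curi);
    void crawledURIDisregard(SketchUri curi);
    void crawledURIFailure(SketchUri curi);
}

// Stand-in controller: keeps registered listeners and broadcasts each event to all of them.
class SketchDispatcher {
    private final List<SketchDispositionListener> listeners = new ArrayList<>();

    synchronized void addCrawlURIDispositionListener(SketchDispositionListener cl) {
        listeners.add(cl);
    }

    synchronized void fireCrawledURISuccessfulEvent(SketchUri curi) {
        for (SketchDispositionListener l : listeners) { l.crawledURISuccessful(curi); }
    }

    synchronized void fireCrawledURINeedRetryEvent(SketchUri curi) {
        for (SketchDispositionListener l : listeners) { l.crawledURINeedRetry(curi); }
    }

    synchronized void fireCrawledURIDisregardEvent(SketchUri curi) {
        for (SketchDispositionListener l : listeners) { l.crawledURIDisregard(curi); }
    }

    synchronized void fireCrawledURIFailureEvent(SketchUri curi) {
        for (SketchDispositionListener l : listeners) { l.crawledURIFailure(curi); }
    }
}

// Example listener that only counts the outcomes it is told about.
class CountingListener implements SketchDispositionListener {
    int successful, needRetry, disregarded, failed;
    public void crawledURISuccessful(SketchUri curi) { successful++; }
    public void crawledURINeedRetry(SketchUri curi) { needRetry++; }
    public void crawledURIDisregard(SketchUri curi) { disregarded++; }
    public void crawledURIFailure(SketchUri curi) { failed++; }
}

class DispositionDemo {
    public static void main(String[] args) {
        SketchDispatcher controller = new SketchDispatcher();
        CountingListener stats = new CountingListener();
        controller.addCrawlURIDispositionListener(stats);
        controller.fireCrawledURISuccessfulEvent(new SketchUri("http://example.com/"));
        controller.fireCrawledURIFailureEvent(new SketchUri("http://example.com/missing"));
        System.out.println("successful=" + stats.successful + " failed=" + stats.failed);
    }
}
```

Every registered listener receives every event, so a listener that cares about only one outcome still has to implement the other three callbacks, typically as no-ops.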
protected void runFrontierRecover(java.lang.String recoverPath) throws javax.management.AttributeNotFoundException, javax.management.MBeanException, javax.management.ReflectionException, FatalConfigurationException
javax.management.AttributeNotFoundException
javax.management.MBeanException
javax.management.ReflectionException
FatalConfigurationException
public java.io.File getLogsDir()
public java.io.File getSettingsDir(java.lang.String key) throws javax.management.AttributeNotFoundException
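Return the full path to the directory named by key in settings. If the directory does not exist, it and all intermediary directories will be created.
key
- Key to use going to settings.
Returns: full path to the directory named by key.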
javax.management.AttributeNotFoundException
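The create-if-missing behavior of `getSettingsDir` can be illustrated with a small sketch. It assumes a plain `Map` as the settings store and resolves relative paths against a working directory; both are simplifications introduced here, and the key `logs-path` is a made-up example, not a documented setting name.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;
import javax.management.AttributeNotFoundException;

class SettingsDirSketch {
    // Hypothetical settings store: key -> directory path (relative paths resolve against disk).
    private final Map<String, String> settings = new HashMap<>();
    private final File disk;

    SettingsDirSketch(File disk) { this.disk = disk; }

    void put(String key, String path) { settings.put(key, path); }

    // Resolve the directory named by key, creating it and any missing parent directories.
    File getSettingsDir(String key) throws AttributeNotFoundException, IOException {
        String path = settings.get(key);
        if (path == null) {
            throw new AttributeNotFoundException("No setting named " + key);
        }
        File dir = new File(path);
        if (!dir.isAbsolute()) {
            dir = new File(disk, path);
        }
        if (!dir.isDirectory() && !dir.mkdirs()) {
            throw new IOException("Could not create directory " + dir);
        }
        return dir.getAbsoluteFile();
    }

    public static void main(String[] args) throws Exception {
        File work = Files.createTempDirectory("crawl-sketch").toFile();
        SettingsDirSketch s = new SettingsDirSketch(work);
        s.put("logs-path", "state/logs");
        File logs = s.getSettingsDir("logs-path");
        System.out.println(logs + " is a directory: " + logs.isDirectory());
    }
}
```

Throwing when the directory cannot be created is a choice made for this sketch; the documentation above does not say how the real method reports that failure.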
protected void restoreStatisticsTracker(MapType loggers, java.lang.String replaceName) throws FatalConfigurationException
FatalConfigurationException
protected FatalConfigurationException convertToFatalConfigurationException(java.lang.Exception e)
protected void rotateLogFiles(java.lang.String generationSuffix) throws java.io.IOException
java.io.IOException
public void closeLogFiles()
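Closing log files and removing handlers from loggers, as `closeLogFiles()` is summarized to do, can be sketched with the standard `java.util.logging` API. The helper below is an illustration under assumptions: the logger names are invented, and `TrackingHandler` is a hypothetical stand-in for whatever file-backed handler the crawl attaches.

```java
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class CloseLogsSketch {
    // Stand-in for a file-backed handler; it only records whether close() was called.
    static class TrackingHandler extends Handler {
        boolean closed = false;
        @Override public void publish(LogRecord record) { }
        @Override public void flush() { }
        @Override public void close() { closed = true; }
    }

    // Close every handler attached to each logger, then detach it from the logger.
    static void closeLogFiles(Logger... loggers) {
        for (Logger logger : loggers) {
            // getHandlers() returns an array snapshot, so removing while iterating is safe.
            for (Handler h : logger.getHandlers()) {
                h.close();
                logger.removeHandler(h);
            }
        }
    }

    public static void main(String[] args) {
        Logger uriErrors = Logger.getLogger("sketch.uri-errors");
        Logger runtimeErrors = Logger.getLogger("sketch.runtime-errors");
        uriErrors.addHandler(new TrackingHandler());
        runtimeErrors.addHandler(new TrackingHandler());
        closeLogFiles(uriErrors, runtimeErrors);
        System.out.println("handlers left: "
                + (uriErrors.getHandlers().length + runtimeErrors.getHandlers().length));
    }
}
```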
public StatisticsTracking getStatistics()
protected void sendCrawlStateChangeEvent(java.lang.Object newState, java.lang.String message)
newState
- State change we're to tell listeners about.
message
- Message on state change.
See Also: sendCheckpointEvent(File) for special case event sending telling listeners to checkpoint.
protected void sendCheckpointEvent(java.io.File checkpointDir) throws java.lang.Exception
Send the checkpoint event. Kept separate from sendCrawlStateChangeEvent(Object, String) because checkpointing throws an Exception (the author did not want to wrap every sendCrawlStateChangeEvent call in try/catch).
checkpointDir
- Where to write checkpoint state to.
java.lang.Exception
public void requestCrawlStart()
protected void completeStop()
void completePause()
public void requestCrawlCheckpoint() throws java.lang.IllegalStateException
java.lang.IllegalStateException
- Thrown if crawl is not in paused state (crawl must first be paused before checkpointing).
public boolean isCheckpointing()
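The rule that `requestCrawlCheckpoint()` throws `IllegalStateException` unless the crawl is paused can be sketched as a small state holder. This is a simplified model, not the controller's code: the sentinel values, the set of transitions, and the move to `CHECKPOINTING` on a successful request are assumptions made for illustration.

```java
class CrawlStateSketch {
    // Sentinel state objects compared by identity, mirroring the static Object fields.
    // The String values are an assumption; the documentation only gives the type Object.
    static final Object NASCENT = "NASCENT";
    static final Object RUNNING = "RUNNING";
    static final Object PAUSING = "PAUSING";
    static final Object PAUSED = "PAUSED";
    static final Object CHECKPOINTING = "CHECKPOINTING";

    private Object state = NASCENT;

    synchronized Object getState() { return state; }

    synchronized void requestCrawlStart() { state = RUNNING; }

    // A pause is requested first and completes later, once worker threads have paused.
    synchronized void requestCrawlPause() {
        if (state == RUNNING) { state = PAUSING; }
    }

    synchronized void completePause() {
        if (state == PAUSING) { state = PAUSED; }
    }

    synchronized void requestCrawlResume() {
        if (state == PAUSED || state == PAUSING) { state = RUNNING; }
    }

    // Checkpointing is only legal from the paused state.
    synchronized void requestCrawlCheckpoint() {
        if (state != PAUSED) {
            throw new IllegalStateException("Crawl must be paused before checkpointing");
        }
        state = CHECKPOINTING;
    }

    synchronized boolean isPaused() { return state == PAUSED; }
    synchronized boolean isCheckpointing() { return state == CHECKPOINTING; }

    public static void main(String[] args) {
        CrawlStateSketch crawl = new CrawlStateSketch();
        crawl.requestCrawlStart();
        try {
            crawl.requestCrawlCheckpoint();
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        crawl.requestCrawlPause();
        crawl.completePause();
        crawl.requestCrawlCheckpoint();
        System.out.println("checkpointing: " + crawl.isCheckpointing());
    }
}
```

A caller would therefore request a pause, wait until `isPaused()` reports true, and only then request the checkpoint.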
void checkpoint() throws java.lang.Exception
Run checkpointing. Modules that want to restore themselves on checkpoint recovery should save state during their CrawlStatusListener.crawlCheckpoint(File) invocation and then, in their #initialize if a module or in their #initialTask if a processor, check with the CrawlController whether this is a checkpoint recovery. If it is, they read in their old state from the indicated checkpoint directory.
Default access: only to be called by Checkpointer.
java.lang.Exception
protected void copySettings(java.io.File checkpointDir) throws java.io.IOException
checkpointDir
- Directory to write checkpoint to.
java.io.IOException
protected void checkpointBdb(java.io.File checkpointDir) throws com.sleepycat.je.DatabaseException, java.io.IOException, java.lang.RuntimeException
    int totalCleaned = 0;
    for (int cleaned = 0; (cleaned = this.bdbEnvironment.cleanLog()) != 0; totalCleaned += cleaned) {
        LOGGER.fine("Cleaned " + cleaned + " log files.");
    }
I also used to do a sync. But, according to Mark Hayes, sync and checkpoint are effectively the same thing, except that sync is not configurable. He suggests doing one or the other:
MS: Reading code, Environment.sync() is a checkpoint. Looks like I don't need to call a checkpoint after calling a sync?
MH: Right, they're almost the same thing -- just do one or the other, not both. With the new API, you'll need to do a checkpoint not a sync, because the sync() method has no config parameter. Don't worry -- it's fine to do a checkpoint even though you're not using.
checkpointDir
- Directory to write checkpoint to.
com.sleepycat.je.DatabaseException
java.io.IOException
java.lang.RuntimeException
- Thrown if setup of the new bdb environment failed.
protected void processBdbLogs(java.io.File checkpointDir, java.lang.String lastBdbCheckpointLog) throws java.io.IOException
java.io.IOException
protected java.lang.String getBdbLogFileName(long index)
protected void setBdbjeBkgrdThreads(com.sleepycat.je.EnvironmentConfig config, java.util.List threads, java.lang.String setting)
public Checkpoint getCheckpointRecover()
isCheckpointRecover()
public static Checkpoint getCheckpointRecover(CrawlOrder order)
public static boolean isCheckpointRecover(CrawlOrder order)
public boolean isCheckpointRecover()
Call getCheckpointRecover() to get at the Checkpoint instance that has info on the checkpoint directory being recovered from.
public void requestCrawlStop()
public void requestCrawlStop(java.lang.String message)
message
-
public void beginCrawlStop()
public void requestCrawlPause()
public boolean isPaused()
public boolean isPausing()
public boolean isRunning()
public void requestCrawlResume()
public int getActiveToeCount()
public CrawlOrder getOrder()
public ServerCache getServerCache()
public void setOrder(CrawlOrder o)
o
-
public Frontier getFrontier()
public CrawlScope getScope()
public ProcessorChainList getProcessorChainList()
public ProcessorChain getFirstProcessorChain()
public ProcessorChain getPostprocessorChain()
public java.io.File getDisk()
public java.io.File getScratchDisk()
public java.io.File getStateDisk()
public int getToeCount()
ToePool.getToeCount()
public ToePool getToePool()
public java.lang.String oneLineReportThreads()
public void kickUpdate()
public SettingsHandler getSettingsHandler()
public void killThread(int threadNumber, boolean replace)
Kills a thread. For details see ToePool.killThread(int, boolean).
threadNumber
- Thread to kill.
replace
- Should thread be replaced.
See Also: ToePool.killThread(int, boolean)
public void addToManifest(java.lang.String file, char type, boolean bundle)
file
- The filename (with absolute path) of the file to add.
type
- The type of the file.
bundle
- Should the file be included in a typical bundling of crawler files.
See Also: MANIFEST_CONFIG_FILE, MANIFEST_LOG_FILE, MANIFEST_REPORT_FILE
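A manifest of the kind `addToManifest(file, type, bundle)` maintains can be sketched as one text line per file. The line layout used here (type character, `+` or `-` for the bundle flag, then the path) and the character values `C`, `R`, and `L` are assumptions made for illustration; the documentation above states only that the type labels are single `char` abbreviations.

```java
class ManifestSketch {
    // Hypothetical one-character labels standing in for MANIFEST_CONFIG_FILE,
    // MANIFEST_REPORT_FILE and MANIFEST_LOG_FILE; the actual values are not documented here.
    static final char CONFIG = 'C';
    static final char REPORT = 'R';
    static final char LOG = 'L';

    private final StringBuilder manifest = new StringBuilder();

    // Append one line: type label, '+' if the file belongs in a bundle or '-' if not, then the path.
    synchronized void addToManifest(String file, char type, boolean bundle) {
        manifest.append(type)
                .append(bundle ? '+' : '-')
                .append(' ')
                .append(file)
                .append('\n');
    }

    // Return the manifest accumulated so far.
    synchronized String report() {
        return manifest.toString();
    }

    public static void main(String[] args) {
        ManifestSketch m = new ManifestSketch();
        m.addToManifest("/crawl/order.xml", CONFIG, true);
        m.addToManifest("/crawl/logs/crawl.log", LOG, false);
        System.out.print(m.report());
    }
}
```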
public void checkFinish()
public boolean atFinish()
public void singleThreadMode()
public void multiThreadMode()
public void acquireContinuePermission()
public void releaseContinuePermission()
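The interplay of `singleThreadMode()`, `multiThreadMode()`, `acquireContinuePermission()`, and `releaseContinuePermission()` can be sketched with a `ReentrantLock` that workers take only while single-thread mode is on. This is one plausible way to get the documented behavior, offered as an assumption rather than a description of the real internals; `ContinueGateSketch` and `holdsPermission()` are names invented here.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

class ContinueGateSketch {
    private final ReentrantLock singleThreadLock = new ReentrantLock();
    private volatile boolean singleThreadMode = false;

    // Switch modes while holding the lock so the change waits for the current permission holder.
    void singleThreadMode() {
        singleThreadLock.lock();
        try { singleThreadMode = true; } finally { singleThreadLock.unlock(); }
    }

    void multiThreadMode() {
        singleThreadLock.lock();
        try { singleThreadMode = false; } finally { singleThreadLock.unlock(); }
    }

    // In multi-thread mode this returns immediately; in single-thread mode it blocks for the lock.
    void acquireContinuePermission() {
        if (singleThreadMode) {
            singleThreadLock.lock();
            if (!singleThreadMode) {
                // Mode changed while waiting, so there is no reason to keep the lock.
                singleThreadLock.unlock();
            }
        }
    }

    // Give up the permission if this thread holds it; harmless otherwise.
    void releaseContinuePermission() {
        while (singleThreadLock.isHeldByCurrentThread()) {
            singleThreadLock.unlock();
        }
    }

    boolean holdsPermission() { return singleThreadLock.isHeldByCurrentThread(); }

    public static void main(String[] args) throws InterruptedException {
        ContinueGateSketch gate = new ContinueGateSketch();
        gate.singleThreadMode();
        AtomicInteger inside = new AtomicInteger();
        AtomicInteger maxInside = new AtomicInteger();
        Runnable worker = () -> {
            for (int i = 0; i < 1000; i++) {
                gate.acquireContinuePermission();
                try {
                    // Track how many workers are in the guarded section at once.
                    maxInside.accumulateAndGet(inside.incrementAndGet(), Math::max);
                    inside.decrementAndGet();
                } finally {
                    gate.releaseContinuePermission();
                }
            }
        };
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(worker);
            threads[i].start();
        }
        for (Thread t : threads) { t.join(); }
        System.out.println("max concurrent workers: " + maxInside.get());
    }
}
```

Pairing each acquire with a release in a `finally` block keeps a failing worker from stalling every other thread while single-thread mode is on.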
public void freeReserveMemory()
public void toePaused()
public void toeEnded()
public void addOrderToManifest()
public void logUriError(org.apache.commons.httpclient.URIException e, UURI u, java.lang.CharSequence l)
e
- URIException encountered.
u
- CrawlURI where problem occurred.
l
- String which could not be interpreted as URI without exception.
public java.lang.String[] getReports()
Reporter
getReports
in interface Reporter
public void reportTo(java.io.PrintWriter writer)
Reporter
reportTo
in interface Reporter
writer
- to receive report
public java.lang.String singleLineReport()
Reporter
singleLineReport
in interface Reporter
public void reportTo(java.lang.String name, java.io.PrintWriter writer)
Reporter
reportTo
in interface Reporter
writer
- to receive report
protected void reportManifestTo(java.io.PrintWriter writer)
writer
- Where to write report to.
protected void reportProcessorsTo(java.io.PrintWriter writer)
writer
- Where to write to.
See Also: Processor.report()
public void singleLineReportTo(java.io.PrintWriter writer)
Reporter
singleLineReportTo
in interface Reporter
writer
- to receive report
public java.lang.String singleLineLegend()
Reporter
singleLineLegend
in interface Reporter
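The Reporter-style methods above (`getReports`, `reportTo`, `singleLineReportTo`, `singleLineReport`) follow a common shape: named reports written to a `PrintWriter`, with a null name selecting the default. The sketch below shows that shape only. The report names, the choice of the processors report as default, the handling of unknown names, and all report text are assumptions invented for illustration.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

class ReporterSketch {
    // Hypothetical report names; the real constant values are not given in this document.
    static final String PROCESSORS_REPORT = "processors";
    static final String MANIFEST_REPORT = "manifest";
    private static final String[] REPORTS = { PROCESSORS_REPORT, MANIFEST_REPORT };

    // Names of the reports this object can produce (defensive copy).
    String[] getReports() { return REPORTS.clone(); }

    // Default report: delegate with a null name.
    void reportTo(PrintWriter writer) { reportTo(null, writer); }

    // Named report; a null name falls back to the default (assumed here to be the processors report).
    void reportTo(String name, PrintWriter writer) {
        if (name == null || PROCESSORS_REPORT.equals(name)) {
            writer.print("processors report");
        } else if (MANIFEST_REPORT.equals(name)) {
            writer.print("manifest report");
        } else {
            writer.print("unknown report: " + name);
        }
        writer.flush();
    }

    // Single-line summary written to a caller-supplied writer.
    void singleLineReportTo(PrintWriter writer) {
        writer.print("state=RUNNING toes=0");
        writer.flush();
    }

    // Convenience wrapper that captures the single-line summary as a String.
    String singleLineReport() {
        StringWriter sw = new StringWriter();
        singleLineReportTo(new PrintWriter(sw));
        return sw.toString();
    }

    public static void main(String[] args) {
        ReporterSketch r = new ReporterSketch();
        PrintWriter out = new PrintWriter(System.out);
        for (String name : r.getReports()) {
            r.reportTo(name, out);
            out.println();
        }
        out.println(r.singleLineReport());
        out.flush();
    }
}
```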
public <V> ObjectIdentityCache<java.lang.String,V> getBigMap(java.lang.String dbName, java.lang.Class<? super V> valueClass) throws java.lang.Exception
dbName
- Name to give any associated database. Also used as part of the name when serializing out the bigmap. Needs to be unique to a crawl.
keyClass
- Class of keys we'll be using.
valueClass
- Class of values we'll be using.
java.lang.Exception
protected <K,V> ObjectIdentityBdbCache<V> getOIBC(java.lang.String dbName, java.lang.Class<? super V> valueClass) throws java.lang.Exception
dbName
- Name to give any associated database. Also used as part of the name when serializing out the bigmap. Needs to be unique to a crawl.
keyClass
- Class of keys we'll be using.
valueClass
- Class of values we'll be using.
java.lang.Exception
protected <V> CachedBdbMap<java.lang.String,V> getCBM(java.lang.String dbName, java.lang.Class<? super V> valueClass) throws java.lang.Exception
dbName
- Name to give any associated database. Also used as part of the name when serializing out the bigmap. Needs to be unique to a crawl.
keyClass
- Class of keys we'll be using.
valueClass
- Class of values we'll be using.
java.lang.Exception
protected void checkpointBigMaps(java.io.File cpDir) throws java.lang.Exception
java.lang.Exception
public void progressStatisticsEvent(java.util.EventObject e)
e
- Progress statistics event.
public void logProgressStatistics(java.lang.String msg)
msg
- Message to write to the progress statistics log.
public java.lang.Object getState()
public java.io.File getCheckpointsDisk()
public java.util.concurrent.atomic.AtomicInteger getLoopingToes()