org.archive.crawler.framework
Class CrawlController

java.lang.Object
  extended by org.archive.crawler.framework.CrawlController
All Implemented Interfaces:
java.io.Serializable, Reporter
Direct Known Subclasses:
CrawlJob.MBeanCrawlController

public class CrawlController
extends java.lang.Object
implements java.io.Serializable, Reporter

CrawlController collects all the classes which cooperate to perform a crawl and provides a high-level interface to the running crawl. As the "global context" for a crawl, subcomponents will often reach each other through the CrawlController.

Author:
Gordon Mohr
See Also:
Serialized Form

Field Summary
static java.lang.Object CHECKPOINTING
           
static java.lang.String CURRENT_LOG_SUFFIX
          suffix to use on active logs
static java.lang.Object FINISHED
           
 java.util.logging.Logger localErrors
          This logger is for job-scoped logging, specifically errors which happen and are handled within a particular processor.
static java.lang.String LOGNAME_CRAWL
           
static java.lang.String LOGNAME_LOCAL_ERRORS
           
static java.lang.String LOGNAME_PROGRESS_STATISTICS
           
static java.lang.String LOGNAME_RUNTIME_ERRORS
           
static java.lang.String LOGNAME_URI_ERRORS
           
static char MANIFEST_CONFIG_FILE
          abbreviation label for config files in manifest
static char MANIFEST_LOG_FILE
          abbreviation label for log files in manifest
static java.lang.String MANIFEST_REPORT
           
static char MANIFEST_REPORT_FILE
          abbreviation label for report files in manifest
static java.lang.Object NASCENT
           
static java.lang.Object PAUSED
           
static java.lang.Object PAUSING
           
static java.lang.Object PREPARING
           
static java.lang.String PROCESSORS_REPORT
           
protected  java.util.ArrayList<CrawlURIDispositionListener> registeredCrawlURIDispositionListeners
           
 java.util.logging.Logger reports
          Logger to hold job summary report.
protected static java.lang.String[] REPORTS
           
static java.lang.Object RUNNING
           
 java.util.logging.Logger runtimeErrors
          This logger contains unexpected runtime errors.
static java.lang.Object STARTED
           
protected  StatisticsTracking statistics
           
static java.lang.Object STOPPING
           
 java.util.logging.Logger uriErrors
          Special log for URI format problems, wherever they may occur.
 java.util.logging.Logger uriProcessing
          Crawl progress logger.
 
Constructor Summary
CrawlController()
          Default constructor
 
Method Summary
 void acquireContinuePermission()
          Proceed only if allowed, giving CrawlController a chance to enforce single-thread mode.
 void addCrawlStatusListener(CrawlStatusListener cl)
          Register for CrawlStatus events.
 void addCrawlURIDispositionListener(CrawlURIDispositionListener cl)
          Register for CrawlURIDisposition events.
 void addOrderToManifest()
          Add order file contents to manifest.
 void addToManifest(java.lang.String file, char type, boolean bundle)
          Add a file to the manifest of files used/generated by the current crawl.
 boolean atFinish()
          Evaluate if the crawl should stop because it is finished, without actually stopping the crawl.
 void beginCrawlStop()
          Start the process of stopping the crawl.
 void checkFinish()
          Evaluate if the crawl should stop because it is finished.
(package private)  void checkpoint()
          Run checkpointing.
protected  void checkpointBdb(java.io.File checkpointDir)
          Checkpoint bdb.
protected  void checkpointBigMaps(java.io.File cpDir)
           
 void closeLogFiles()
          Close all log files and remove handlers from loggers.
(package private)  void completePause()
           
protected  void completeStop()
          Called when the last ToeThread exits.
protected  FatalConfigurationException convertToFatalConfigurationException(java.lang.Exception e)
           
protected  void copySettings(java.io.File checkpointDir)
          Copy off the settings.
 void fireCrawledURIDisregardEvent(CrawlURI curi)
          Allows an external class to raise a CrawlURIDisposition crawledURIDisregard event that will be broadcast to all listeners that have registered with the CrawlController.
 void fireCrawledURIFailureEvent(CrawlURI curi)
          Allows an external class to raise a CrawlURIDisposition crawledURIFailure event that will be broadcast to all listeners that have registered with the CrawlController.
 void fireCrawledURINeedRetryEvent(CrawlURI curi)
          Allows an external class to raise a CrawlURIDisposition crawledURINeedRetry event that will be broadcast to all listeners that have registered with the CrawlController.
 void fireCrawledURISuccessfulEvent(CrawlURI curi)
          Allows an external class to raise a CrawlURIDisposition crawledURISuccessful event that will be broadcast to all listeners that have registered with the CrawlController.
 void freeReserveMemory()
           
 int getActiveToeCount()
           
 EnhancedEnvironment getBdbEnvironment()
           
protected  java.lang.String getBdbLogFileName(long index)
           
<V> ObjectIdentityCache<java.lang.String,V>
getBigMap(java.lang.String dbName, java.lang.Class<? super V> valueClass)
          Call this method to get instance of the crawler BigMap implementation.
protected
<V> CachedBdbMap<java.lang.String,V>
getCBM(java.lang.String dbName, java.lang.Class<? super V> valueClass)
          Deprecated.  
protected  boolean getCheckpointCopyBdbjeLogs()
           
 Checkpoint getCheckpointRecover()
          Get recover checkpoint.
static Checkpoint getCheckpointRecover(CrawlOrder order)
           
 java.io.File getCheckpointsDisk()
           
 com.sleepycat.bind.serial.StoredClassCatalog getClassCatalog()
          Deprecated. use EnhancedEnvironment's getClassCatalog() instead
 java.io.File getDisk()
          Get the 'working' directory of the current crawl.
 ProcessorChain getFirstProcessorChain()
          Get the first processor chain.
 Frontier getFrontier()
           
 java.io.File getLogsDir()
           
 java.util.concurrent.atomic.AtomicInteger getLoopingToes()
           
protected
<K,V> ObjectIdentityBdbCache<V>
getOIBC(java.lang.String dbName, java.lang.Class<? super V> valueClass)
          Implement 'big map' with ObjectIdentityBdbCache.
 CrawlOrder getOrder()
           
 ProcessorChain getPostprocessorChain()
          Get the postprocessor chain.
 ProcessorChainList getProcessorChainList()
          Get the list of processor chains.
 java.lang.String[] getReports()
          Get an array of report names offered by this Reporter.
 CrawlScope getScope()
           
 java.io.File getScratchDisk()
           
 ServerCache getServerCache()
           
 java.io.File getSettingsDir(java.lang.String key)
          Return fullpath to the directory named by key in settings.
 SettingsHandler getSettingsHandler()
           
 java.lang.Object getState()
           
 java.io.File getStateDisk()
           
 StatisticsTracking getStatistics()
           
 int getToeCount()
           
 ToePool getToePool()
           
 void initialize(SettingsHandler sH)
          Starting from nothing, set up CrawlController and associated classes to be ready for a first crawl.
 void installThreadContextSettingsHandler()
          Utility method to install this crawl's SettingsHandler into the 'global' (for this thread) holder, so that any subsequent deserialization operations in this thread can find it.
 boolean isCheckpointing()
           
 boolean isCheckpointRecover()
           
static boolean isCheckpointRecover(CrawlOrder order)
           
 boolean isPaused()
          Tell whether the controller is paused.
 boolean isPausing()
           
 boolean isRunning()
           
 void kickUpdate()
          While many settings will update automatically when the SettingsHandler is modified, some settings need to be explicitly changed to reflect new settings.
 void killThread(int threadNumber, boolean replace)
          Kills a thread.
 void logProgressStatistics(java.lang.String msg)
          Log to the progress statistics log.
 void logUriError(org.apache.commons.httpclient.URIException e, UURI u, java.lang.CharSequence l)
          Log a URIException from deep inside other components to the crawl's shared log.
 void multiThreadMode()
          Go back to regular multi-thread mode, where all ToeThreads may proceed at once.
 java.lang.String oneLineReportThreads()
           
protected  void processBdbLogs(java.io.File checkpointDir, java.lang.String lastBdbCheckpointLog)
           
 void progressStatisticsEvent(java.util.EventObject e)
          Called whenever a progress statistics logging event occurs.
 void releaseContinuePermission()
          Relinquish continue permission at end of processing (allowing another thread to proceed if in single-thread mode).
protected  void reportManifestTo(java.io.PrintWriter writer)
           
protected  void reportProcessorsTo(java.io.PrintWriter writer)
          Compiles a human-readable report on the active processors and writes it to the given writer.
 void reportTo(java.io.PrintWriter writer)
          Make a default report to the passed-in Writer.
 void reportTo(java.lang.String name, java.io.PrintWriter writer)
          Make a report of the given name to the passed-in Writer. If the name is null, give the default report.
 void requestCrawlCheckpoint()
          Request a checkpoint.
 void requestCrawlPause()
          Stop the crawl temporarily.
 void requestCrawlResume()
          Resume crawl from paused state.
 void requestCrawlStart()
          Operator requested that the crawl begin.
 void requestCrawlStop()
          Operator requested that the crawl stop.
 void requestCrawlStop(java.lang.String message)
          Operator requested that the crawl stop.
protected  void restoreStatisticsTracker(MapType loggers, java.lang.String replaceName)
           
protected  void rotateLogFiles(java.lang.String generationSuffix)
           
protected  void runFrontierRecover(java.lang.String recoverPath)
           
protected  void sendCheckpointEvent(java.io.File checkpointDir)
          Send the checkpoint event.
protected  void sendCrawlStateChangeEvent(java.lang.Object newState, java.lang.String message)
          Send crawl change event to all listeners.
protected  void setBdbjeBkgrdThreads(com.sleepycat.je.EnvironmentConfig config, java.util.List threads, java.lang.String setting)
           
 void setOrder(CrawlOrder o)
           
protected  void setupCheckpointRecover()
          Does setup of checkpoint recover.
 java.lang.String singleLineLegend()
          Return a legend for the single-line summary report as a String.
 java.lang.String singleLineReport()
          Return a short single-line summary report as a String.
 void singleLineReportTo(java.io.PrintWriter writer)
          Make a single-line summary report to the passed-in writer.
 void singleThreadMode()
          Go to single thread mode, where only one ToeThread may proceed at a time.
 void toeEnded()
          Note that a ToeThread ended, possibly completing the crawl-stop.
 void toePaused()
          Note that a ToeThread reached paused condition, possibly completing the crawl-pause.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MANIFEST_CONFIG_FILE

public static final char MANIFEST_CONFIG_FILE
abbreviation label for config files in manifest

See Also:
Constant Field Values

MANIFEST_REPORT_FILE

public static final char MANIFEST_REPORT_FILE
abbreviation label for report files in manifest

See Also:
Constant Field Values

MANIFEST_LOG_FILE

public static final char MANIFEST_LOG_FILE
abbreviation label for log files in manifest

See Also:
Constant Field Values

LOGNAME_PROGRESS_STATISTICS

public static final java.lang.String LOGNAME_PROGRESS_STATISTICS
See Also:
Constant Field Values

LOGNAME_URI_ERRORS

public static final java.lang.String LOGNAME_URI_ERRORS
See Also:
Constant Field Values

LOGNAME_RUNTIME_ERRORS

public static final java.lang.String LOGNAME_RUNTIME_ERRORS
See Also:
Constant Field Values

LOGNAME_LOCAL_ERRORS

public static final java.lang.String LOGNAME_LOCAL_ERRORS
See Also:
Constant Field Values

LOGNAME_CRAWL

public static final java.lang.String LOGNAME_CRAWL
See Also:
Constant Field Values

NASCENT

public static final java.lang.Object NASCENT

RUNNING

public static final java.lang.Object RUNNING

PAUSED

public static final java.lang.Object PAUSED

PAUSING

public static final java.lang.Object PAUSING

CHECKPOINTING

public static final java.lang.Object CHECKPOINTING

STOPPING

public static final java.lang.Object STOPPING

FINISHED

public static final java.lang.Object FINISHED

STARTED

public static final java.lang.Object STARTED

PREPARING

public static final java.lang.Object PREPARING

CURRENT_LOG_SUFFIX

public static final java.lang.String CURRENT_LOG_SUFFIX
suffix to use on active logs

See Also:
Constant Field Values

uriProcessing

public transient java.util.logging.Logger uriProcessing
Crawl progress logger. No exceptions. Logs the summary result of processing each URL.


runtimeErrors

public transient java.util.logging.Logger runtimeErrors
This logger contains unexpected runtime errors. These include errors encountered while setting up a job and failures inside processors that the processors are not prepared to recover from.


localErrors

public transient java.util.logging.Logger localErrors
This logger is for job-scoped logging, specifically errors which happen and are handled within a particular processor. Examples would be socket timeouts, exceptions thrown by extractors, etc.


uriErrors

public transient java.util.logging.Logger uriErrors
Special log for URI format problems, wherever they may occur.


reports

public transient java.util.logging.Logger reports
Logger to hold job summary report. Large state reports made at infrequent intervals (e.g. job ending) go here.


statistics

protected StatisticsTracking statistics

registeredCrawlURIDispositionListeners

protected transient java.util.ArrayList<CrawlURIDispositionListener> registeredCrawlURIDispositionListeners

PROCESSORS_REPORT

public static final java.lang.String PROCESSORS_REPORT
See Also:
Constant Field Values

MANIFEST_REPORT

public static final java.lang.String MANIFEST_REPORT
See Also:
Constant Field Values

REPORTS

protected static final java.lang.String[] REPORTS
Constructor Detail

CrawlController

public CrawlController()
Default constructor

Method Detail

initialize

public void initialize(SettingsHandler sH)
                throws InitializationException
Starting from nothing, set up CrawlController and associated classes to be ready for a first crawl.

Parameters:
sH - Settings handler.
Throws:
InitializationException

installThreadContextSettingsHandler

public void installThreadContextSettingsHandler()
Utility method to install this crawl's SettingsHandler into the 'global' (for this thread) holder, so that any subsequent deserialization operations in this thread can find it.

setupCheckpointRecover

protected void setupCheckpointRecover()
                               throws java.io.IOException
Does setup of checkpoint recover. Copies bdb log files into state dir.

Throws:
java.io.IOException

getCheckpointCopyBdbjeLogs

protected boolean getCheckpointCopyBdbjeLogs()

getBdbEnvironment

public EnhancedEnvironment getBdbEnvironment()
Returns:
the shared EnhancedEnvironment

getClassCatalog

public com.sleepycat.bind.serial.StoredClassCatalog getClassCatalog()
Deprecated. use EnhancedEnvironment's getClassCatalog() instead


addCrawlStatusListener

public void addCrawlStatusListener(CrawlStatusListener cl)
Register for CrawlStatus events.

Parameters:
cl - a class implementing the CrawlStatusListener interface
See Also:
CrawlStatusListener

addCrawlURIDispositionListener

public void addCrawlURIDispositionListener(CrawlURIDispositionListener cl)
Register for CrawlURIDisposition events.

Parameters:
cl - a class implementing the CrawlURIDispositionListener interface
See Also:
CrawlURIDispositionListener

fireCrawledURISuccessfulEvent

public void fireCrawledURISuccessfulEvent(CrawlURI curi)
Allows an external class to raise a CrawlURIDisposition crawledURISuccessful event that will be broadcast to all listeners that have registered with the CrawlController.

Parameters:
curi - The CrawlURI that will be sent with the event notification.
See Also:
CrawlURIDispositionListener.crawledURISuccessful(CrawlURI)

fireCrawledURINeedRetryEvent

public void fireCrawledURINeedRetryEvent(CrawlURI curi)
Allows an external class to raise a CrawlURIDisposition crawledURINeedRetry event that will be broadcast to all listeners that have registered with the CrawlController.

Parameters:
curi - The CrawlURI that will be sent with the event notification.
See Also:
CrawlURIDispositionListener.crawledURINeedRetry(CrawlURI)

fireCrawledURIDisregardEvent

public void fireCrawledURIDisregardEvent(CrawlURI curi)
Allows an external class to raise a CrawlURIDisposition crawledURIDisregard event that will be broadcast to all listeners that have registered with the CrawlController.

Parameters:
curi - The CrawlURI that will be sent with the event notification.
See Also:
CrawlURIDispositionListener.crawledURIDisregard(CrawlURI)

fireCrawledURIFailureEvent

public void fireCrawledURIFailureEvent(CrawlURI curi)
Allows an external class to raise a CrawlURIDisposition crawledURIFailure event that will be broadcast to all listeners that have registered with the CrawlController.

Parameters:
curi - The CrawlURI that will be sent with the event notification.
See Also:
CrawlURIDispositionListener.crawledURIFailure(CrawlURI)
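
The four fire methods above follow one broadcast pattern: every registered listener receives the matching callback. The following is a minimal sketch of that pattern, not the Heritrix implementation. DispositionBroadcastSketch, its nested Listener interface (reduced to two callbacks, taking a plain String in place of a CrawlURI) and RecordingListener are hypothetical stand-ins chosen so the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the listener-broadcast pattern; not Heritrix code. */
final class DispositionBroadcastSketch {

    /** Simplified listener: two of the four callbacks, with String in place of CrawlURI. */
    interface Listener {
        void crawledURISuccessful(String uri);
        void crawledURIFailure(String uri);
    }

    /** Example listener that records every callback it receives. */
    static final class RecordingListener implements Listener {
        private final StringBuilder log = new StringBuilder();

        @Override
        public void crawledURISuccessful(String uri) {
            log.append("ok:").append(uri).append(';');
        }

        @Override
        public void crawledURIFailure(String uri) {
            log.append("fail:").append(uri).append(';');
        }

        String log() {
            return log.toString();
        }
    }

    private final List<Listener> listeners = new ArrayList<>();

    /** Register a listener; it receives every event fired afterwards. */
    synchronized void addListener(Listener listener) {
        listeners.add(listener);
    }

    /** Broadcast a success event to all registered listeners, in registration order. */
    synchronized void fireSuccessful(String uri) {
        for (Listener listener : listeners) {
            listener.crawledURISuccessful(uri);
        }
    }

    /** Broadcast a failure event to all registered listeners, in registration order. */
    synchronized void fireFailure(String uri) {
        for (Listener listener : listeners) {
            listener.crawledURIFailure(uri);
        }
    }
}
```

A listener registered late only sees events fired after its registration, which is why registration usually happens before the crawl starts.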

runFrontierRecover

protected void runFrontierRecover(java.lang.String recoverPath)
                           throws javax.management.AttributeNotFoundException,
                                  javax.management.MBeanException,
                                  javax.management.ReflectionException,
                                  FatalConfigurationException
Throws:
javax.management.AttributeNotFoundException
javax.management.MBeanException
javax.management.ReflectionException
FatalConfigurationException

getLogsDir

public java.io.File getLogsDir()
Returns:
The logging directory or null if problem reading the settings.

getSettingsDir

public java.io.File getSettingsDir(java.lang.String key)
                            throws javax.management.AttributeNotFoundException
Return fullpath to the directory named by key in settings. If directory does not exist, it and all intermediary dirs will be created.

Parameters:
key - Key to use going to settings.
Returns:
Full path to directory named by key.
Throws:
javax.management.AttributeNotFoundException
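
The "create it and all intermediary dirs" behaviour described above can be sketched with java.io.File alone. This is a hedged illustration, not the Heritrix code: SettingsDirSketch and ensureDir are hypothetical names, and the sketch takes a path directly instead of looking up a settings key as getSettingsDir does.

```java
import java.io.File;

/** Hypothetical sketch of "create if missing" directory resolution; not Heritrix code. */
final class SettingsDirSketch {

    /**
     * Resolve a path against a base directory and make sure it exists,
     * creating intermediate directories as needed.
     */
    static File ensureDir(File base, String path) {
        File dir = new File(base, path).getAbsoluteFile();
        // mkdirs() returns false when the directory already exists, so re-check afterwards.
        if (!dir.isDirectory() && !dir.mkdirs() && !dir.isDirectory()) {
            throw new IllegalStateException("Could not create " + dir);
        }
        return dir;
    }
}
```

Calling ensureDir twice with the same arguments returns the same absolute path; the second call finds the directory already present and creates nothing.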

restoreStatisticsTracker

protected void restoreStatisticsTracker(MapType loggers,
                                        java.lang.String replaceName)
                                 throws FatalConfigurationException
Throws:
FatalConfigurationException

convertToFatalConfigurationException

protected FatalConfigurationException convertToFatalConfigurationException(java.lang.Exception e)

rotateLogFiles

protected void rotateLogFiles(java.lang.String generationSuffix)
                       throws java.io.IOException
Throws:
java.io.IOException

closeLogFiles

public void closeLogFiles()
Close all log files and remove handlers from loggers.


getStatistics

public StatisticsTracking getStatistics()
Returns:
Object this controller is using to track crawl statistics

sendCrawlStateChangeEvent

protected void sendCrawlStateChangeEvent(java.lang.Object newState,
                                         java.lang.String message)
Send crawl change event to all listeners.

Parameters:
newState - State change to tell listeners about.
message - Message on state change.
See Also:
sendCheckpointEvent(File) for special case event sending telling listeners to checkpoint.

sendCheckpointEvent

protected void sendCheckpointEvent(java.io.File checkpointDir)
                            throws java.lang.Exception
Send the checkpoint event. Has its own method apart from sendCrawlStateChangeEvent(Object, String) because checkpointing throws an Exception, and wrapping every sendCrawlStateChangeEvent call in try/catch was not wanted.

Parameters:
checkpointDir - Where to write checkpoint state to.
Throws:
java.lang.Exception

requestCrawlStart

public void requestCrawlStart()
Operator requested that the crawl begin.


completeStop

protected void completeStop()
Called when the last ToeThread exits.


completePause

void completePause()

requestCrawlCheckpoint

public void requestCrawlCheckpoint()
                            throws java.lang.IllegalStateException
Request a checkpoint. Sets a checkpointing thread running.

Throws:
java.lang.IllegalStateException - Thrown if crawl is not in paused state (Crawl must be first paused before checkpointing).
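
The pause-before-checkpoint rule above can be sketched as a small state guard. This is an illustration under stated assumptions, not the Heritrix implementation: CheckpointGuardSketch is a hypothetical class, and it models states as an enum where CrawlController uses Object constants.

```java
/** Hypothetical stand-in for the pause-before-checkpoint rule; not Heritrix code. */
final class CheckpointGuardSketch {
    enum State { RUNNING, PAUSED, CHECKPOINTING }

    private State state = State.RUNNING;

    State getState() {
        return state;
    }

    /** Simplified pause: the sketch skips the intermediate PAUSING state. */
    void pause() {
        state = State.PAUSED;
    }

    /** Mirrors the documented contract: a checkpoint is only legal from PAUSED. */
    void requestCheckpoint() {
        if (state != State.PAUSED) {
            throw new IllegalStateException(
                    "Crawl must be paused before checkpointing, but state was " + state);
        }
        state = State.CHECKPOINTING;
    }
}
```

A caller therefore pauses first, confirms the paused state, and only then requests the checkpoint.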

isCheckpointing

public boolean isCheckpointing()
Returns:
True if checkpointing.

checkpoint

void checkpoint()
          throws java.lang.Exception
Run checkpointing. CrawlController takes care of managing the checkpointing/serializing of bdb, the StatisticsTracker, and the CheckpointContext. Other modules that want to revive themselves on checkpoint recovery need to save state during their CrawlStatusListener.crawlCheckpoint(File) invocation. Then, in their #initialize if a module, or in their #initialTask if a processor, they should check with the CrawlController whether this is a checkpoint recovery and, if it is, read their old state from the pointed-to checkpoint directory.

Default (package-private) access; to be called only by Checkpointer.

Throws:
java.lang.Exception

copySettings

protected void copySettings(java.io.File checkpointDir)
                     throws java.io.IOException
Copy off the settings.

Parameters:
checkpointDir - Directory to write checkpoint to.
Throws:
java.io.IOException

checkpointBdb

protected void checkpointBdb(java.io.File checkpointDir)
                      throws com.sleepycat.je.DatabaseException,
                             java.io.IOException,
                             java.lang.RuntimeException
Checkpoint bdb. I used to call log cleaning here, as suggested in the je-2.0 javadoc, but it takes far too much time (20 minutes for a crawl of 1 million items). Assume the cleaner is keeping up. Below is the log cleaning loop that was removed:
    int totalCleaned = 0;
    for (int cleaned = 0; (cleaned = this.bdbEnvironment.cleanLog()) != 0;
            totalCleaned += cleaned) {
        LOGGER.fine("Cleaned " + cleaned + " log files.");
    }
 

I also used to do a sync. But, according to Mark Hayes, sync and checkpoint are effectively the same thing, except that sync is not configurable. He suggests doing one or the other:

MS: Reading code, Environment.sync() is a checkpoint. Looks like I don't need to call a checkpoint after calling a sync?

MH: Right, they're almost the same thing -- just do one or the other, not both. With the new API, you'll need to do a checkpoint not a sync, because the sync() method has no config parameter. Don't worry -- it's fine to do a checkpoint even though you're not using.

Parameters:
checkpointDir - Directory to write checkpoint to.
Throws:
com.sleepycat.je.DatabaseException
java.io.IOException
java.lang.RuntimeException - Thrown if failed setup of new bdb environment.

processBdbLogs

protected void processBdbLogs(java.io.File checkpointDir,
                              java.lang.String lastBdbCheckpointLog)
                       throws java.io.IOException
Throws:
java.io.IOException

getBdbLogFileName

protected java.lang.String getBdbLogFileName(long index)

setBdbjeBkgrdThreads

protected void setBdbjeBkgrdThreads(com.sleepycat.je.EnvironmentConfig config,
                                    java.util.List threads,
                                    java.lang.String setting)

getCheckpointRecover

public Checkpoint getCheckpointRecover()
Get recover checkpoint. Returns null if we're NOT in recover mode. Looks at ATTR_RECOVER_PATH and, if it's a directory, assumes checkpoint recover. If in checkpoint recover mode, returns a Checkpoint instance if the checkpoint was VALID (else null).

Returns:
Checkpoint instance if we're in recover checkpoint mode and the pointed-to checkpoint was valid.
See Also:
isCheckpointRecover()

getCheckpointRecover

public static Checkpoint getCheckpointRecover(CrawlOrder order)

isCheckpointRecover

public static boolean isCheckpointRecover(CrawlOrder order)

isCheckpointRecover

public boolean isCheckpointRecover()
Returns:
True if we're in checkpoint recover mode. Call getCheckpointRecover() to get at Checkpoint instance that has info on checkpoint directory being recovered from.

requestCrawlStop

public void requestCrawlStop()
Operator requested that the crawl stop.


requestCrawlStop

public void requestCrawlStop(java.lang.String message)
Operator requested that the crawl stop.

Parameters:
message -

beginCrawlStop

public void beginCrawlStop()
Start the process of stopping the crawl.


requestCrawlPause

public void requestCrawlPause()
Stop the crawl temporarily.


isPaused

public boolean isPaused()
Tell whether the controller is paused.

Returns:
true if paused

isPausing

public boolean isPausing()

isRunning

public boolean isRunning()

requestCrawlResume

public void requestCrawlResume()
Resume crawl from paused state.


getActiveToeCount

public int getActiveToeCount()
Returns:
Active toe thread count.

getOrder

public CrawlOrder getOrder()
Returns:
The order file instance.

getServerCache

public ServerCache getServerCache()
Returns:
The server cache instance.

setOrder

public void setOrder(CrawlOrder o)
Parameters:
o -

getFrontier

public Frontier getFrontier()
Returns:
The frontier.

getScope

public CrawlScope getScope()
Returns:
This crawl scope.

getProcessorChainList

public ProcessorChainList getProcessorChainList()
Get the list of processor chains.

Returns:
the list of processor chains.

getFirstProcessorChain

public ProcessorChain getFirstProcessorChain()
Get the first processor chain.

Returns:
the first processor chain.

getPostprocessorChain

public ProcessorChain getPostprocessorChain()
Get the postprocessor chain.

Returns:
the postprocessor chain.

getDisk

public java.io.File getDisk()
Get the 'working' directory of the current crawl.

Returns:
the 'working' directory of the current crawl.

getScratchDisk

public java.io.File getScratchDisk()
Returns:
Scratch disk location.

getStateDisk

public java.io.File getStateDisk()
Returns:
State disk location.

getToeCount

public int getToeCount()
Returns:
The number of ToeThreads
See Also:
ToePool.getToeCount()

getToePool

public ToePool getToePool()
Returns:
The ToePool

oneLineReportThreads

public java.lang.String oneLineReportThreads()
Returns:
toepool one-line report

kickUpdate

public void kickUpdate()
While many settings will update automatically when the SettingsHandler is modified, some settings need to be explicitly changed to reflect new settings. This includes the number of toe threads and the seeds.


getSettingsHandler

public SettingsHandler getSettingsHandler()
Returns:
The settings handler.

killThread

public void killThread(int threadNumber,
                       boolean replace)
Kills a thread. For details see ToePool.killThread(int, boolean).

Parameters:
threadNumber - Thread to kill.
replace - Should thread be replaced.
See Also:
ToePool.killThread(int, boolean)

addToManifest

public void addToManifest(java.lang.String file,
                          char type,
                          boolean bundle)
Add a file to the manifest of files used/generated by the current crawl. TODO: It's possible for a file to be added twice if reports are force-generated mid-crawl. Fix.

Parameters:
file - The filename (with absolute path) of the file to add
type - The type of the file
bundle - Should the file be included in a typical bundling of crawler files.
See Also:
MANIFEST_CONFIG_FILE, MANIFEST_LOG_FILE, MANIFEST_REPORT_FILE

checkFinish

public void checkFinish()
Evaluate if the crawl should stop because it is finished.


atFinish

public boolean atFinish()
Evaluate if the crawl should stop because it is finished, without actually stopping the crawl.

Returns:
true if crawl is at a finish-possible state

singleThreadMode

public void singleThreadMode()
Go to single thread mode, where only one ToeThread may proceed at a time. Also acquires the single lock, so no further threads will proceed past an acquireContinuePermission. Caller must be sure to release the lock to allow other threads to proceed one at a time.


multiThreadMode

public void multiThreadMode()
Go back to regular multi-thread mode, where all ToeThreads may proceed at once.


acquireContinuePermission

public void acquireContinuePermission()
Proceed only if allowed, giving CrawlController a chance to enforce single-thread mode.


releaseContinuePermission

public void releaseContinuePermission()
Relinquish continue permission at end of processing (allowing another thread to proceed if in single-thread mode).
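
The four methods singleThreadMode, multiThreadMode, acquireContinuePermission and releaseContinuePermission describe a gate that worker threads pass through. One way to build such a gate is with a java.util.concurrent.locks.ReentrantLock, sketched below. This is an assumption-laden illustration rather than the CrawlController source: ContinuePermissionSketch and its holdsPermission helper are hypothetical names.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch of single-thread gating; not the Heritrix implementation. */
final class ContinuePermissionSketch {
    private final ReentrantLock singleThreadLock = new ReentrantLock();
    private volatile boolean singleThreadMode = false;

    /** Enter single-thread mode; the caller returns still holding the lock. */
    void singleThreadMode() {
        singleThreadLock.lock();
        singleThreadMode = true;
    }

    /** Return to multi-thread mode and drop every hold the caller has on the lock. */
    void multiThreadMode() {
        singleThreadLock.lock();
        singleThreadMode = false;
        releaseAllHolds();
    }

    /** Blocks in single-thread mode until the lock is free; returns at once otherwise. */
    void acquireContinuePermission() {
        if (singleThreadMode) {
            singleThreadLock.lock();
            if (!singleThreadMode) {
                // Mode was switched off while this thread waited, so no gate is needed.
                releaseAllHolds();
            }
        }
    }

    /** Give up the permission so the next waiting thread may proceed. */
    void releaseContinuePermission() {
        if (singleThreadMode) {
            releaseAllHolds();
        }
    }

    /** Helper for the example: does the calling thread currently hold the gate? */
    boolean holdsPermission() {
        return singleThreadLock.isHeldByCurrentThread();
    }

    private void releaseAllHolds() {
        while (singleThreadLock.isHeldByCurrentThread()) {
            singleThreadLock.unlock();
        }
    }
}
```

In this sketch a worker brackets each unit of work with acquireContinuePermission and releaseContinuePermission; in multi-thread mode both calls are effectively no-ops.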


freeReserveMemory

public void freeReserveMemory()

toePaused

public void toePaused()
Note that a ToeThread reached paused condition, possibly completing the crawl-pause.
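
"Possibly completing the crawl-pause" suggests that the pause is only complete once the last active thread has reported in. A minimal sketch of that idea follows. It is an assumption about the general pattern, not the Heritrix logic: PauseCompletionSketch, its enum states and its counter are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: the last worker to pause completes the pause. Not Heritrix code. */
final class PauseCompletionSketch {
    enum State { RUNNING, PAUSING, PAUSED }

    private final AtomicInteger activeToes;
    private volatile State state = State.RUNNING;

    PauseCompletionSketch(int toeCount) {
        this.activeToes = new AtomicInteger(toeCount);
    }

    State getState() {
        return state;
    }

    /** Ask for a pause; it completes only once every worker has reported in. */
    void requestPause() {
        state = State.PAUSING;
    }

    /** Called by each worker as it reaches its paused condition. */
    void toePaused() {
        if (activeToes.decrementAndGet() == 0 && state == State.PAUSING) {
            state = State.PAUSED;
        }
    }
}
```

With two workers, the state stays PAUSING after the first toePaused call and becomes PAUSED after the second.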


toeEnded

public void toeEnded()
Note that a ToeThread ended, possibly completing the crawl-stop.


addOrderToManifest

public void addOrderToManifest()
Add order file contents to manifest. Writes configuration files and any files managed by CrawlController to it. Other classes, excluding the settings framework, are responsible for adding the files they manage to the manifest themselves by calling addToManifest. Call before writing out reports.


logUriError

public void logUriError(org.apache.commons.httpclient.URIException e,
                        UURI u,
                        java.lang.CharSequence l)
Log a URIException from deep inside other components to the crawl's shared log.

Parameters:
e - URIException encountered
u - UURI where problem occurred
l - String which could not be interpreted as URI without exception

getReports

public java.lang.String[] getReports()
Description copied from interface: Reporter
Get an array of report names offered by this Reporter. A name in brackets indicates that a free-form String, in accordance with the informal description inside the brackets, may yield a useful report.

Specified by:
getReports in interface Reporter
Returns:
String array of report names, empty if there is only one report type

reportTo

public void reportTo(java.io.PrintWriter writer)
Description copied from interface: Reporter
Make a default report to the passed-in Writer. Should be equivalent to reportTo(null, writer).

Specified by:
reportTo in interface Reporter
Parameters:
writer - to receive report

singleLineReport

public java.lang.String singleLineReport()
Description copied from interface: Reporter
Return a short single-line summary report as a String.

Specified by:
singleLineReport in interface Reporter
Returns:
String single-line summary report

reportTo

public void reportTo(java.lang.String name,
                     java.io.PrintWriter writer)
Description copied from interface: Reporter
Make a report of the given name to the passed-in Writer. If the name is null, give the default report.

Specified by:
reportTo in interface Reporter
Parameters:
name - name of the report to make; null gives the default report
writer - to receive report
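
The relationship between the two reportTo overloads (the one-argument form delegating to the named form with null) can be sketched as follows. This is a hedged illustration, not the CrawlController source: NamedReportSketch, its render helper, the report-name values and the report bodies are placeholders, and the fallback to the default report for unknown names is an assumption.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical sketch of name-based report dispatch; not Heritrix code. */
final class NamedReportSketch {
    // Placeholder report names, chosen for this example only.
    static final String PROCESSORS = "processors";
    static final String MANIFEST = "manifest";

    /** Default report: equivalent to reportTo(null, writer). */
    void reportTo(PrintWriter writer) {
        reportTo(null, writer);
    }

    /** Dispatch on the report name; null selects the default report. */
    void reportTo(String name, PrintWriter writer) {
        if (PROCESSORS.equals(name)) {
            writer.print("processors report");
        } else if (MANIFEST.equals(name)) {
            writer.print("manifest report");
        } else {
            // Covers null, and (by assumption) any unrecognized name.
            writer.print("default report");
        }
        writer.flush();
    }

    /** Convenience for callers that want the report as a String. */
    String render(String name) {
        StringWriter buffer = new StringWriter();
        reportTo(name, new PrintWriter(buffer));
        return buffer.toString();
    }
}
```

Keeping the one-argument overload as a pure delegate means there is a single place where the default report is defined.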

reportManifestTo

protected void reportManifestTo(java.io.PrintWriter writer)
Parameters:
writer - Where to write report to.

reportProcessorsTo

protected void reportProcessorsTo(java.io.PrintWriter writer)
Compiles a human-readable report on the active processors and writes it to the given writer.

Parameters:
writer - Where to write to.
See Also:
Processor.report()

singleLineReportTo

public void singleLineReportTo(java.io.PrintWriter writer)
Description copied from interface: Reporter
Make a single-line summary report to the passed-in writer.

Specified by:
singleLineReportTo in interface Reporter
Parameters:
writer - to receive report

singleLineLegend

public java.lang.String singleLineLegend()
Description copied from interface: Reporter
Return a legend for the single-line summary report as a String.

Specified by:
singleLineLegend in interface Reporter
Returns:
String single-line summary legend

getBigMap

public <V> ObjectIdentityCache<java.lang.String,V> getBigMap(java.lang.String dbName,
                                                             java.lang.Class<? super V> valueClass)
                                                  throws java.lang.Exception
Call this method to get instance of the crawler BigMap implementation. A "BigMap" is a Map that knows how to manage ever-growing sets of key/value pairs. If we're in a checkpoint recovery, this method will manage reinstantiation of checkpointed bigmaps.

Parameters:
dbName - Name to give any associated database. Also used as part of name serializing out bigmap. Needs to be unique to a crawl.
valueClass - Class of values we'll be using.
Returns:
Map that knows how to carry large sets of key/value pairs or if none available, returns instance of HashMap.
Throws:
java.lang.Exception

getOIBC

protected <K,V> ObjectIdentityBdbCache<V> getOIBC(java.lang.String dbName,
                                                  java.lang.Class<? super V> valueClass)
                                     throws java.lang.Exception
Implement 'big map' with ObjectIdentityBdbCache.

Parameters:
dbName - Name to give any associated database. Also used as part of name serializing out bigmap. Needs to be unique to a crawl.
valueClass - Class of values we'll be using.
Returns:
Map that knows how to carry large sets of key/value pairs or if none available, returns instance of HashMap.
Throws:
java.lang.Exception

getCBM

protected <V> CachedBdbMap<java.lang.String,V> getCBM(java.lang.String dbName,
                                                      java.lang.Class<? super V> valueClass)
                                           throws java.lang.Exception
Deprecated. 

Implement 'big map' with CachedBdbMap.

Parameters:
dbName - Name to give any associated database. Also used as part of name serializing out bigmap. Needs to be unique to a crawl.
valueClass - Class of values we'll be using.
Returns:
Map that knows how to carry large sets of key/value pairs or if none available, returns instance of HashMap.
Throws:
java.lang.Exception

checkpointBigMaps

protected void checkpointBigMaps(java.io.File cpDir)
                          throws java.lang.Exception
Throws:
java.lang.Exception

progressStatisticsEvent

public void progressStatisticsEvent(java.util.EventObject e)
Called whenever a progress statistics logging event occurs.

Parameters:
e - Progress statistics event.

logProgressStatistics

public void logProgressStatistics(java.lang.String msg)
Log to the progress statistics log.

Parameters:
msg - Message to write to the progress statistics log.

getState

public java.lang.Object getState()
Returns:
CrawlController state.

getCheckpointsDisk

public java.io.File getCheckpointsDisk()

getLoopingToes

public java.util.concurrent.atomic.AtomicInteger getLoopingToes()


Copyright © 2003-2011 Internet Archive. All Rights Reserved.