org.archive.crawler.fetcher
Class FetchHTTP

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.fetcher.FetchHTTP
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants, FetchStatusCodes, CrawlStatusListener

public class FetchHTTP
extends Processor
implements CoreAttributeConstants, FetchStatusCodes, CrawlStatusListener

HTTP fetcher that uses the Apache Jakarta Commons HttpClient library.

Version:
$Id: FetchHTTP.java 6803 2010-04-02 01:03:46Z gojomo $
Author:
Gordon Mohr, Igor Ranitovic, others
See Also:
Serialized Form

Nested Class Summary
(package private)  class FetchHTTP.PostRestore
           
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_ACCEPT_HEADERS
           
static java.lang.String ATTR_BDB_COOKIES
           
static java.lang.String ATTR_DEFAULT_ENCODING
           
static java.lang.String ATTR_DIGEST_ALGORITHM
           
static java.lang.String ATTR_DIGEST_CONTENT
           
static java.lang.String ATTR_FETCH_BANDWIDTH_MAX
           
static java.lang.String ATTR_HTTP_BIND_ADDRESS
           
static java.lang.String ATTR_HTTP_PROXY_HOST
           
static java.lang.String ATTR_HTTP_PROXY_PORT
           
static java.lang.String ATTR_IGNORE_COOKIES
           
static java.lang.String ATTR_LOAD_COOKIES
           
static java.lang.String ATTR_MAX_LENGTH_BYTES
           
static java.lang.String ATTR_MIDFETCH_DECIDE_RULES
          Rules to apply mid-fetch, just after the response headers are received and before the body download starts.
static java.lang.String ATTR_SAVE_COOKIES
           
static java.lang.String ATTR_SEND_CONNECTION_CLOSE
           
static java.lang.String ATTR_SEND_IF_MODIFIED_SINCE
           
static java.lang.String ATTR_SEND_IF_NONE_MATCH
           
static java.lang.String ATTR_SEND_RANGE
           
static java.lang.String ATTR_SEND_REFERER
           
static java.lang.String ATTR_SOTIMEOUT_MS
           
static java.lang.String ATTR_TIMEOUT_SECONDS
           
static java.lang.String ATTR_TRUST
          SSL trust level setting attribute name.
protected  com.sleepycat.je.Database cookieDb
          Database backing cookie map, if using BDB
static java.lang.String COOKIEDB_NAME
          Name of cookie BDB Database
static java.lang.String DEFAULT_DIGEST_ALGORITHM
          Default algorithm to use for message digesting.
(package private) static java.lang.Boolean DEFAULT_DIGEST_CONTENT
          Default whether to perform on-the-fly digest hashing of content-bodies.
static java.lang.String DESC_DIGEST_ALGORITHM
           
static java.lang.String DESC_DIGEST_CONTENT
           
static java.lang.String[] DIGEST_ALGORITHMS
           
static java.lang.String HTTP_SCHEME
           
static java.lang.String HTTPS_SCHEME
           
static java.lang.String MD5
           
static java.lang.String RANGE
           
static java.lang.String RANGE_PREFIX
           
static java.lang.String REFERER
           
(package private) static java.lang.String SERVER_CACHE_KEY
           
static java.lang.String SHA1
          The digest algorithms available to choose from: currently SHA-1 or MD5.
(package private) static java.lang.String SSL_FACTORY_KEY
           
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
 
Constructor Summary
FetchHTTP(java.lang.String name)
          Constructor.
 
Method Summary
protected  void addResponseContent(org.apache.commons.httpclient.HttpMethod method, CrawlURI curi)
          This method populates curi with response status and content type.
protected  boolean checkMidfetchAbort(CrawlURI curi, HttpRecorderMethod method, org.apache.commons.httpclient.HttpConnection conn)
           
protected  void cleanupHttp()
          Perform any final cleanup related to the HttpClient instance.
protected  void configureHttp()
           
protected  org.apache.commons.httpclient.HostConfiguration configureMethod(CrawlURI curi, org.apache.commons.httpclient.HttpMethod method)
          Configure the HttpMethod setting options and headers.
 void crawlCheckpoint(java.io.File checkpointDir)
          Called by CrawlController when checkpointing.
 void crawlEnded(java.lang.String sExitMessage)
          Called when a CrawlController has ended a crawl and is about to exit.
 void crawlEnding(java.lang.String sExitMessage)
          Called when a CrawlController is ending a crawl (for any reason)
 void crawlPaused(java.lang.String statusMessage)
          Called when a CrawlController is actually paused (all threads are idle).
 void crawlPausing(java.lang.String statusMessage)
          Called when a CrawlController is going to be paused.
 void crawlResuming(java.lang.String statusMessage)
          Called when a CrawlController is resuming a crawl that had been paused.
 void crawlStarted(java.lang.String message)
          Called on crawl start.
protected  void doAbort(CrawlURI curi, org.apache.commons.httpclient.HttpMethod method, java.lang.String annotation)
           
 void finalTasks()
          Classes subclassing this one should override this method to perform processor specific actions.
protected  java.lang.Object getAttributeEither(CrawlURI curi, java.lang.String key)
          Get a value either from inside the CrawlURI instance, or from settings (module attributes).
protected  org.apache.commons.httpclient.auth.AuthScheme getAuthScheme(org.apache.commons.httpclient.HttpMethod method, CrawlURI curi)
           
protected  org.apache.commons.httpclient.HttpClient getHttp()
           
protected  DecideRule getMidfetchRule(java.lang.Object o)
           
protected  void handle401(org.apache.commons.httpclient.HttpMethod method, CrawlURI curi)
          Server is looking for basic/digest auth credentials (RFC2617).
 void initialTasks()
          Classes subclassing this one should override this method to perform processor specific actions.
protected  void innerProcess(CrawlURI curi)
          Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.
protected  void listUsedFiles(java.util.List<java.lang.String> list)
          Those Modules that use files on disk should list them all when this method is called.
 void loadCookies()
          Load cookies from the file specified in the order file.
 void loadCookies(java.lang.String cookiesFile)
          Load cookies from a file before the first fetch.
 java.lang.String report()
          Compiles and returns a report (in human readable form) about the status of the processor.
 void saveCookies()
          Saves cookies to the file specified in the order file.
 void saveCookies(java.lang.String saveCookiesFile)
          Saves cookies to a file.
protected  void setConditionalGetHeader(CrawlURI curi, org.apache.commons.httpclient.HttpMethod method, java.lang.String setting, java.lang.String sourceHeader, java.lang.String targetHeader)
          Set the given conditional-GET header, if the setting is enabled and a suitable value is available in the URI history.
protected  void setSizes(CrawlURI curi, HttpRecorder rec)
          Update CrawlURI internal sizes based on current transaction (and in the case of 304s, history)
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, getController, getDecideRule, getDefaultNextProcessor, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

ATTR_HTTP_PROXY_HOST

public static final java.lang.String ATTR_HTTP_PROXY_HOST
See Also:
Constant Field Values

ATTR_HTTP_PROXY_PORT

public static final java.lang.String ATTR_HTTP_PROXY_PORT
See Also:
Constant Field Values

ATTR_TIMEOUT_SECONDS

public static final java.lang.String ATTR_TIMEOUT_SECONDS
See Also:
Constant Field Values

ATTR_SOTIMEOUT_MS

public static final java.lang.String ATTR_SOTIMEOUT_MS
See Also:
Constant Field Values

ATTR_MAX_LENGTH_BYTES

public static final java.lang.String ATTR_MAX_LENGTH_BYTES
See Also:
Constant Field Values

ATTR_LOAD_COOKIES

public static final java.lang.String ATTR_LOAD_COOKIES
See Also:
Constant Field Values

ATTR_SAVE_COOKIES

public static final java.lang.String ATTR_SAVE_COOKIES
See Also:
Constant Field Values

ATTR_ACCEPT_HEADERS

public static final java.lang.String ATTR_ACCEPT_HEADERS
See Also:
Constant Field Values

ATTR_DEFAULT_ENCODING

public static final java.lang.String ATTR_DEFAULT_ENCODING
See Also:
Constant Field Values

ATTR_DIGEST_CONTENT

public static final java.lang.String ATTR_DIGEST_CONTENT
See Also:
Constant Field Values

ATTR_DIGEST_ALGORITHM

public static final java.lang.String ATTR_DIGEST_ALGORITHM
See Also:
Constant Field Values

ATTR_FETCH_BANDWIDTH_MAX

public static final java.lang.String ATTR_FETCH_BANDWIDTH_MAX
See Also:
Constant Field Values

DESC_DIGEST_CONTENT

public static final java.lang.String DESC_DIGEST_CONTENT
See Also:
Constant Field Values

DESC_DIGEST_ALGORITHM

public static final java.lang.String DESC_DIGEST_ALGORITHM
See Also:
Constant Field Values

ATTR_TRUST

public static final java.lang.String ATTR_TRUST
SSL trust level setting attribute name.

See Also:
Constant Field Values

DEFAULT_DIGEST_CONTENT

static java.lang.Boolean DEFAULT_DIGEST_CONTENT
Default whether to perform on-the-fly digest hashing of content-bodies.
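As an illustration of on-the-fly digest hashing, here is a minimal sketch using only the JDK's DigestInputStream. It is not the FetchHTTP implementation: the class and method names are hypothetical, and it assumes the standard JCA algorithm names "SHA-1" and "MD5" (the values of the SHA1 and MD5 constants are not shown on this page).

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper, not part of Heritrix. */
final class OnTheFlyDigestSketch {

    /**
     * Reads the whole stream and returns the hex digest of its bytes.
     * The digest is updated as a side effect of reading, so the body
     * is never buffered in full.
     */
    static String digestWhileReading(InputStream body, String algorithm) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            DigestInputStream in = new DigestInputStream(body, md);
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // A real fetcher would record or store the bytes here.
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the hash is computed while the body streams past, the cost is one extra pass of the digest function over data that is being read anyway.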


SHA1

public static final java.lang.String SHA1
The digest algorithms available to choose from: currently SHA-1 or MD5.

See Also:
Constant Field Values

MD5

public static final java.lang.String MD5
See Also:
Constant Field Values

DIGEST_ALGORITHMS

public static java.lang.String[] DIGEST_ALGORITHMS

DEFAULT_DIGEST_ALGORITHM

public static final java.lang.String DEFAULT_DIGEST_ALGORITHM
Default algorithm to use for message digesting.

See Also:
Constant Field Values

ATTR_MIDFETCH_DECIDE_RULES

public static final java.lang.String ATTR_MIDFETCH_DECIDE_RULES
Rules to apply mid-fetch, just after the response headers are received and before the body download starts.

See Also:
Constant Field Values
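The idea of a mid-fetch decision can be sketched as follows. This is only an illustration: it models a rule as a java.util.function.Predicate over the response headers instead of the DecideRule type, and the class, the method names, and the "reject video" rule are all hypothetical.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical illustration, not part of Heritrix. */
final class MidfetchSketch {

    /**
     * Returns true when the fetch should be aborted, that is, when the
     * rule does not accept the headers seen so far. Called after the
     * headers arrive and before any body bytes are read.
     */
    static boolean shouldAbort(Map<String, String> responseHeaders,
                               Predicate<Map<String, String>> acceptRule) {
        return !acceptRule.test(responseHeaders);
    }

    /** Example rule (an assumption): accept everything except video content. */
    static Predicate<Map<String, String>> rejectVideo() {
        return headers -> {
            String type = headers.get("Content-Type");
            return type == null
                    || !type.toLowerCase(Locale.ROOT).startsWith("video/");
        };
    }
}
```

Deciding at this point lets a crawler skip a large body it does not want without first downloading it.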

ATTR_SEND_CONNECTION_CLOSE

public static final java.lang.String ATTR_SEND_CONNECTION_CLOSE
See Also:
Constant Field Values

ATTR_SEND_REFERER

public static final java.lang.String ATTR_SEND_REFERER
See Also:
Constant Field Values

ATTR_SEND_RANGE

public static final java.lang.String ATTR_SEND_RANGE
See Also:
Constant Field Values

ATTR_SEND_IF_MODIFIED_SINCE

public static final java.lang.String ATTR_SEND_IF_MODIFIED_SINCE
See Also:
Constant Field Values

ATTR_SEND_IF_NONE_MATCH

public static final java.lang.String ATTR_SEND_IF_NONE_MATCH
See Also:
Constant Field Values

REFERER

public static final java.lang.String REFERER
See Also:
Constant Field Values

RANGE

public static final java.lang.String RANGE
See Also:
Constant Field Values

RANGE_PREFIX

public static final java.lang.String RANGE_PREFIX
See Also:
Constant Field Values

HTTP_SCHEME

public static final java.lang.String HTTP_SCHEME
See Also:
Constant Field Values

HTTPS_SCHEME

public static final java.lang.String HTTPS_SCHEME
See Also:
Constant Field Values

ATTR_IGNORE_COOKIES

public static final java.lang.String ATTR_IGNORE_COOKIES
See Also:
Constant Field Values

ATTR_BDB_COOKIES

public static final java.lang.String ATTR_BDB_COOKIES
See Also:
Constant Field Values

ATTR_HTTP_BIND_ADDRESS

public static final java.lang.String ATTR_HTTP_BIND_ADDRESS
See Also:
Constant Field Values

cookieDb

protected com.sleepycat.je.Database cookieDb
Database backing cookie map, if using BDB


COOKIEDB_NAME

public static final java.lang.String COOKIEDB_NAME
Name of cookie BDB Database

See Also:
Constant Field Values

SERVER_CACHE_KEY

static final java.lang.String SERVER_CACHE_KEY
See Also:
Constant Field Values

SSL_FACTORY_KEY

static final java.lang.String SSL_FACTORY_KEY
See Also:
Constant Field Values
Constructor Detail

FetchHTTP

public FetchHTTP(java.lang.String name)
Constructor.

Parameters:
name - Name of this processor.
Method Detail

innerProcess

protected void innerProcess(CrawlURI curi)
                     throws java.lang.InterruptedException
Description copied from class: Processor
Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.

Overrides:
innerProcess in class Processor
Parameters:
curi - The CrawlURI being processed.
Throws:
java.lang.InterruptedException

setSizes

protected void setSizes(CrawlURI curi,
                        HttpRecorder rec)
Update CrawlURI internal sizes based on current transaction (and in the case of 304s, history)

Parameters:
curi - CrawlURI
rec - HttpRecorder

doAbort

protected void doAbort(CrawlURI curi,
                       org.apache.commons.httpclient.HttpMethod method,
                       java.lang.String annotation)

checkMidfetchAbort

protected boolean checkMidfetchAbort(CrawlURI curi,
                                     HttpRecorderMethod method,
                                     org.apache.commons.httpclient.HttpConnection conn)

getMidfetchRule

protected DecideRule getMidfetchRule(java.lang.Object o)

addResponseContent

protected void addResponseContent(org.apache.commons.httpclient.HttpMethod method,
                                  CrawlURI curi)
This method populates curi with response status and content type.

Parameters:
curi - CrawlURI to populate.
method - Method to get response status and headers from.

configureMethod

protected org.apache.commons.httpclient.HostConfiguration configureMethod(CrawlURI curi,
                                                                          org.apache.commons.httpclient.HttpMethod method)
Configure the HttpMethod setting options and headers.

Parameters:
curi - CrawlURI from which we pull configuration.
method - The Method to configure.
Returns:
HostConfiguration copy customized for this CrawlURI

setConditionalGetHeader

protected void setConditionalGetHeader(CrawlURI curi,
                                       org.apache.commons.httpclient.HttpMethod method,
                                       java.lang.String setting,
                                       java.lang.String sourceHeader,
                                       java.lang.String targetHeader)
Set the given conditional-GET header, if the setting is enabled and a suitable value is available in the URI history.

Parameters:
curi - source CrawlURI
method - HTTP operation pending
setting - true/false enablement setting name to consult
sourceHeader - header to consult in URI history
targetHeader - header to set if possible
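The parameter roles above can be sketched with plain maps standing in for the settings, the URI history, and the pending request. This is an assumption-laden illustration, not the FetchHTTP code: the class name and the setting key used in the usage note are hypothetical. The header pairing itself is standard HTTP: a stored Last-Modified value is sent back as If-Modified-Since, and a stored ETag as If-None-Match.

```java
import java.util.Map;

/** Hypothetical illustration, not part of Heritrix. */
final class ConditionalGetSketch {

    /**
     * Copies sourceHeader from the previous response into targetHeader
     * of the new request, but only if the named setting is enabled and
     * a non-empty value was recorded. Returns true if a header was set.
     */
    static boolean setConditionalGetHeader(Map<String, Boolean> settings,
                                           Map<String, String> previousResponseHeaders,
                                           Map<String, String> requestHeaders,
                                           String setting,
                                           String sourceHeader,
                                           String targetHeader) {
        if (!Boolean.TRUE.equals(settings.get(setting))) {
            return false; // feature switched off (or not configured)
        }
        String value = previousResponseHeaders.get(sourceHeader);
        if (value == null || value.isEmpty()) {
            return false; // nothing suitable in the history
        }
        requestHeaders.put(targetHeader, value);
        return true;
    }
}
```

For example, with a hypothetical setting key "send-if-none-match" enabled and a history entry ETag = "\"abc123\"", the call sets If-None-Match on the request; a server that still has the same entity can then answer 304 Not Modified without resending the body.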

getAttributeEither

protected java.lang.Object getAttributeEither(CrawlURI curi,
                                              java.lang.String key)
Get a value either from inside the CrawlURI instance, or from settings (module attributes).

Parameters:
curi - CrawlURI to consult
key - key to lookup
Returns:
value from either CrawlURI (preferred) or settings
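The lookup order described here (per-URI value first, settings as fallback) can be sketched with two maps. The class and parameter names are hypothetical stand-ins; the real method reads from a CrawlURI and the module's attributes.

```java
import java.util.Map;

/** Hypothetical illustration, not part of Heritrix. */
final class AttributeEitherSketch {

    /**
     * Returns the value stored on the URI itself if present,
     * otherwise the value from the module settings (may be null).
     */
    static Object getAttributeEither(Map<String, Object> uriAttributes,
                                     Map<String, Object> settings,
                                     String key) {
        Object fromUri = uriAttributes.get(key);
        return fromUri != null ? fromUri : settings.get(key);
    }
}
```

This pattern lets an individual URI override a crawl-wide default, such as a proxy host, without changing the settings.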

handle401

protected void handle401(org.apache.commons.httpclient.HttpMethod method,
                         CrawlURI curi)
Server is looking for basic/digest auth credentials (RFC2617). If we have any, put them into the CrawlURI and have it come around again. Presence of the credential serves as flag to frontier to requeue promptly. If we already tried this domain and still got a 401, then our credentials are bad. Remove them and let this curi die.

Parameters:
method - Method that got a 401.
curi - CrawlURI that got a 401.
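The retry decision described above can be sketched as a small state check. This is a simplified model under stated assumptions, not the actual method: credentials are represented as plain strings, and the class, enum, and parameter names are hypothetical.

```java
import java.util.Set;

/** Hypothetical illustration, not part of Heritrix. */
final class Handle401Sketch {

    enum Outcome { RETRY_WITH_CREDENTIALS, GIVE_UP }

    /**
     * attached:  credentials already sent with the request that got the 401.
     * available: credentials configured for this domain.
     */
    static Outcome on401(Set<String> attached, Set<String> available) {
        if (!attached.isEmpty()) {
            // We already tried and still got a 401: the credentials are bad.
            attached.clear();
            return Outcome.GIVE_UP;
        }
        if (available.isEmpty()) {
            return Outcome.GIVE_UP; // nothing to offer the server
        }
        // Attaching credentials is the signal to requeue the URI promptly.
        attached.addAll(available);
        return Outcome.RETRY_WITH_CREDENTIALS;
    }
}
```

Clearing the attached credentials on the second 401 is what prevents an endless retry loop against a server that keeps rejecting them.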

getAuthScheme

protected org.apache.commons.httpclient.auth.AuthScheme getAuthScheme(org.apache.commons.httpclient.HttpMethod method,
                                                                      CrawlURI curi)
Parameters:
method - Method that got a 401.
curi - CrawlURI that got a 401.
Returns:
The first wholesome auth scheme found, or null if there is none.

initialTasks

public void initialTasks()
Description copied from class: Processor
Classes subclassing this one should override this method to perform processor specific actions.

This method is guaranteed to be called after the crawl is set up, but before any URI processing has occurred.

Overrides:
initialTasks in class Processor

finalTasks

public void finalTasks()
Description copied from class: Processor
Classes subclassing this one should override this method to perform processor specific actions.

Overrides:
finalTasks in class Processor

cleanupHttp

protected void cleanupHttp()
Perform any final cleanup related to the HttpClient instance.


configureHttp

protected void configureHttp()
                      throws java.lang.RuntimeException
Throws:
java.lang.RuntimeException

loadCookies

public void loadCookies(java.lang.String cookiesFile)
Load cookies from a file before the first fetch.

The file is a text file in the Netscape 'cookies.txt' format.
Example entry in a cookies.txt file:

www.archive.org FALSE / FALSE 1074567117 details-visit texts-cralond

Each line has 7 tab-separated fields:

  1. DOMAIN: The domain that created the cookie value and has access to it.
  2. FLAG: A TRUE or FALSE value indicating whether hosts within the given domain can access the cookie value.
  3. PATH: The path within the domain for which the cookie value is valid.
  4. SECURE: A TRUE or FALSE value indicating whether a secure connection is required to access the cookie value.
  5. EXPIRATION: The expiration time of the cookie value (Unix style).
  6. NAME: The name of the cookie value.
  7. VALUE: The cookie value.

    Parameters:
    cookiesFile - file in the Netscape's 'cookies.txt' format.
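To make the seven-field layout concrete, here is a minimal sketch of parsing one cookies.txt line. It is not the loader used by FetchHTTP: the CookieLine class is hypothetical, and skipping blank lines, '#' comment lines, and malformed lines is an assumption.

```java
/** Hypothetical holder for one cookies.txt entry, not part of Heritrix. */
final class CookieLine {
    final String domain;
    final boolean domainFlag;
    final String path;
    final boolean secure;
    final long expiresEpochSeconds;
    final String name;
    final String value;

    private CookieLine(String[] f) {
        domain = f[0];
        domainFlag = Boolean.parseBoolean(f[1]); // "TRUE"/"FALSE", case-insensitive
        path = f[2];
        secure = Boolean.parseBoolean(f[3]);
        expiresEpochSeconds = Long.parseLong(f[4]);
        name = f[5];
        value = f[6];
    }

    /**
     * Parses one tab-separated line. Returns null for blank lines,
     * comment lines, and lines that do not have exactly 7 fields
     * or whose expiration is not a number.
     */
    static CookieLine parse(String line) {
        if (line == null || line.trim().isEmpty() || line.startsWith("#")) {
            return null;
        }
        String[] fields = line.split("\t", -1); // -1 keeps an empty VALUE field
        if (fields.length != 7) {
            return null;
        }
        try {
            return new CookieLine(fields);
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

Parsing the example entry shown above (with tabs between the fields) yields domain www.archive.org, path /, expiration 1074567117, name details-visit, and value texts-cralond.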

    report

    public java.lang.String report()
    Description copied from class: Processor
    Compiles and returns a report (in human readable form) about the status of the processor. The processor's name (of implementing class) should always be included.

    Examples of stats declared would include:
    * Number of CrawlURIs handled.
    * Number of links extracted (for link extractors)
    etc.

    Overrides:
    report in class Processor
    Returns:
    A human readable report on the processor's state.

    loadCookies

    public void loadCookies()
    Load cookies from the file specified in the order file.

    The file is a text file in the Netscape 'cookies.txt' format.
    Example entry in a cookies.txt file:

    www.archive.org FALSE / FALSE 1074567117 details-visit texts-cralond

    Each line has 7 tab-separated fields:

    1. DOMAIN: The domain that created the cookie value and has access to it.
    2. FLAG: A TRUE or FALSE value indicating whether hosts within the given domain can access the cookie value.
    3. PATH: The path within the domain for which the cookie value is valid.
    4. SECURE: A TRUE or FALSE value indicating whether a secure connection is required to access the cookie value.
    5. EXPIRATION: The expiration time of the cookie value (Unix style).
    6. NAME: The name of the cookie value.
    7. VALUE: The cookie value.


    saveCookies

    public void saveCookies()
    Saves cookies to the file specified in the order file. Output file is in the Netscape 'cookies.txt' format.


    saveCookies

    public void saveCookies(java.lang.String saveCookiesFile)
    Saves cookies to a file. Output file is in the Netscape 'cookies.txt' format.

    Parameters:
    saveCookiesFile - output file.
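For the output direction, one line of the Netscape 'cookies.txt' format can be produced by joining the seven fields with tabs. This is a minimal sketch with a hypothetical class name, not the code FetchHTTP uses to write the file.

```java
/** Hypothetical illustration, not part of Heritrix. */
final class CookieLineWriterSketch {

    /** Formats one cookie as a tab-separated cookies.txt line (no newline). */
    static String toLine(String domain, boolean domainFlag, String path,
                         boolean secure, long expiresEpochSeconds,
                         String name, String value) {
        return String.join("\t",
                domain,
                domainFlag ? "TRUE" : "FALSE",
                path,
                secure ? "TRUE" : "FALSE",
                Long.toString(expiresEpochSeconds),
                name,
                value);
    }
}
```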

    listUsedFiles

    protected void listUsedFiles(java.util.List<java.lang.String> list)
    Description copied from class: ModuleType
    Those Modules that use files on disk should list them all when this method is called.

    Each file (as a string name with full path) should be added to the provided list.

    Modules that do not use any files can safely ignore this method.

    Overrides:
    listUsedFiles in class ModuleType
    Parameters:
    list - The list to add files to.

    getHttp

    protected org.apache.commons.httpclient.HttpClient getHttp()
    Returns:
    Returns the http instance.

    crawlStarted

    public void crawlStarted(java.lang.String message)
    Description copied from interface: CrawlStatusListener
    Called on crawl start.

    Specified by:
    crawlStarted in interface CrawlStatusListener
    Parameters:
    message - Start message.

    crawlCheckpoint

    public void crawlCheckpoint(java.io.File checkpointDir)
    Description copied from interface: CrawlStatusListener
    Called by CrawlController when checkpointing.

    Specified by:
    crawlCheckpoint in interface CrawlStatusListener
    Parameters:
    checkpointDir - Checkpoint dir. Write checkpoint state here.

    crawlEnding

    public void crawlEnding(java.lang.String sExitMessage)
    Description copied from interface: CrawlStatusListener
    Called when a CrawlController is ending a crawl (for any reason)

    Specified by:
    crawlEnding in interface CrawlStatusListener
    Parameters:
    sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
    See Also:
    CrawlJob

    crawlEnded

    public void crawlEnded(java.lang.String sExitMessage)
    Description copied from interface: CrawlStatusListener
    Called when a CrawlController has ended a crawl and is about to exit.

    Specified by:
    crawlEnded in interface CrawlStatusListener
    Parameters:
    sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
    See Also:
    CrawlJob

    crawlPausing

    public void crawlPausing(java.lang.String statusMessage)
    Description copied from interface: CrawlStatusListener
    Called when a CrawlController is going to be paused.

    Specified by:
    crawlPausing in interface CrawlStatusListener
    Parameters:
    statusMessage - Should be STATUS_WAITING_FOR_PAUSE. Passed for convenience.

    crawlPaused

    public void crawlPaused(java.lang.String statusMessage)
    Description copied from interface: CrawlStatusListener
    Called when a CrawlController is actually paused (all threads are idle).

    Specified by:
    crawlPaused in interface CrawlStatusListener
    Parameters:
    statusMessage - Should be CrawlJob.STATUS_PAUSED. Passed for convenience.

    crawlResuming

    public void crawlResuming(java.lang.String statusMessage)
    Description copied from interface: CrawlStatusListener
    Called when a CrawlController is resuming a crawl that had been paused.

    Specified by:
    crawlResuming in interface CrawlStatusListener
    Parameters:
    statusMessage - Should be CrawlJob.STATUS_RUNNING. Passed for convenience.


    Copyright © 2003-2011 Internet Archive. All Rights Reserved.