org.archive.crawler.frontier
Class AbstractFrontier

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.frontier.AbstractFrontier
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants, FetchStatusCodes, CrawlStatusListener, Frontier, Reporter
Direct Known Subclasses:
WorkQueueFrontier

public abstract class AbstractFrontier
extends ModuleType
implements CrawlStatusListener, Frontier, FetchStatusCodes, CoreAttributeConstants, java.io.Serializable

Shared facilities for Frontier implementations.

Author:
gojomo
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Nested classes/interfaces inherited from interface org.archive.crawler.framework.Frontier
Frontier.FrontierGroup
 
Field Summary
protected static java.lang.String ACCEPTABLE_FORCE_QUEUE
           
static java.lang.String ATTR_DELAY_FACTOR
          how many multiples of last fetch elapsed time to wait before recontacting same server
static java.lang.String ATTR_FORCE_QUEUE
          queue assignment to force onto CrawlURIs; intended to be overridden
static java.lang.String ATTR_MAX_DELAY
          never wait more than this long, regardless of multiple
static java.lang.String ATTR_MAX_HOST_BANDWIDTH_USAGE
          maximum per-host bandwidth usage
static java.lang.String ATTR_MAX_OVERALL_BANDWIDTH_USAGE
          maximum overall bandwidth usage
static java.lang.String ATTR_MAX_RETRIES
          maximum times to emit a CrawlURI without final disposition
static java.lang.String ATTR_MIN_DELAY
          always wait this long after one completion before recontacting same server, regardless of multiple
static java.lang.String ATTR_PAUSE_AT_FINISH
          whether to pause, rather than finish, when crawl appears done
static java.lang.String ATTR_PAUSE_AT_START
          whether to pause at crawl start
static java.lang.String ATTR_PREFERENCE_EMBED_HOPS
          number of hops of embeds (ERX) to bump to front of host queue
static java.lang.String ATTR_QUEUE_ASSIGNMENT_POLICY
           
protected static java.lang.String ATTR_RECOVERY_ENABLED
          Recover log on or off attribute.
static java.lang.String ATTR_RESPECT_CRAWL_DELAY_UP_TO_SECS
          Whether to respect a 'Crawl-Delay' (in seconds) given in a site's robots.txt
static java.lang.String ATTR_RETRY_DELAY
          for retryable problems, seconds to wait before a retry
static java.lang.String ATTR_SOURCE_TAG_SEEDS
          whether to tag each seed with its own URI as a heritable source tag
protected  CrawlController controller
           
protected static java.lang.Boolean DEFAULT_ATTR_RECOVERY_ENABLED
           
protected static java.lang.Float DEFAULT_DELAY_FACTOR
           
protected static java.lang.String DEFAULT_FORCE_QUEUE
           
protected static java.lang.Integer DEFAULT_MAX_DELAY
           
protected static java.lang.Integer DEFAULT_MAX_HOST_BANDWIDTH_USAGE
           
protected static java.lang.Integer DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE
           
protected static java.lang.Integer DEFAULT_MAX_RETRIES
           
protected static java.lang.Integer DEFAULT_MIN_DELAY
           
protected static java.lang.Boolean DEFAULT_PAUSE_AT_FINISH
           
protected static java.lang.Boolean DEFAULT_PAUSE_AT_START
           
protected static java.lang.Integer DEFAULT_PREFERENCE_EMBED_HOPS
           
protected static java.lang.Integer DEFAULT_RESPECT_CRAWL_DELAY_UP_TO_SECS
           
protected static java.lang.Long DEFAULT_RETRY_DELAY
           
protected static java.lang.Boolean DEFAULT_SOURCE_TAG_SEEDS
           
protected  long disregardedUriCount
           
protected  long failedFetchCount
           
static java.lang.String IGNORED_SEEDS_FILENAME
          file collecting report of ignored seed-file entries (if any)
protected  int lastMaxBandwidthKB
           
protected  java.util.concurrent.atomic.AtomicLong liveDisregardedUriCount
          URIs that are disregarded (for example, because of robots.txt rules)
protected  java.util.concurrent.atomic.AtomicLong liveFailedFetchCount
           
protected  java.util.concurrent.atomic.AtomicLong liveQueuedUriCount
          total URIs queued to be visited
protected  java.util.concurrent.atomic.AtomicLong liveSucceededFetchCount
           
protected  java.util.concurrent.atomic.AtomicLong nextOrdinal
          ordinal numbers to assign to created CrawlURIs
protected  long processedBytesAfterLastEmittedURI
           
protected  long queuedUriCount
           
protected  boolean shouldPause
          should the frontier hold any threads asking for URIs?
protected  boolean shouldTerminate
          should the frontier send an EndedException to any threads asking for URIs?
protected  long succeededFetchCount
           
protected  long totalProcessedBytes
          Used when bandwidth constraints are in effect.
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.framework.Frontier
ATTR_NAME
 
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Constructor Summary
AbstractFrontier(java.lang.String name, java.lang.String description)
           
 
Method Summary
protected  void applySpecialHandling(CrawlURI curi)
          Perform any special handling of the CrawlURI, such as promoting its URI to seed-status, or preferencing it because it is an embed.
protected  CrawlURI asCrawlUri(CandidateURI caUri)
           
protected  java.lang.String canonicalize(CandidateURI cauri)
          Canonicalize passed CandidateURI.
protected  java.lang.String canonicalize(UURI uuri)
          Canonicalize passed uuri.
 void crawlCheckpoint(java.io.File checkpointDir)
          Called by CrawlController when checkpointing.
 void crawlEnded(java.lang.String sExitMessage)
          Called when a CrawlController has ended a crawl and is about to exit.
 void crawlEnding(java.lang.String sExitMessage)
          Called when a CrawlController is ending a crawl (for any reason)
 void crawlPaused(java.lang.String statusMessage)
          Called when a CrawlController is actually paused (all threads are idle).
 void crawlPausing(java.lang.String statusMessage)
          Called when a CrawlController is going to be paused.
 void crawlResuming(java.lang.String statusMessage)
          Called when a CrawlController is resuming a crawl that had been paused.
 void crawlStarted(java.lang.String message)
          Called on crawl start.
protected  void decrementQueuedCount(long numberOfDeletes)
          Note that a number of queued URIs have been deleted.
 long disregardedUriCount()
          Number of URIs that were scheduled at one point but have been disregarded.
protected  void doJournalAdded(CrawlURI c)
           
protected  void doJournalDisregarded(CrawlURI c)
           
protected  void doJournalEmitted(CrawlURI c)
           
protected  void doJournalFinishedFailure(CrawlURI c)
           
protected  void doJournalFinishedSuccess(CrawlURI c)
           
protected  void doJournalRescheduled(CrawlURI c)
           
 long failedFetchCount()
          Number of URIs that failed to process.
 long finishedUriCount()
          Number of finished URIs.
 java.lang.String getClassKey(CandidateURI cauri)
           
 FrontierJournal getFrontierJournal()
           
protected  QueueAssignmentPolicy getQueueAssignmentPolicy(CandidateURI cauri)
           
protected  CrawlServer getServer(CrawlURI curi)
           
 void importRecoverLog(java.lang.String pathToLog, boolean retainFailures)
          Recover earlier state by reading a recovery log.
protected  void incrementDisregardedUriCount()
          Increment the running count of disregarded URIs.
protected  void incrementFailedFetchCount()
          Increment the running count of failed URIs.
protected  void incrementQueuedUriCount()
          Increment the running count of queued URIs.
protected  void incrementQueuedUriCount(long increment)
          Increment the running count of queued URIs.
protected  void incrementSucceededFetchCount()
          Increment the running count of successfully fetched URIs.
 void initialize(CrawlController c)
          Initialize the Frontier.
protected  boolean isDisregarded(CrawlURI curi)
           
 boolean isEmpty()
          Frontier is empty only if all queues are empty and no URIs are in-process
 void kickUpdate()
          Notify Frontier that it should consider updating configuration info that may have changed in external files.
 void loadSeeds()
          Load up the seeds.
protected  void log(CrawlURI curi)
          Log to the main crawl.log
protected  void logLocalizedErrors(CrawlURI curi)
          Take note of any processor-local errors that have been entered into the CrawlURI.
protected  boolean needsRetrying(CrawlURI curi)
          Checks if a recently completed CrawlURI that did not finish successfully needs to be retried (processed again after some time elapses)
protected  void noteAboutToEmit(CrawlURI curi, WorkQueue q)
          Perform fixups on a CrawlURI about to be returned via next().
protected  boolean overMaxRetries(CrawlURI curi)
           
 void pause()
          Notify Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise.
protected  long politenessDelayFor(CrawlURI curi)
          Return the politeness delay to observe before another URI at the same host is visited.
protected  void preNext(long now)
           
 long queuedUriCount()
          Number of queued URIs.
 void reportTo(java.io.PrintWriter writer)
          Make a default report to the passed-in Writer.
protected  long retryDelayFor(CrawlURI curi)
          Return a suitable value to wait before retrying the given URI.
static void saveIgnoredItems(java.lang.String ignoredItems, java.io.File dir)
          Dump ignored seed items (if any) to disk; delete file otherwise.
protected  java.io.File scratchDirFor(java.lang.String key)
          Utility method to return a scratch dir for the given key's temp files.
 java.lang.String singleLineReport()
          Return a short single-line summary report as a String.
 void start()
          Request that Frontier allow crawling to begin.
 long succeededFetchCount()
          Number of successfully processed URIs.
protected  void tally(CrawlURI curi, CrawlSubstats.Stage stage)
          Report CrawlURI to each of the three 'substats' accumulators (group/queue, server, host) for a given stage.
 void terminate()
          Notify Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException.
 long totalBytesWritten()
          Deprecated. misnomer; use StatisticsTracking figures instead
 void unpause()
          Resumes the release of URIs to crawl, allowing worker ToeThreads to proceed.
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface org.archive.crawler.framework.Frontier
averageDepth, congestionRatio, considerIncluded, deepestUri, deleted, deleteURIs, deleteURIs, discoveredUriCount, finalTasks, finished, getGroup, getInitialMarker, getURIsList, next, schedule
 
Methods inherited from interface org.archive.util.Reporter
getReports, reportTo, singleLineLegend, singleLineReportTo
 

Field Detail

controller

protected transient CrawlController controller

nextOrdinal

protected java.util.concurrent.atomic.AtomicLong nextOrdinal
ordinal numbers to assign to created CrawlURIs


shouldPause

protected boolean shouldPause
should the frontier hold any threads asking for URIs?


shouldTerminate

protected transient boolean shouldTerminate
should the frontier send an EndedException to any threads asking for URIs?


ATTR_DELAY_FACTOR

public static final java.lang.String ATTR_DELAY_FACTOR
how many multiples of last fetch elapsed time to wait before recontacting same server

See Also:
Constant Field Values

DEFAULT_DELAY_FACTOR

protected static final java.lang.Float DEFAULT_DELAY_FACTOR

ATTR_MIN_DELAY

public static final java.lang.String ATTR_MIN_DELAY
always wait this long after one completion before recontacting same server, regardless of multiple

See Also:
Constant Field Values

DEFAULT_MIN_DELAY

protected static final java.lang.Integer DEFAULT_MIN_DELAY

ATTR_RESPECT_CRAWL_DELAY_UP_TO_SECS

public static final java.lang.String ATTR_RESPECT_CRAWL_DELAY_UP_TO_SECS
Whether to respect a 'Crawl-Delay' (in seconds) given in a site's robots.txt

See Also:
Constant Field Values

DEFAULT_RESPECT_CRAWL_DELAY_UP_TO_SECS

protected static final java.lang.Integer DEFAULT_RESPECT_CRAWL_DELAY_UP_TO_SECS

ATTR_MAX_DELAY

public static final java.lang.String ATTR_MAX_DELAY
never wait more than this long, regardless of multiple

See Also:
Constant Field Values

DEFAULT_MAX_DELAY

protected static final java.lang.Integer DEFAULT_MAX_DELAY

ATTR_PREFERENCE_EMBED_HOPS

public static final java.lang.String ATTR_PREFERENCE_EMBED_HOPS
number of hops of embeds (ERX) to bump to front of host queue

See Also:
Constant Field Values

DEFAULT_PREFERENCE_EMBED_HOPS

protected static final java.lang.Integer DEFAULT_PREFERENCE_EMBED_HOPS

ATTR_MAX_HOST_BANDWIDTH_USAGE

public static final java.lang.String ATTR_MAX_HOST_BANDWIDTH_USAGE
maximum per-host bandwidth usage

See Also:
Constant Field Values

DEFAULT_MAX_HOST_BANDWIDTH_USAGE

protected static final java.lang.Integer DEFAULT_MAX_HOST_BANDWIDTH_USAGE

ATTR_MAX_OVERALL_BANDWIDTH_USAGE

public static final java.lang.String ATTR_MAX_OVERALL_BANDWIDTH_USAGE
maximum overall bandwidth usage

See Also:
Constant Field Values

DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE

protected static final java.lang.Integer DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE

ATTR_RETRY_DELAY

public static final java.lang.String ATTR_RETRY_DELAY
for retryable problems, seconds to wait before a retry

See Also:
Constant Field Values

DEFAULT_RETRY_DELAY

protected static final java.lang.Long DEFAULT_RETRY_DELAY

ATTR_MAX_RETRIES

public static final java.lang.String ATTR_MAX_RETRIES
maximum times to emit a CrawlURI without final disposition

See Also:
Constant Field Values

DEFAULT_MAX_RETRIES

protected static final java.lang.Integer DEFAULT_MAX_RETRIES

ATTR_QUEUE_ASSIGNMENT_POLICY

public static final java.lang.String ATTR_QUEUE_ASSIGNMENT_POLICY
See Also:
Constant Field Values

ATTR_FORCE_QUEUE

public static final java.lang.String ATTR_FORCE_QUEUE
queue assignment to force onto CrawlURIs; intended to be overridden

See Also:
Constant Field Values

DEFAULT_FORCE_QUEUE

protected static final java.lang.String DEFAULT_FORCE_QUEUE
See Also:
Constant Field Values

ACCEPTABLE_FORCE_QUEUE

protected static final java.lang.String ACCEPTABLE_FORCE_QUEUE
See Also:
Constant Field Values

ATTR_PAUSE_AT_FINISH

public static final java.lang.String ATTR_PAUSE_AT_FINISH
whether to pause, rather than finish, when crawl appears done

See Also:
Constant Field Values

DEFAULT_PAUSE_AT_FINISH

protected static final java.lang.Boolean DEFAULT_PAUSE_AT_FINISH

ATTR_PAUSE_AT_START

public static final java.lang.String ATTR_PAUSE_AT_START
whether to pause at crawl start

See Also:
Constant Field Values

DEFAULT_PAUSE_AT_START

protected static final java.lang.Boolean DEFAULT_PAUSE_AT_START

ATTR_SOURCE_TAG_SEEDS

public static final java.lang.String ATTR_SOURCE_TAG_SEEDS
whether to tag each seed with its own URI as a heritable source tag

See Also:
Constant Field Values

DEFAULT_SOURCE_TAG_SEEDS

protected static final java.lang.Boolean DEFAULT_SOURCE_TAG_SEEDS

ATTR_RECOVERY_ENABLED

protected static final java.lang.String ATTR_RECOVERY_ENABLED
Recover log on or off attribute.

See Also:
Constant Field Values

DEFAULT_ATTR_RECOVERY_ENABLED

protected static final java.lang.Boolean DEFAULT_ATTR_RECOVERY_ENABLED

queuedUriCount

protected long queuedUriCount

succeededFetchCount

protected long succeededFetchCount

failedFetchCount

protected long failedFetchCount

disregardedUriCount

protected long disregardedUriCount

liveQueuedUriCount

protected transient java.util.concurrent.atomic.AtomicLong liveQueuedUriCount
total URIs queued to be visited


liveSucceededFetchCount

protected transient java.util.concurrent.atomic.AtomicLong liveSucceededFetchCount

liveFailedFetchCount

protected transient java.util.concurrent.atomic.AtomicLong liveFailedFetchCount

liveDisregardedUriCount

protected transient java.util.concurrent.atomic.AtomicLong liveDisregardedUriCount
URIs that are disregarded (for example, because of robots.txt rules)


totalProcessedBytes

protected long totalProcessedBytes
Used when bandwidth constraints are in effect.


processedBytesAfterLastEmittedURI

protected long processedBytesAfterLastEmittedURI

lastMaxBandwidthKB

protected int lastMaxBandwidthKB

IGNORED_SEEDS_FILENAME

public static final java.lang.String IGNORED_SEEDS_FILENAME
file collecting report of ignored seed-file entries (if any)

See Also:
Constant Field Values
Constructor Detail

AbstractFrontier

public AbstractFrontier(java.lang.String name,
                        java.lang.String description)
Parameters:
name - Name of this frontier.
description - Description for this frontier.
Method Detail

start

public void start()
Description copied from interface: Frontier
Request that Frontier allow crawling to begin. Usually just unpauses Frontier, if paused.

Specified by:
start in interface Frontier

pause

public void pause()
Description copied from interface: Frontier
Notify Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise.

Specified by:
pause in interface Frontier

unpause

public void unpause()
Description copied from interface: Frontier
Resumes the release of URIs to crawl, allowing worker ToeThreads to proceed.

Specified by:
unpause in interface Frontier

initialize

public void initialize(CrawlController c)
                throws FatalConfigurationException,
                       java.io.IOException
Description copied from interface: Frontier
Initialize the Frontier.

This method is invoked by the CrawlController once it has created the Frontier. The constructor of the Frontier should only contain code for setting up its settings framework. This method should contain all other 'startup' code.

Specified by:
initialize in interface Frontier
Parameters:
c - The CrawlController that created the Frontier.
Throws:
FatalConfigurationException - If provided settings are illegal or otherwise unusable.
java.io.IOException - If there is a problem reading settings or seeds file from disk.

terminate

public void terminate()
Description copied from interface: Frontier
Notify Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException.

Specified by:
terminate in interface Frontier

tally

protected void tally(CrawlURI curi,
                     CrawlSubstats.Stage stage)
Report CrawlURI to each of the three 'substats' accumulators (group/queue, server, host) for a given stage.

Parameters:
curi -
stage -

doJournalFinishedSuccess

protected void doJournalFinishedSuccess(CrawlURI c)

doJournalAdded

protected void doJournalAdded(CrawlURI c)

doJournalRescheduled

protected void doJournalRescheduled(CrawlURI c)

doJournalFinishedFailure

protected void doJournalFinishedFailure(CrawlURI c)

doJournalDisregarded

protected void doJournalDisregarded(CrawlURI c)

doJournalEmitted

protected void doJournalEmitted(CrawlURI c)

isEmpty

public boolean isEmpty()
Frontier is empty only if all queues are empty and no URIs are in-process

Specified by:
isEmpty in interface Frontier
Returns:
True if queues are empty.

incrementQueuedUriCount

protected void incrementQueuedUriCount()
Increment the running count of queued URIs.


incrementQueuedUriCount

protected void incrementQueuedUriCount(long increment)
Increment the running count of queued URIs. Synchronized because operations on longs are not atomic.

Parameters:
increment - amount to increment the queued count

decrementQueuedCount

protected void decrementQueuedCount(long numberOfDeletes)
Note that a number of queued URIs have been deleted.

Parameters:
numberOfDeletes -
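
The increment and decrement methods above, together with the AtomicLong fields such as liveQueuedUriCount, suggest a thread-safe running counter. The following is a minimal sketch of that idea, not the class's actual implementation: QueuedCountSketch and countAfterConcurrentIncrements are hypothetical names, and backing the count with a single AtomicLong is an assumption.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the queued-URI counter; not the Heritrix class.
final class QueuedCountSketch {
    // Assumption: the live count is held in an AtomicLong so that
    // concurrent worker threads can update it without losing increments.
    private final AtomicLong liveQueuedUriCount = new AtomicLong();

    void incrementQueuedUriCount() {
        liveQueuedUriCount.incrementAndGet();
    }

    void incrementQueuedUriCount(long increment) {
        liveQueuedUriCount.addAndGet(increment);
    }

    void decrementQueuedCount(long numberOfDeletes) {
        liveQueuedUriCount.addAndGet(-numberOfDeletes);
    }

    long queuedUriCount() {
        return liveQueuedUriCount.get();
    }

    // Demo helper: several threads increment the same counter, then the
    // final value is read once every thread has finished.
    static long countAfterConcurrentIncrements(int threads, final int perThread) {
        final QueuedCountSketch counts = new QueuedCountSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < perThread; n++) {
                    counts.incrementQueuedUriCount();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counts.queuedUriCount();
    }
}
```

With a plain long field and no synchronization, the concurrent demo could lose updates; the atomic counter avoids that without explicit locking.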

queuedUriCount

public long queuedUriCount()

Specified by:
queuedUriCount in interface Frontier
Returns:
Number of queued URIs.
See Also:
Frontier.queuedUriCount()

finishedUriCount

public long finishedUriCount()

Specified by:
finishedUriCount in interface Frontier
Returns:
Number of finished URIs.
See Also:
Frontier.finishedUriCount()

incrementSucceededFetchCount

protected void incrementSucceededFetchCount()
Increment the running count of successfully fetched URIs.


succeededFetchCount

public long succeededFetchCount()

Specified by:
succeededFetchCount in interface Frontier
Returns:
Number of successfully processed URIs.
See Also:
Frontier.succeededFetchCount()

incrementFailedFetchCount

protected void incrementFailedFetchCount()
Increment the running count of failed URIs.


failedFetchCount

public long failedFetchCount()

Specified by:
failedFetchCount in interface Frontier
Returns:
Number of URIs that failed to process.
See Also:
Frontier.failedFetchCount()

incrementDisregardedUriCount

protected void incrementDisregardedUriCount()
Increment the running count of disregarded URIs. Synchronized because operations on longs are not atomic.


disregardedUriCount

public long disregardedUriCount()
Description copied from interface: Frontier
Number of URIs that were scheduled at one point but have been disregarded.

Counts any URI that is scheduled only to be disregarded because it is determined to lie outside the scope of the crawl. Most commonly this will be due to robots.txt exclusions.

Specified by:
disregardedUriCount in interface Frontier
Returns:
The number of URIs that have been disregarded.

totalBytesWritten

public long totalBytesWritten()
Deprecated. misnomer; use StatisticsTracking figures instead

Description copied from interface: Frontier
Total number of bytes contained in all URIs that have been processed.

Specified by:
totalBytesWritten in interface Frontier
Returns:
The total number of bytes in all processed URIs.

loadSeeds

public void loadSeeds()
Load up the seeds. This method is called on initialize and by the CrawlController when it wants to force reloading of configuration.

Specified by:
loadSeeds in interface Frontier
See Also:
CrawlController.kickUpdate()

saveIgnoredItems

public static void saveIgnoredItems(java.lang.String ignoredItems,
                                    java.io.File dir)
Dump ignored seed items (if any) to disk; delete file otherwise. Static to allow non-derived sibling classes (frontiers not yet subclassed here) to reuse.

Parameters:
ignoredItems -
dir -
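
The "dump if any, delete otherwise" behaviour described for saveIgnoredItems can be sketched as follows. This is an illustration under stated assumptions, not the actual method: IgnoredSeedsSketch and FILE_NAME are hypothetical (the real file name is held in IGNORED_SEEDS_FILENAME, whose value is not shown here), the sketch takes a java.nio.file.Path rather than a java.io.File, and UTF-8 output is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; not the Heritrix implementation.
final class IgnoredSeedsSketch {
    // Placeholder name: the real value of IGNORED_SEEDS_FILENAME is not given.
    static final String FILE_NAME = "ignored-seeds-sketch.txt";

    static void saveIgnoredItems(String ignoredItems, Path dir) {
        Path target = dir.resolve(FILE_NAME);
        try {
            if (ignoredItems != null && !ignoredItems.isEmpty()) {
                // Something was ignored: write (or overwrite) the report.
                Files.write(target, ignoredItems.getBytes(StandardCharsets.UTF_8));
            } else {
                // Nothing was ignored: make sure no stale report is left behind.
                Files.deleteIfExists(target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Deleting the file in the empty case means the presence of the report on disk always reflects the most recent seed load.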

asCrawlUri

protected CrawlURI asCrawlUri(CandidateURI caUri)

preNext

protected void preNext(long now)
                throws java.lang.InterruptedException,
                       EndedException
Parameters:
now -
Throws:
java.lang.InterruptedException
EndedException
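
The shouldPause and shouldTerminate fields, together with the exceptions declared by preNext, suggest a gate that holds threads while paused and ends them once the crawl is terminated. A minimal sketch of such a gate follows. FrontierGateSketch, its nested EndedException stand-in, and tryNext are hypothetical, and the wait/notifyAll arrangement is an assumption about how such holding could work, not a description of this class.

```java
// Hypothetical sketch of a pause/terminate gate; not the Heritrix class.
final class FrontierGateSketch {
    // Stand-in for the crawler's EndedException.
    static final class EndedException extends Exception {
        private static final long serialVersionUID = 1L;
        EndedException(String message) {
            super(message);
        }
    }

    private boolean shouldPause;
    private boolean shouldTerminate;

    synchronized void pause() {
        shouldPause = true;
    }

    synchronized void unpause() {
        shouldPause = false;
        notifyAll(); // release any threads held in preNext()
    }

    synchronized void terminate() {
        shouldTerminate = true;
        notifyAll(); // wake held threads so they can observe termination
    }

    // Called before handing out a URI: blocks while paused, throws once ended.
    synchronized void preNext() throws InterruptedException, EndedException {
        while (shouldPause && !shouldTerminate) {
            wait();
        }
        if (shouldTerminate) {
            throw new EndedException("terminated");
        }
    }

    // Demo helper that reports the outcome of preNext() as a String.
    static String tryNext(FrontierGateSketch gate) {
        try {
            gate.preNext();
            return "released";
        } catch (EndedException e) {
            return "ended";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }
}
```

In this sketch termination takes precedence over pausing, so a paused thread is not left waiting forever once the crawl is ended.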

applySpecialHandling

protected void applySpecialHandling(CrawlURI curi)
Perform any special handling of the CrawlURI, such as promoting its URI to seed-status, or preferencing it because it is an embed.

Parameters:
curi -

noteAboutToEmit

protected void noteAboutToEmit(CrawlURI curi,
                               WorkQueue q)
Perform fixups on a CrawlURI about to be returned via next().

Parameters:
curi - CrawlURI about to be returned by next()
q - the queue from which the CrawlURI came

getServer

protected CrawlServer getServer(CrawlURI curi)
Parameters:
curi -
Returns:
the CrawlServer to be associated with this CrawlURI

retryDelayFor

protected long retryDelayFor(CrawlURI curi)
Return a suitable value to wait before retrying the given URI.

Parameters:
curi - CrawlURI to be retried
Returns:
millisecond delay before retry
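
ATTR_RETRY_DELAY is documented in seconds, while retryDelayFor returns milliseconds, so a unit conversion is implied. The sketch below shows one way that could look. RetryDelaySketch is hypothetical, and the idea that a per-URI value (compare the inherited A_RETRY_DELAY constant) overrides the configured default is an assumption.

```java
// Hypothetical sketch; not the Heritrix implementation.
final class RetryDelaySketch {
    // perUriDelaySecs: optional per-URI override in seconds (assumed), or null.
    // configuredDelaySecs: the configured retry delay in seconds.
    static long retryDelayMs(Long perUriDelaySecs, long configuredDelaySecs) {
        long secs = (perUriDelaySecs != null)
                ? perUriDelaySecs.longValue()
                : configuredDelaySecs;
        return secs * 1000L; // seconds to milliseconds
    }
}
```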

politenessDelayFor

protected long politenessDelayFor(CrawlURI curi)
Return the politeness delay for the given CrawlURI. Chiefly, this is the window during which no other URIs at the same host should be visited.

Parameters:
curi - The CrawlURI
Returns:
millisecond politeness delay
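
The descriptions of ATTR_DELAY_FACTOR, ATTR_MIN_DELAY and ATTR_MAX_DELAY together describe a clamped multiple of the last fetch duration. A minimal sketch of that arithmetic is given below. PolitenessDelaySketch is a hypothetical name, all values are assumed to be in milliseconds, and any robots.txt Crawl-Delay handling (ATTR_RESPECT_CRAWL_DELAY_UP_TO_SECS) is deliberately left out.

```java
// Hypothetical sketch of the delay arithmetic; not the Heritrix implementation.
final class PolitenessDelaySketch {
    static long politenessDelayMs(long fetchDurationMs, float delayFactor,
                                  long minDelayMs, long maxDelayMs) {
        // "how many multiples of last fetch elapsed time to wait"
        long wait = (long) (delayFactor * fetchDurationMs);
        // "always wait this long ... regardless of multiple"
        if (wait < minDelayMs) {
            wait = minDelayMs;
        }
        // "never wait more than this long, regardless of multiple"
        if (wait > maxDelayMs) {
            wait = maxDelayMs;
        }
        return wait;
    }
}
```

For example, with a factor of 5, a minimum of 2000 ms and a maximum of 30000 ms, a 200 ms fetch is raised to the minimum, a 1000 ms fetch yields 5000 ms, and a 10000 ms fetch is capped at the maximum.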

logLocalizedErrors

protected void logLocalizedErrors(CrawlURI curi)
Take note of any processor-local errors that have been entered into the CrawlURI.

Parameters:
curi -

scratchDirFor

protected java.io.File scratchDirFor(java.lang.String key)
Utility method to return a scratch dir for the given key's temp files. Every key gets its own subdir. To avoid having any one directory with thousands of files, there are also two levels of enclosing directory, named by the least-significant hex digits of the key string's Java hashCode.

Parameters:
key -
Returns:
File representing scratch directory
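
The directory layout described for scratchDirFor can be illustrated with the sketch below. It is an approximation under stated assumptions: ScratchDirSketch and the base parameter are hypothetical, and using two hex digits per enclosing level (lowest pair outermost) is an assumption, since the description does not say how many digits each level uses.

```java
import java.io.File;

// Hypothetical sketch of the scratch-directory layout; not the Heritrix code.
final class ScratchDirSketch {
    static File scratchDirFor(File base, String key) {
        // Zero-padded, 8-digit hex form of the key's hash code.
        String hex = String.format("%08x", key.hashCode());
        int len = hex.length();
        // Assumption: two hex digits per level, least-significant pair first.
        String outer = hex.substring(len - 2);
        String inner = hex.substring(len - 4, len - 2);
        // base/<outer>/<inner>/<key>: every key gets its own subdirectory.
        return new File(new File(new File(base, outer), inner), key);
    }
}
```

Spreading keys over two levels of up to 256 directories each keeps any single directory from accumulating thousands of entries.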

overMaxRetries

protected boolean overMaxRetries(CrawlURI curi)

importRecoverLog

public void importRecoverLog(java.lang.String pathToLog,
                             boolean retainFailures)
                      throws java.io.IOException
Description copied from interface: Frontier
Recover earlier state by reading a recovery log.

Some Frontiers are able to write detailed logs that can be loaded after a system crash to recover the state of the Frontier prior to the crash. This method is the one used to achieve this.

Specified by:
importRecoverLog in interface Frontier
Parameters:
pathToLog - The name (with full path) of the recover log.
retainFailures - If true, failures in log should count as having been included. (If false, failures will be ignored, meaning the corresponding URIs will be retried in the recovered crawl.)
Throws:
java.io.IOException - If problems occur reading the recover log.

kickUpdate

public void kickUpdate()
Description copied from interface: Frontier
Notify Frontier that it should consider updating configuration info that may have changed in external files.

Specified by:
kickUpdate in interface Frontier

log

protected void log(CrawlURI curi)
Log to the main crawl.log

Parameters:
curi -

isDisregarded

protected boolean isDisregarded(CrawlURI curi)

needsRetrying

protected boolean needsRetrying(CrawlURI curi)
Checks if a recently completed CrawlURI that did not finish successfully needs to be retried (processed again after some time elapses)

Parameters:
curi - The CrawlURI to check
Returns:
True if we need to retry.
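
A retry decision of this kind typically combines "is the failure transient?" with the ATTR_MAX_RETRIES limit checked by overMaxRetries. The sketch below illustrates that combination only. RetrySketch and its Status enum are hypothetical; the enum names echo a few FetchStatusCodes constants, but which statuses actually count as retryable in this class is an assumption.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of a retry decision; not the Heritrix implementation.
final class RetrySketch {
    // Illustrative statuses, loosely named after FetchStatusCodes constants.
    enum Status { SUCCESS, DEFERRED, CONNECT_FAILED, CONNECT_LOST, ROBOTS_PRECLUDED }

    // Assumption: only these transient conditions are worth retrying.
    private static final Set<Status> RETRYABLE =
            EnumSet.of(Status.DEFERRED, Status.CONNECT_FAILED, Status.CONNECT_LOST);

    // True once the URI has been attempted as often as the limit allows.
    static boolean overMaxRetries(int fetchAttempts, int maxRetries) {
        return fetchAttempts >= maxRetries;
    }

    static boolean needsRetrying(Status status, int fetchAttempts, int maxRetries) {
        return RETRYABLE.contains(status) && !overMaxRetries(fetchAttempts, maxRetries);
    }
}
```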

canonicalize

protected java.lang.String canonicalize(UURI uuri)
Canonicalize passed uuri. It would be sweeter if this canonicalize function was encapsulated by that which it canonicalizes, but because settings change with context (i.e. there may be overrides in operation for a particular URI) it's not so easy; each CandidateURI would need a reference to the settings system. That's awkward to pass in.

Parameters:
uuri - Candidate URI to canonicalize.
Returns:
Canonicalized version of passed uuri.

canonicalize

protected java.lang.String canonicalize(CandidateURI cauri)
Canonicalize passed CandidateURI. This method differs from canonicalize(UURI) in that it looks at the CandidateURI's context, possibly overriding any canonicalization effect if it could make us miss content. If canonicalization produces a URL that was 'alreadyseen', but the entry in the 'alreadyseen' database did nothing but redirect to the current URL, we won't get the current URL; we'll think we've already seen it. Examples would be archive.org redirecting to www.archive.org or the inverse, www.netarkivet.net redirecting to netarkivet.net (assuming the stripWWW rule is enabled).

Note that this method sets the forceFetch flag under certain circumstances.

Parameters:
cauri - CandidateURI to examine.
Returns:
Canonicalized version of passed cauri.
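
The redirect problem described for canonicalize(CandidateURI) can be made concrete with a toy example. In the sketch below, RedirectCanonicalizationSketch, stripWww and shouldForceFetch are hypothetical names, the single strip-www rule stands in for the configurable canonicalization rules, and treating "source and target canonicalize to the same string" as the trigger for forcing a fetch is an assumption about the circumstances mentioned above.

```java
// Hypothetical illustration of the redirect/already-seen problem.
final class RedirectCanonicalizationSketch {
    // Toy canonicalization rule: drop a leading "www." from the host.
    static String stripWww(String url) {
        return url.replaceFirst("^(https?://)www\\.", "$1");
    }

    // If a redirect's source and target canonicalize to the same string,
    // an already-seen check keyed on the canonical form would discard the
    // target, so the target should be fetched regardless.
    static boolean shouldForceFetch(String redirectSource, String redirectTarget) {
        return stripWww(redirectSource).equals(stripWww(redirectTarget));
    }
}
```

For archive.org redirecting to www.archive.org, both URLs canonicalize to the same string, so without forcing the fetch the redirect target would be skipped as already seen.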

getClassKey

public java.lang.String getClassKey(CandidateURI cauri)
Specified by:
getClassKey in interface Frontier
Parameters:
cauri - CandidateURI we're to get a key for.
Returns:
a String token representing a queue
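
The interplay between a forced queue name (ATTR_FORCE_QUEUE) and a queue assignment policy can be sketched as below. This is only an illustration: ClassKeySketch is hypothetical, the rule "a non-empty forced name wins" is an assumption, and keying by host name stands in for whatever QueueAssignmentPolicy is actually configured.

```java
import java.net.URI;

// Hypothetical sketch of queue-key selection; not the Heritrix implementation.
final class ClassKeySketch {
    static String classKey(String forceQueue, URI uri) {
        // Assumption: a non-empty forced queue name overrides the policy.
        if (forceQueue != null && !forceQueue.isEmpty()) {
            return forceQueue;
        }
        // Stand-in policy: one queue per host.
        String host = uri.getHost();
        return (host != null) ? host : "default";
    }
}
```

Keying queues by host is what lets per-host politeness delays be enforced queue by queue.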

getQueueAssignmentPolicy

protected QueueAssignmentPolicy getQueueAssignmentPolicy(CandidateURI cauri)

getFrontierJournal

public FrontierJournal getFrontierJournal()
Specified by:
getFrontierJournal in interface Frontier
Returns:
RecoveryJournal instance. May be null.

crawlEnding

public void crawlEnding(java.lang.String sExitMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is ending a crawl (for any reason)

Specified by:
crawlEnding in interface CrawlStatusListener
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlJob

crawlEnded

public void crawlEnded(java.lang.String sExitMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController has ended a crawl and is about to exit.

Specified by:
crawlEnded in interface CrawlStatusListener
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlJob

crawlStarted

public void crawlStarted(java.lang.String message)
Description copied from interface: CrawlStatusListener
Called on crawl start.

Specified by:
crawlStarted in interface CrawlStatusListener
Parameters:
message - Start message.

crawlPausing

public void crawlPausing(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is going to be paused.

Specified by:
crawlPausing in interface CrawlStatusListener
Parameters:
statusMessage - Should be STATUS_WAITING_FOR_PAUSE. Passed for convenience

crawlPaused

public void crawlPaused(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is actually paused (all threads are idle).

Specified by:
crawlPaused in interface CrawlStatusListener
Parameters:
statusMessage - Should be CrawlJob.STATUS_PAUSED. Passed for convenience

crawlResuming

public void crawlResuming(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is resuming a crawl that had been paused.

Specified by:
crawlResuming in interface CrawlStatusListener
Parameters:
statusMessage - Should be CrawlJob.STATUS_RUNNING. Passed for convenience

crawlCheckpoint

public void crawlCheckpoint(java.io.File checkpointDir)
                     throws java.lang.Exception
Description copied from interface: CrawlStatusListener
Called by CrawlController when checkpointing.

Specified by:
crawlCheckpoint in interface CrawlStatusListener
Parameters:
checkpointDir - Checkpoint dir. Write checkpoint state here.
Throws:
java.lang.Exception - A fatal exception. Any exceptions that are let out of this checkpoint are assumed fatal and terminate further checkpoint processing.
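
The class declares both plain long counters (for example queuedUriCount) and transient AtomicLong counterparts (liveQueuedUriCount). One plausible reading is that the live values are copied into the serializable fields at checkpoint time and restored afterwards. The sketch below illustrates that pattern only; CheckpointCounterSketch, prepareForCheckpoint and roundTrip are hypothetical, and whether this class works exactly this way is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of checkpointing a live counter; not the Heritrix class.
final class CheckpointCounterSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    private long queuedUriCount; // serialized snapshot
    private transient AtomicLong liveQueuedUriCount = new AtomicLong();

    void incrementQueuedUriCount() {
        liveQueuedUriCount.incrementAndGet();
    }

    long queuedUriCount() {
        return liveQueuedUriCount.get();
    }

    // Copy the live value into the field that is actually serialized.
    void prepareForCheckpoint() {
        queuedUriCount = liveQueuedUriCount.get();
    }

    // Rebuild the transient live counter from the snapshot on deserialization.
    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        liveQueuedUriCount = new AtomicLong(queuedUriCount);
    }

    // Demo helper: serialize to memory and read the object back.
    static CheckpointCounterSketch roundTrip(CheckpointCounterSketch original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (CheckpointCounterSketch) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Skipping prepareForCheckpoint in this sketch would serialize a stale snapshot, which is why the copy belongs in the checkpoint step.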

singleLineReport

public java.lang.String singleLineReport()
Description copied from interface: Reporter
Return a short single-line summary report as a String.

Specified by:
singleLineReport in interface Reporter
Returns:
String single-line summary report

reportTo

public void reportTo(java.io.PrintWriter writer)
Description copied from interface: Reporter
Make a default report to the passed-in Writer. Should be equivalent to reportTo(null, writer)

Specified by:
reportTo in interface Reporter
Parameters:
writer - to receive report


Copyright © 2003-2011 Internet Archive. All Rights Reserved.