AbstractFrontier (Heritrix 1.15.5-201106092337)

org.archive.crawler.frontier
Class AbstractFrontier

java.lang.Object
  javax.management.Attribute
      org.archive.crawler.settings.Type
          org.archive.crawler.settings.ComplexType
              org.archive.crawler.settings.ModuleType
                  org.archive.crawler.frontier.AbstractFrontier

All Implemented Interfaces:: java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants, FetchStatusCodes, CrawlStatusListener, Frontier, Reporter

Direct Known Subclasses:: WorkQueueFrontier

public abstract class AbstractFrontier
extends ModuleType
implements CrawlStatusListener, Frontier, FetchStatusCodes, CoreAttributeConstants, java.io.Serializable

Shared facilities for Frontier implementations.

Author:: gojomo
See Also:: Serialized Form

Nested Class Summary

Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
`ComplexType.MBeanAttributeInfoIterator`

Nested classes/interfaces inherited from interface org.archive.crawler.framework.Frontier
`Frontier.FrontierGroup`

Field Summary
`protected static java.lang.String`	`ACCEPTABLE_FORCE_QUEUE`
`static java.lang.String`	`ATTR_DELAY_FACTOR` how many multiples of last fetch elapsed time to wait before recontacting same server
`static java.lang.String`	`ATTR_FORCE_QUEUE` queue assignment to force onto CrawlURIs; intended to be overridden
`static java.lang.String`	`ATTR_MAX_DELAY` never wait more than this long, regardless of multiple
`static java.lang.String`	`ATTR_MAX_HOST_BANDWIDTH_USAGE` maximum per-host bandwidth usage
`static java.lang.String`	`ATTR_MAX_OVERALL_BANDWIDTH_USAGE` maximum overall bandwidth usage
`static java.lang.String`	`ATTR_MAX_RETRIES` maximum times to emit a CrawlURI without final disposition
`static java.lang.String`	`ATTR_MIN_DELAY` always wait this long after one completion before recontacting same server, regardless of multiple
`static java.lang.String`	`ATTR_PAUSE_AT_FINISH` whether to pause, rather than finish, when crawl appears done
`static java.lang.String`	`ATTR_PAUSE_AT_START` whether to pause at crawl start
`static java.lang.String`	`ATTR_PREFERENCE_EMBED_HOPS` number of hops of embeds (ERX) to bump to front of host queue
`static java.lang.String`	`ATTR_QUEUE_ASSIGNMENT_POLICY`
`protected static java.lang.String`	`ATTR_RECOVERY_ENABLED` Recover log on or off attribute.
`static java.lang.String`	`ATTR_RESPECT_CRAWL_DELAY_UP_TO_SECS` Whether to respect a 'Crawl-Delay' (in seconds) given in a site's robots.txt
`static java.lang.String`	`ATTR_RETRY_DELAY` for retryable problems, seconds to wait before a retry
`static java.lang.String`	`ATTR_SOURCE_TAG_SEEDS` whether to tag seeds with their own URI as a heritable 'source' string
`protected CrawlController`	`controller`
`protected static java.lang.Boolean`	`DEFAULT_ATTR_RECOVERY_ENABLED`
`protected static java.lang.Float`	`DEFAULT_DELAY_FACTOR`
`protected static java.lang.String`	`DEFAULT_FORCE_QUEUE`
`protected static java.lang.Integer`	`DEFAULT_MAX_DELAY`
`protected static java.lang.Integer`	`DEFAULT_MAX_HOST_BANDWIDTH_USAGE`
`protected static java.lang.Integer`	`DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE`
`protected static java.lang.Integer`	`DEFAULT_MAX_RETRIES`
`protected static java.lang.Integer`	`DEFAULT_MIN_DELAY`
`protected static java.lang.Boolean`	`DEFAULT_PAUSE_AT_FINISH`
`protected static java.lang.Boolean`	`DEFAULT_PAUSE_AT_START`
`protected static java.lang.Integer`	`DEFAULT_PREFERENCE_EMBED_HOPS`
`protected static java.lang.Integer`	`DEFAULT_RESPECT_CRAWL_DELAY_UP_TO_SECS`
`protected static java.lang.Long`	`DEFAULT_RETRY_DELAY`
`protected static java.lang.Boolean`	`DEFAULT_SOURCE_TAG_SEEDS`
`protected long`	`disregardedUriCount`
`protected long`	`failedFetchCount`
`static java.lang.String`	`IGNORED_SEEDS_FILENAME` file collecting report of ignored seed-file entries (if any)
`protected int`	`lastMaxBandwidthKB`
`protected java.util.concurrent.atomic.AtomicLong`	`liveDisregardedUriCount` URIs that are disregarded (for example, because of robots.txt rules)
`protected java.util.concurrent.atomic.AtomicLong`	`liveFailedFetchCount`
`protected java.util.concurrent.atomic.AtomicLong`	`liveQueuedUriCount` total URIs queued to be visited
`protected java.util.concurrent.atomic.AtomicLong`	`liveSucceededFetchCount`
`protected java.util.concurrent.atomic.AtomicLong`	`nextOrdinal` ordinal numbers to assign to created CrawlURIs
`protected long`	`processedBytesAfterLastEmittedURI`
`protected long`	`queuedUriCount`
`protected boolean`	`shouldPause` should the frontier hold any threads asking for URIs?
`protected boolean`	`shouldTerminate` should the frontier send an EndedException to any threads asking for URIs?
`protected long`	`succeededFetchCount`
`protected long`	`totalProcessedBytes` Used when bandwidth constraints are in effect.

Fields inherited from class org.archive.crawler.settings.ComplexType
`definition, definitionMap`

Fields inherited from interface org.archive.crawler.framework.Frontier
`ATTR_NAME`

Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE

Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX

Constructor Summary
`AbstractFrontier(java.lang.String name, java.lang.String description)`

Method Summary
`protected void`	`applySpecialHandling(CrawlURI curi)` Perform any special handling of the CrawlURI, such as promoting its URI to seed-status, or preferencing it because it is an embed.
`protected CrawlURI`	`asCrawlUri(CandidateURI caUri)`
`protected java.lang.String`	`canonicalize(CandidateURI cauri)` Canonicalize passed CandidateURI.
`protected java.lang.String`	`canonicalize(UURI uuri)` Canonicalize passed uuri.
`void`	`crawlCheckpoint(java.io.File checkpointDir)` Called by `CrawlController` when checkpointing.
`void`	`crawlEnded(java.lang.String sExitMessage)` Called when a CrawlController has ended a crawl and is about to exit.
`void`	`crawlEnding(java.lang.String sExitMessage)` Called when a CrawlController is ending a crawl (for any reason)
`void`	`crawlPaused(java.lang.String statusMessage)` Called when a CrawlController is actually paused (all threads are idle).
`void`	`crawlPausing(java.lang.String statusMessage)` Called when a CrawlController is going to be paused.
`void`	`crawlResuming(java.lang.String statusMessage)` Called when a CrawlController is resuming a crawl that had been paused.
`void`	`crawlStarted(java.lang.String message)` Called on crawl start.
`protected void`	`decrementQueuedCount(long numberOfDeletes)` Note that a number of queued Uris have been deleted.
`long`	`disregardedUriCount()` Number of URIs that were scheduled at one point but have been disregarded.
`protected void`	`doJournalAdded(CrawlURI c)`
`protected void`	`doJournalDisregarded(CrawlURI c)`
`protected void`	`doJournalEmitted(CrawlURI c)`
`protected void`	`doJournalFinishedFailure(CrawlURI c)`
`protected void`	`doJournalFinishedSuccess(CrawlURI c)`
`protected void`	`doJournalRescheduled(CrawlURI c)`
`long`	`failedFetchCount()` Number of URIs that failed to process.
`long`	`finishedUriCount()` Number of finished URIs.
`java.lang.String`	`getClassKey(CandidateURI cauri)`
`FrontierJournal`	`getFrontierJournal()`
`protected QueueAssignmentPolicy`	`getQueueAssignmentPolicy(CandidateURI cauri)`
`protected CrawlServer`	`getServer(CrawlURI curi)`
`void`	`importRecoverLog(java.lang.String pathToLog, boolean retainFailures)` Recover earlier state by reading a recovery log.
`protected void`	`incrementDisregardedUriCount()` Increment the running count of disregarded URIs.
`protected void`	`incrementFailedFetchCount()` Increment the running count of failed URIs.
`protected void`	`incrementQueuedUriCount()` Increment the running count of queued URIs.
`protected void`	`incrementQueuedUriCount(long increment)` Increment the running count of queued URIs.
`protected void`	`incrementSucceededFetchCount()` Increment the running count of successfully fetched URIs.
`void`	`initialize(CrawlController c)` Initialize the Frontier.
`protected boolean`	`isDisregarded(CrawlURI curi)`
`boolean`	`isEmpty()` Frontier is empty only if all queues are empty and no URIs are in-process
`void`	`kickUpdate()` Notify Frontier that it should consider updating configuration info that may have changed in external files.
`void`	`loadSeeds()` Load up the seeds.
`protected void`	`log(CrawlURI curi)` Log to the main crawl.log
`protected void`	`logLocalizedErrors(CrawlURI curi)` Take note of any processor-local errors that have been entered into the CrawlURI.
`protected boolean`	`needsRetrying(CrawlURI curi)` Checks if a recently completed CrawlURI that did not finish successfully needs to be retried (processed again after some time elapses)
`protected void`	`noteAboutToEmit(CrawlURI curi, WorkQueue q)` Perform fixups on a CrawlURI about to be returned via next().
`protected boolean`	`overMaxRetries(CrawlURI curi)`
`void`	`pause()` Notify Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise.
`protected long`	`politenessDelayFor(CrawlURI curi)` Update any scheduling structures with the new information in this CrawlURI.
`protected void`	`preNext(long now)`
`long`	`queuedUriCount()` Number of queued URIs.
`void`	`reportTo(java.io.PrintWriter writer)` Make a default report to the passed-in Writer.
`protected long`	`retryDelayFor(CrawlURI curi)` Return a suitable value to wait before retrying the given URI.
`static void`	`saveIgnoredItems(java.lang.String ignoredItems, java.io.File dir)` Dump ignored seed items (if any) to disk; delete file otherwise.
`protected java.io.File`	`scratchDirFor(java.lang.String key)` Utility method to return a scratch dir for the given key's temp files.
`java.lang.String`	`singleLineReport()` Return a short single-line summary report as a String.
`void`	`start()` Request that Frontier allow crawling to begin.
`long`	`succeededFetchCount()` Number of successfully processed URIs.
`protected void`	`tally(CrawlURI curi, CrawlSubstats.Stage stage)` Report CrawlURI to each of the three 'substats' accumulators (group/queue, server, host) for a given stage.
`void`	`terminate()` Notify Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException.
`long`	`totalBytesWritten()` Deprecated. misnomer; use StatisticsTracking figures instead
`void`	`unpause()` Resumes the release of URIs to crawl, allowing worker ToeThreads to proceed.

Methods inherited from class org.archive.crawler.settings.ModuleType
`addElement, listUsedFiles`

Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute

Methods inherited from class org.archive.crawler.settings.Type
`addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient`

Methods inherited from class javax.management.Attribute
`getName, hashCode`

Methods inherited from class java.lang.Object
`clone, finalize, getClass, notify, notifyAll, wait, wait, wait`

Methods inherited from interface org.archive.crawler.framework.Frontier
`averageDepth, congestionRatio, considerIncluded, deepestUri, deleted, deleteURIs, deleteURIs, discoveredUriCount, finalTasks, finished, getGroup, getInitialMarker, getURIsList, next, schedule`

Methods inherited from interface org.archive.util.Reporter
`getReports, reportTo, singleLineLegend, singleLineReportTo`

Field Detail

controller

protected transient CrawlController controller

nextOrdinal

protected java.util.concurrent.atomic.AtomicLong nextOrdinal

ordinal numbers to assign to created CrawlURIs

shouldPause

protected boolean shouldPause

should the frontier hold any threads asking for URIs?

shouldTerminate

protected transient boolean shouldTerminate

should the frontier send an EndedException to any threads asking for URIs?

ATTR_DELAY_FACTOR

public static final java.lang.String ATTR_DELAY_FACTOR

how many multiples of last fetch elapsed time to wait before recontacting same server

See Also:: Constant Field Values

DEFAULT_DELAY_FACTOR

protected static final java.lang.Float DEFAULT_DELAY_FACTOR

ATTR_MIN_DELAY

public static final java.lang.String ATTR_MIN_DELAY

always wait this long after one completion before recontacting same server, regardless of multiple

See Also:: Constant Field Values

DEFAULT_MIN_DELAY

protected static final java.lang.Integer DEFAULT_MIN_DELAY

ATTR_RESPECT_CRAWL_DELAY_UP_TO_SECS

public static final java.lang.String ATTR_RESPECT_CRAWL_DELAY_UP_TO_SECS

Whether to respect a 'Crawl-Delay' (in seconds) given in a site's robots.txt

See Also:: Constant Field Values

DEFAULT_RESPECT_CRAWL_DELAY_UP_TO_SECS

protected static final java.lang.Integer DEFAULT_RESPECT_CRAWL_DELAY_UP_TO_SECS

ATTR_MAX_DELAY

public static final java.lang.String ATTR_MAX_DELAY

never wait more than this long, regardless of multiple

See Also:: Constant Field Values

DEFAULT_MAX_DELAY

protected static final java.lang.Integer DEFAULT_MAX_DELAY

ATTR_PREFERENCE_EMBED_HOPS

public static final java.lang.String ATTR_PREFERENCE_EMBED_HOPS

number of hops of embeds (ERX) to bump to front of host queue

See Also:: Constant Field Values

DEFAULT_PREFERENCE_EMBED_HOPS

protected static final java.lang.Integer DEFAULT_PREFERENCE_EMBED_HOPS

ATTR_MAX_HOST_BANDWIDTH_USAGE

public static final java.lang.String ATTR_MAX_HOST_BANDWIDTH_USAGE

maximum per-host bandwidth usage

See Also:: Constant Field Values

DEFAULT_MAX_HOST_BANDWIDTH_USAGE

protected static final java.lang.Integer DEFAULT_MAX_HOST_BANDWIDTH_USAGE

ATTR_MAX_OVERALL_BANDWIDTH_USAGE

public static final java.lang.String ATTR_MAX_OVERALL_BANDWIDTH_USAGE

maximum overall bandwidth usage

See Also:: Constant Field Values

DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE

protected static final java.lang.Integer DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE

ATTR_RETRY_DELAY

public static final java.lang.String ATTR_RETRY_DELAY

for retryable problems, seconds to wait before a retry

See Also:: Constant Field Values

DEFAULT_RETRY_DELAY

protected static final java.lang.Long DEFAULT_RETRY_DELAY

ATTR_MAX_RETRIES

public static final java.lang.String ATTR_MAX_RETRIES

maximum times to emit a CrawlURI without final disposition

See Also:: Constant Field Values

DEFAULT_MAX_RETRIES

protected static final java.lang.Integer DEFAULT_MAX_RETRIES

ATTR_QUEUE_ASSIGNMENT_POLICY

public static final java.lang.String ATTR_QUEUE_ASSIGNMENT_POLICY

See Also:: Constant Field Values

ATTR_FORCE_QUEUE

public static final java.lang.String ATTR_FORCE_QUEUE

queue assignment to force onto CrawlURIs; intended to be overridden

See Also:: Constant Field Values

DEFAULT_FORCE_QUEUE

protected static final java.lang.String DEFAULT_FORCE_QUEUE

See Also:: Constant Field Values

ACCEPTABLE_FORCE_QUEUE

protected static final java.lang.String ACCEPTABLE_FORCE_QUEUE

See Also:: Constant Field Values

ATTR_PAUSE_AT_FINISH

public static final java.lang.String ATTR_PAUSE_AT_FINISH

whether to pause, rather than finish, when crawl appears done

See Also:: Constant Field Values

DEFAULT_PAUSE_AT_FINISH

protected static final java.lang.Boolean DEFAULT_PAUSE_AT_FINISH

ATTR_PAUSE_AT_START

public static final java.lang.String ATTR_PAUSE_AT_START

whether to pause at crawl start

See Also:: Constant Field Values

DEFAULT_PAUSE_AT_START

protected static final java.lang.Boolean DEFAULT_PAUSE_AT_START

ATTR_SOURCE_TAG_SEEDS

public static final java.lang.String ATTR_SOURCE_TAG_SEEDS

whether to tag seeds with their own URI as a heritable 'source' string

See Also:: Constant Field Values

DEFAULT_SOURCE_TAG_SEEDS

protected static final java.lang.Boolean DEFAULT_SOURCE_TAG_SEEDS

ATTR_RECOVERY_ENABLED

protected static final java.lang.String ATTR_RECOVERY_ENABLED

Recover log on or off attribute.

See Also:: Constant Field Values

DEFAULT_ATTR_RECOVERY_ENABLED

protected static final java.lang.Boolean DEFAULT_ATTR_RECOVERY_ENABLED

queuedUriCount

protected long queuedUriCount

succeededFetchCount

protected long succeededFetchCount

failedFetchCount

protected long failedFetchCount

disregardedUriCount

protected long disregardedUriCount

liveQueuedUriCount

protected transient java.util.concurrent.atomic.AtomicLong liveQueuedUriCount

total URIs queued to be visited

liveSucceededFetchCount

protected transient java.util.concurrent.atomic.AtomicLong liveSucceededFetchCount

liveFailedFetchCount

protected transient java.util.concurrent.atomic.AtomicLong liveFailedFetchCount

liveDisregardedUriCount

protected transient java.util.concurrent.atomic.AtomicLong liveDisregardedUriCount

URIs that are disregarded (for example, because of robots.txt rules)

totalProcessedBytes

protected long totalProcessedBytes

Used when bandwidth constraints are in effect.

processedBytesAfterLastEmittedURI

protected long processedBytesAfterLastEmittedURI

lastMaxBandwidthKB

protected int lastMaxBandwidthKB

IGNORED_SEEDS_FILENAME

public static final java.lang.String IGNORED_SEEDS_FILENAME

file collecting report of ignored seed-file entries (if any)

See Also:: Constant Field Values

Constructor Detail

AbstractFrontier

public AbstractFrontier(java.lang.String name,
                        java.lang.String description)

Parameters:: name - Name of this frontier.; description - Description for this frontier.

Method Detail

start

public void start()

Description copied from interface: Frontier

Request that Frontier allow crawling to begin. Usually just unpauses Frontier, if paused.

Specified by:: start in interface Frontier

pause

public void pause()

Description copied from interface: Frontier

Notify Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise.

Specified by:: pause in interface Frontier

unpause

public void unpause()

Description copied from interface: Frontier

Resumes the release of URIs to crawl, allowing worker ToeThreads to proceed.

Specified by:: unpause in interface Frontier
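
The pause/unpause contract described above (hold every thread that asks for a URI until told otherwise) can be pictured with a plain Java monitor. This is only a minimal sketch under stated assumptions: `PauseGateSketch`, `awaitUnpaused()` and `isPaused()` are hypothetical names, and nothing here is taken from the Heritrix source.

```java
/** Hypothetical sketch of a pause gate; not the Heritrix implementation. */
final class PauseGateSketch {

    // Mirrors the idea of the shouldPause field: hold threads while true.
    private boolean shouldPause = false;

    /** Ask the gate to hold all threads that request a URI. */
    synchronized void pause() {
        shouldPause = true;
    }

    /** Release held threads and let new requests through. */
    synchronized void unpause() {
        shouldPause = false;
        notifyAll(); // wake every thread blocked in awaitUnpaused()
    }

    /** Called by a worker before taking a URI; blocks while paused. */
    synchronized void awaitUnpaused() throws InterruptedException {
        while (shouldPause) { // loop guards against spurious wakeups
            wait();
        }
    }

    synchronized boolean isPaused() {
        return shouldPause;
    }

    public static void main(String[] args) throws InterruptedException {
        PauseGateSketch gate = new PauseGateSketch();
        gate.pause();
        gate.unpause();
        gate.awaitUnpaused(); // returns immediately because the gate is open
        System.out.println("paused: " + gate.isPaused());
    }
}
```

A worker thread would call `awaitUnpaused()` before each request for work, so a single `unpause()` releases all waiting workers at once.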

initialize

public void initialize(CrawlController c)
                throws FatalConfigurationException,
                       java.io.IOException

Description copied from interface: Frontier

Initialize the Frontier.

This method is invoked by the CrawlController once it has created the Frontier. The constructor of the Frontier should only contain code for setting up its settings framework. This method should contain all other 'startup' code.

Specified by:: initialize in interface Frontier

Parameters:: c - The CrawlController that created the Frontier.
Throws:: FatalConfigurationException - If provided settings are illegal or otherwise unusable.; java.io.IOException - If there is a problem reading settings or seeds file from disk.

terminate

public void terminate()

Description copied from interface: Frontier

Notify Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException.

Specified by:: terminate in interface Frontier

tally

protected void tally(CrawlURI curi,
                     CrawlSubstats.Stage stage)

Report CrawlURI to each of the three 'substats' accumulators (group/queue, server, host) for a given stage.

Parameters:: curi -; stage -

doJournalFinishedSuccess

protected void doJournalFinishedSuccess(CrawlURI c)

doJournalAdded

protected void doJournalAdded(CrawlURI c)

doJournalRescheduled

protected void doJournalRescheduled(CrawlURI c)

doJournalFinishedFailure

protected void doJournalFinishedFailure(CrawlURI c)

doJournalDisregarded

protected void doJournalDisregarded(CrawlURI c)

doJournalEmitted

protected void doJournalEmitted(CrawlURI c)

isEmpty

public boolean isEmpty()

Frontier is empty only if all queues are empty and no URIs are in-process

Specified by:: isEmpty in interface Frontier

Returns:: True if queues are empty.

incrementQueuedUriCount

protected void incrementQueuedUriCount()

Increment the running count of queued URIs.

incrementQueuedUriCount

protected void incrementQueuedUriCount(long increment)

Increment the running count of queued URIs. Synchronized because operations on longs are not atomic.

Parameters:: increment - amount to increment the queued count

decrementQueuedCount

protected void decrementQueuedCount(long numberOfDeletes)

Note that a number of queued Uris have been deleted.

Parameters:: numberOfDeletes -
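
The queued-count methods above can be illustrated with an `AtomicLong`, the type used by the `liveQueuedUriCount` field. The following is a minimal sketch, not the Heritrix code: the class `UriCounterSketch` is hypothetical, and whether the real methods delegate to the atomic field in exactly this way is an assumption.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of a thread-safe queued-URI counter. */
final class UriCounterSketch {

    // Atomic counter: safe to update from many worker threads
    // without explicit synchronization.
    private final AtomicLong liveQueuedUriCount = new AtomicLong();

    /** Count one newly queued URI. */
    void incrementQueuedUriCount() {
        liveQueuedUriCount.incrementAndGet();
    }

    /** Count several newly queued URIs at once. */
    void incrementQueuedUriCount(long increment) {
        liveQueuedUriCount.addAndGet(increment);
    }

    /** Note that a number of queued URIs have been deleted. */
    void decrementQueuedCount(long numberOfDeletes) {
        liveQueuedUriCount.addAndGet(-numberOfDeletes);
    }

    /** Current number of queued URIs. */
    long queuedUriCount() {
        return liveQueuedUriCount.get();
    }

    public static void main(String[] args) {
        UriCounterSketch counter = new UriCounterSketch();
        counter.incrementQueuedUriCount();
        counter.incrementQueuedUriCount(5);
        counter.decrementQueuedCount(2);
        System.out.println(counter.queuedUriCount());
    }
}
```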

queuedUriCount

public long queuedUriCount()

Specified by:: queuedUriCount in interface Frontier

Returns:: Number of queued URIs.
See Also:: Frontier.queuedUriCount()

finishedUriCount

public long finishedUriCount()

Specified by:: finishedUriCount in interface Frontier

Returns:: Number of finished URIs.
See Also:: Frontier.finishedUriCount()

incrementSucceededFetchCount

protected void incrementSucceededFetchCount()

Increment the running count of successfully fetched URIs.

succeededFetchCount

public long succeededFetchCount()

Specified by:: succeededFetchCount in interface Frontier

Returns:: Number of successfully processed URIs.
See Also:: Frontier.succeededFetchCount()

incrementFailedFetchCount

protected void incrementFailedFetchCount()

Increment the running count of failed URIs.

failedFetchCount

public long failedFetchCount()

Specified by:: failedFetchCount in interface Frontier

Returns:: Number of URIs that failed to process.
See Also:: Frontier.failedFetchCount()

incrementDisregardedUriCount

protected void incrementDisregardedUriCount()

Increment the running count of disregarded URIs. Synchronized because operations on longs are not atomic.

disregardedUriCount

public long disregardedUriCount()

Description copied from interface: Frontier

Number of URIs that were scheduled at one point but have been disregarded.

Counts any URI that is scheduled only to be disregarded because it is determined to lie outside the scope of the crawl. Most commonly this will be due to robots.txt exclusions.

Specified by:: disregardedUriCount in interface Frontier

Returns:: The number of URIs that have been disregarded.

totalBytesWritten

public long totalBytesWritten()

Deprecated. misnomer; use StatisticsTracking figures instead

Description copied from interface: Frontier

Total number of bytes contained in all URIs that have been processed.

Specified by:: totalBytesWritten in interface Frontier

Returns:: The total amounts of bytes in all processed URIs.

loadSeeds

public void loadSeeds()

Load up the seeds. This method is called on initialize and from within the CrawlController when it wants to force a reload of configuration.

Specified by:: loadSeeds in interface Frontier

See Also:: CrawlController.kickUpdate()

saveIgnoredItems

public static void saveIgnoredItems(java.lang.String ignoredItems,
                                    java.io.File dir)

Dump ignored seed items (if any) to disk; delete file otherwise. Static to allow non-derived sibling classes (frontiers not yet subclassed here) to reuse.

Parameters:: ignoredItems -; dir -
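
The "dump if any, delete otherwise" behaviour described for saveIgnoredItems might look roughly like the sketch below. It is an illustration under assumptions, not the Heritrix implementation: the class `IgnoredItemsSketch` and the file name `ignored-sketch.txt` are hypothetical (the real name is the value of `IGNORED_SEEDS_FILENAME`, which this page does not show), and the UTF-8 encoding is a choice made for the example.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Hypothetical sketch; not the Heritrix implementation. */
final class IgnoredItemsSketch {

    // Assumed file name, for illustration only.
    static final String IGNORED_FILENAME = "ignored-sketch.txt";

    /**
     * Write the ignored items into dir if there are any; otherwise
     * remove a report left over from an earlier run.
     */
    static void saveIgnoredItems(String ignoredItems, File dir) {
        File target = new File(dir, IGNORED_FILENAME);
        try {
            if (ignoredItems == null || ignoredItems.isEmpty()) {
                Files.deleteIfExists(target.toPath());
            } else {
                Files.write(target.toPath(),
                        ignoredItems.getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers need no throws clause.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        saveIgnoredItems("not-a-valid-seed-line\n", dir);
        System.out.println(new File(dir, IGNORED_FILENAME).exists());
        saveIgnoredItems("", dir); // nothing ignored: the report is removed
    }
}
```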

asCrawlUri

protected CrawlURI asCrawlUri(CandidateURI caUri)

preNext

protected void preNext(long now)
                throws java.lang.InterruptedException,
                       EndedException

Parameters:: now -
Throws:: java.lang.InterruptedException; EndedException

applySpecialHandling

protected void applySpecialHandling(CrawlURI curi)

Perform any special handling of the CrawlURI, such as promoting its URI to seed-status, or preferencing it because it is an embed.

Parameters:: curi -

noteAboutToEmit

protected void noteAboutToEmit(CrawlURI curi,
                               WorkQueue q)

Perform fixups on a CrawlURI about to be returned via next().

Parameters:: curi - CrawlURI about to be returned by next(); q - the queue from which the CrawlURI came

getServer

protected CrawlServer getServer(CrawlURI curi)

Parameters:: curi -
Returns:: the CrawlServer to be associated with this CrawlURI

retryDelayFor

protected long retryDelayFor(CrawlURI curi)

Return a suitable value to wait before retrying the given URI.

Parameters:: curi - CrawlURI to be retried
Returns:: millisecond delay before retry

politenessDelayFor

protected long politenessDelayFor(CrawlURI curi)

Update any scheduling structures with the new information in this CrawlURI. Chiefly means make necessary arrangements for no other URIs at the same host to be visited within the appropriate politeness window.

Parameters:: curi - The CrawlURI
Returns:: millisecond politeness delay
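
To make the interplay of the delay-factor, min-delay and max-delay settings concrete, here is a minimal sketch of one way they could combine, based only on their field descriptions. It is not the Heritrix implementation: `PolitenessDelaySketch` and `delayMs` are hypothetical names, all values are taken to be milliseconds, and the robots.txt Crawl-Delay and bandwidth settings are ignored.

```java
/** Hypothetical sketch; not the Heritrix implementation. */
final class PolitenessDelaySketch {

    /**
     * Scale the duration of the last fetch by delayFactor, then clamp
     * the result into the range [minDelayMs, maxDelayMs].
     */
    static long delayMs(long lastFetchDurationMs, float delayFactor,
                        long minDelayMs, long maxDelayMs) {
        long delay = (long) (lastFetchDurationMs * delayFactor);
        if (delay < minDelayMs) {
            delay = minDelayMs; // always wait at least this long
        }
        if (delay > maxDelayMs) {
            delay = maxDelayMs; // never wait longer than this
        }
        return delay;
    }

    public static void main(String[] args) {
        // Assumed example settings: factor 5, min 2000 ms, max 5000 ms.
        System.out.println(delayMs(600, 5.0f, 2000, 5000));
    }
}
```

With these example settings, a fast fetch is padded up to the minimum and a slow fetch is capped at the maximum, so a slow server is never left waiting longer than max-delay.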

logLocalizedErrors

protected void logLocalizedErrors(CrawlURI curi)

Take note of any processor-local errors that have been entered into the CrawlURI.

Parameters:: curi -

scratchDirFor

protected java.io.File scratchDirFor(java.lang.String key)

Utility method to return a scratch dir for the given key's temp files. Every key gets its own subdir. To avoid having any one directory with thousands of files, there are also two levels of enclosing directory named by the least-significant hex digits of the key string's java hashcode.

Parameters:: key -
Returns:: File representing scratch directory
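
The directory layout described above can be sketched as follows. This is an illustration only: the class `ScratchDirSketch` and the explicit `scratchRoot` parameter are hypothetical, and the order in which the hex digits are used for the two enclosing levels is an assumption.

```java
import java.io.File;

/** Hypothetical sketch; not the Heritrix implementation. */
final class ScratchDirSketch {

    /**
     * Build scratchRoot/XX/YY/key, where XX and YY are hex digits
     * taken from the low end of the key's hash code.
     */
    static File scratchDirFor(File scratchRoot, String key) {
        String hex = Integer.toHexString(key.hashCode());
        while (hex.length() < 4) {
            hex = "0" + hex; // left-pad so four digits are always available
        }
        int len = hex.length();
        String outer = hex.substring(len - 2);          // last two digits
        String inner = hex.substring(len - 4, len - 2); // the two before them
        return new File(new File(new File(scratchRoot, outer), inner), key);
    }

    public static void main(String[] args) {
        System.out.println(scratchDirFor(new File("state"), "example.org"));
    }
}
```

Spreading keys over 256 x 256 enclosing directories keeps any single directory from accumulating thousands of entries.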

overMaxRetries

protected boolean overMaxRetries(CrawlURI curi)

importRecoverLog

public void importRecoverLog(java.lang.String pathToLog,
                             boolean retainFailures)
                      throws java.io.IOException

Description copied from interface: Frontier

Recover earlier state by reading a recovery log.

Some Frontiers are able to write detailed logs that can be loaded after a system crash to recover the state of the Frontier prior to the crash. This method is the one used to achieve this.

Specified by:: importRecoverLog in interface Frontier

Parameters:: pathToLog - The name (with full path) of the recover log.; retainFailures - If true, failures in log should count as having been included. (If false, failures will be ignored, meaning the corresponding URIs will be retried in the recovered crawl.)
Throws:: java.io.IOException - If problems occur reading the recover log.

kickUpdate

public void kickUpdate()

Description copied from interface: Frontier

Notify Frontier that it should consider updating configuration info that may have changed in external files.

Specified by:: kickUpdate in interface Frontier

log

protected void log(CrawlURI curi)

Log to the main crawl.log

Parameters:: curi -

isDisregarded

protected boolean isDisregarded(CrawlURI curi)

needsRetrying

protected boolean needsRetrying(CrawlURI curi)

Checks if a recently completed CrawlURI that did not finish successfully needs to be retried (processed again after some time elapses)

Parameters:: curi - The CrawlURI to check
Returns:: True if we need to retry.
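
How needsRetrying, overMaxRetries and retryDelayFor relate can be sketched in a few lines. Everything below is an assumption made for illustration: the class `RetrySketch`, the plain `int` and `boolean` parameters standing in for CrawlURI state, the `>=` comparison against max-retries, and the conversion of the retry-delay setting from seconds to milliseconds.

```java
/** Hypothetical sketch; not the Heritrix implementation. */
final class RetrySketch {

    /** True once the URI has been attempted the maximum number of times. */
    static boolean overMaxRetries(int fetchAttempts, int maxRetries) {
        return fetchAttempts >= maxRetries;
    }

    /** Retry only retryable failures that still have attempts left. */
    static boolean needsRetrying(boolean retryableFailure,
                                 int fetchAttempts, int maxRetries) {
        return retryableFailure && !overMaxRetries(fetchAttempts, maxRetries);
    }

    /** Convert a retry delay configured in seconds to milliseconds. */
    static long retryDelayMs(long retryDelaySeconds) {
        return retryDelaySeconds * 1000L;
    }

    public static void main(String[] args) {
        System.out.println(needsRetrying(true, 1, 3));
        System.out.println(retryDelayMs(900));
    }
}
```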

canonicalize

protected java.lang.String canonicalize(UURI uuri)

Canonicalize passed uuri. It would be sweeter if this canonicalize function was encapsulated by that which it canonicalizes, but because settings change with context -- i.e. there may be overrides in operation for a particular URI -- it's not so easy; each CandidateURI would need a reference to the settings system. That's awkward to pass in.

Parameters:: uuri - Candidate URI to canonicalize.
Returns:: Canonicalized version of passed uuri.

canonicalize

protected java.lang.String canonicalize(CandidateURI cauri)

Canonicalize passed CandidateURI. This method differs from canonicalize(UURI) in that it takes a look at the CandidateURI context, possibly overriding any canonicalization effect if it could make us miss content. If canonicalization produces a URL that was 'alreadyseen', but the entry in the 'alreadyseen' database did nothing but redirect to the current URL, we won't get the current URL; we'll think we've already seen it. Examples would be archive.org redirecting to www.archive.org or the inverse, www.netarkivet.net redirecting to netarkivet.net (assuming stripWWW rule enabled).

Note: under some circumstances this method sets the forceFetch flag.

Parameters:: cauri - CandidateURI to examine.
Returns:: Canonicalized cauri.
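The redirect case described above can be sketched as follows. This is an illustration under stated assumptions, not the Heritrix code: the `Candidate` class, its `viaRedirect` flag and the two-rule canonicalizer (lowercase scheme and host, strip a leading `www.`) are hypothetical stand-ins for CandidateURI and the configured canonicalization rules.

```java
import java.util.Locale;

// Hypothetical sketch; class and field names are assumptions for illustration.
final class CanonicalizeSketch {
    /** Stand-in for a candidate URI plus the context it was discovered in. */
    static final class Candidate {
        final String uri;
        final String via;          // URI that led here; may be null
        final boolean viaRedirect; // true if discovered through a redirect
        boolean forceFetch;

        Candidate(String uri, String via, boolean viaRedirect) {
            this.uri = uri;
            this.via = via;
            this.viaRedirect = viaRedirect;
        }
    }

    /** Illustrative rules: lowercase scheme and host, strip a leading "www.". */
    static String canonicalize(String uri) {
        int schemeEnd = uri.indexOf("://");
        if (schemeEnd < 0) {
            return uri; // not a hierarchical URI; leave untouched
        }
        int hostStart = schemeEnd + 3;
        int hostEnd = uri.indexOf('/', hostStart);
        if (hostEnd < 0) {
            hostEnd = uri.length();
        }
        String scheme = uri.substring(0, schemeEnd).toLowerCase(Locale.ROOT);
        String host = uri.substring(hostStart, hostEnd).toLowerCase(Locale.ROOT);
        if (host.startsWith("www.")) {
            host = host.substring(4);
        }
        return scheme + "://" + host + uri.substring(hostEnd);
    }

    /**
     * Context-aware variant: if a redirect leads to a different URI whose
     * canonical form equals that of its source, force the fetch so the
     * already-seen check cannot hide the redirect target.
     */
    static String canonicalize(Candidate c) {
        String canon = canonicalize(c.uri);
        if (c.viaRedirect && c.via != null
                && !c.uri.equals(c.via)
                && canonicalize(c.via).equals(canon)) {
            c.forceFetch = true;
        }
        return canon;
    }
}
```

In this sketch, a redirect from `http://archive.org/` to `http://www.archive.org/` canonicalizes both URIs to the same string, so the flag is set; the same URI reached through an ordinary link leaves the flag untouched.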

getClassKey

public java.lang.String getClassKey(CandidateURI cauri)

Specified by:: getClassKey in interface Frontier

Parameters:: cauri - CandidateURI we're to get a key for.
Returns:: a String token representing a queue
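As a rough illustration of what a queue token might look like, the sketch below derives the key from the URI's host and lets a non-empty forced queue name override it. Both the host-based rule and the `forceQueue` parameter are assumptions made for this example; the actual key is produced by the configured QueueAssignmentPolicy.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical sketch of deriving a queue key; not the Heritrix policy classes.
final class ClassKeySketch {
    /**
     * Returns the forced queue name if one is set, otherwise the
     * lowercased host, or "default" when the URI has no host.
     */
    static String classKey(String uri, String forceQueue) {
        if (forceQueue != null && !forceQueue.isEmpty()) {
            return forceQueue;
        }
        String host = URI.create(uri).getHost();
        return host == null ? "default" : host.toLowerCase(Locale.ROOT);
    }
}
```

Keying on the host groups all URIs for one server into one queue, which is what makes per-server politeness delays straightforward to apply.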

getQueueAssignmentPolicy

protected QueueAssignmentPolicy getQueueAssignmentPolicy(CandidateURI cauri)

getFrontierJournal

public FrontierJournal getFrontierJournal()

Specified by:: getFrontierJournal in interface Frontier

Returns:: RecoveryJournal instance. May be null.

crawlEnding

public void crawlEnding(java.lang.String sExitMessage)

Description copied from interface: CrawlStatusListener

Called when a CrawlController is ending a crawl (for any reason)

Specified by:: crawlEnding in interface CrawlStatusListener

Parameters:: sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:: CrawlJob

crawlEnded

public void crawlEnded(java.lang.String sExitMessage)

Description copied from interface: CrawlStatusListener

Called when a CrawlController has ended a crawl and is about to exit.

Specified by:: crawlEnded in interface CrawlStatusListener

Parameters:: sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:: CrawlJob

crawlStarted

public void crawlStarted(java.lang.String message)

Description copied from interface: CrawlStatusListener

Called on crawl start.

Specified by:: crawlStarted in interface CrawlStatusListener

Parameters:: message - Start message.

crawlPausing

public void crawlPausing(java.lang.String statusMessage)

Description copied from interface: CrawlStatusListener

Called when a CrawlController is going to be paused.

Specified by:: crawlPausing in interface CrawlStatusListener

Parameters:: statusMessage - Should be STATUS_WAITING_FOR_PAUSE. Passed for convenience

crawlPaused

public void crawlPaused(java.lang.String statusMessage)

Description copied from interface: CrawlStatusListener

Called when a CrawlController is actually paused (all threads are idle).

Specified by:: crawlPaused in interface CrawlStatusListener

Parameters:: statusMessage - Should be CrawlJob.STATUS_PAUSED. Passed for convenience

crawlResuming

public void crawlResuming(java.lang.String statusMessage)

Description copied from interface: CrawlStatusListener

Called when a CrawlController is resuming a crawl that had been paused.

Specified by:: crawlResuming in interface CrawlStatusListener

Parameters:: statusMessage - Should be CrawlJob.STATUS_RUNNING. Passed for convenience

crawlCheckpoint

public void crawlCheckpoint(java.io.File checkpointDir)
                     throws java.lang.Exception

Description copied from interface: CrawlStatusListener

Called by CrawlController when checkpointing.

Specified by:: crawlCheckpoint in interface CrawlStatusListener

Parameters:: checkpointDir - Checkpoint dir. Write checkpoint state here.
Throws:: java.lang.Exception - A fatal exception. Any exceptions that are let out of this checkpoint are assumed fatal and terminate further checkpoint processing.
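The lifecycle callbacks documented above (crawlStarted through crawlCheckpoint) can be pictured with a small listener that records the order in which it is notified. `StatusListenerSketch` is a hypothetical interface that only mirrors the method signatures shown on this page, and `RecordingListener` is an illustrative implementation; neither is part of Heritrix.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical mirror of the callbacks documented above; not the real interface.
interface StatusListenerSketch {
    void crawlStarted(String message);
    void crawlPausing(String statusMessage);
    void crawlPaused(String statusMessage);
    void crawlResuming(String statusMessage);
    void crawlCheckpoint(File checkpointDir) throws Exception;
    void crawlEnding(String sExitMessage);
    void crawlEnded(String sExitMessage);
}

/** Records each notification and tracks whether the crawl is paused or ended. */
final class RecordingListener implements StatusListenerSketch {
    final List<String> events = new ArrayList<>();
    boolean paused;
    boolean ended;

    public void crawlStarted(String message) { events.add("started"); }

    // Pausing is only a request; the crawl is not idle yet.
    public void crawlPausing(String statusMessage) { events.add("pausing"); }

    // All threads are idle once this is called.
    public void crawlPaused(String statusMessage) {
        paused = true;
        events.add("paused");
    }

    public void crawlResuming(String statusMessage) {
        paused = false;
        events.add("resuming");
    }

    // A real listener would write its state into checkpointDir here.
    public void crawlCheckpoint(File checkpointDir) {
        events.add("checkpoint:" + checkpointDir.getName());
    }

    public void crawlEnding(String sExitMessage) { events.add("ending"); }

    public void crawlEnded(String sExitMessage) {
        ended = true;
        events.add("ended");
    }
}
```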

singleLineReport

public java.lang.String singleLineReport()

Description copied from interface: Reporter

Return a short single-line summary report as a String.

Specified by:: singleLineReport in interface Reporter

Returns:: String single-line summary report

reportTo

public void reportTo(java.io.PrintWriter writer)

Description copied from interface: Reporter

Make a default report to the passed-in Writer. Should be equivalent to reportTo(null, writer)

Specified by:: reportTo in interface Reporter

Parameters:: writer - to receive report
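The stated equivalence between reportTo(writer) and reportTo(null, writer) suggests a simple delegation, sketched below. `ReporterSketch` and `CountsReporter` are hypothetical classes written for this example, and the report text is invented; only the two reportTo shapes and singleLineReport come from this page.

```java
import java.io.PrintWriter;

// Hypothetical sketch of the default-report delegation; not the Heritrix classes.
abstract class ReporterSketch {
    abstract String singleLineReport();

    /** Writes the named report; a null name selects the default report. */
    abstract void reportTo(String name, PrintWriter writer);

    /** Default report: delegates with a null name. */
    void reportTo(PrintWriter writer) {
        reportTo(null, writer);
    }
}

final class CountsReporter extends ReporterSketch {
    private final long queued;
    private final long succeeded;

    CountsReporter(long queued, long succeeded) {
        this.queued = queued;
        this.succeeded = succeeded;
    }

    String singleLineReport() {
        return queued + " queued, " + succeeded + " succeeded";
    }

    void reportTo(String name, PrintWriter writer) {
        writer.println((name == null ? "default" : name) + ": " + singleLineReport());
        writer.flush();
    }
}
```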
