org.archive.crawler.frontier
Class WorkQueueFrontier

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.frontier.AbstractFrontier
                      extended by org.archive.crawler.frontier.WorkQueueFrontier
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants, FetchStatusCodes, UriUniqFilter.HasUriReceiver, CrawlStatusListener, Frontier, Reporter
Direct Known Subclasses:
BdbFrontier

public abstract class WorkQueueFrontier
extends AbstractFrontier
implements FetchStatusCodes, CoreAttributeConstants, UriUniqFilter.HasUriReceiver, java.io.Serializable

A common Frontier base using several queues to hold pending URIs. Uses an in-memory map of all known 'queues' inside a single database. Round-robins between all queues.
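
As a rough illustration of the round-robin idea described above, the following sketch keeps a map of all queues by class key plus a rotation of ready keys. The class RoundRobinSketch and its methods are hypothetical and are not part of Heritrix; the real frontier also handles politeness delays, budgets, and persistent storage, all of which are omitted here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: round-robin over per-key URI queues (not Heritrix code). */
final class RoundRobinSketch {
    // All known queues by class key (loosely analogous to allQueues).
    private final Map<String, Deque<String>> allQueues = new HashMap<>();
    // Keys of non-empty queues in rotation order (loosely analogous to readyClassQueues).
    private final Deque<String> readyKeys = new ArrayDeque<>();

    /** Enqueue a URI; a queue that was empty becomes ready. */
    void schedule(String classKey, String uri) {
        Deque<String> q = allQueues.computeIfAbsent(classKey, k -> new ArrayDeque<>());
        boolean wasEmpty = q.isEmpty();
        q.addLast(uri);
        if (wasEmpty) {
            readyKeys.addLast(classKey);
        }
    }

    /** Return the next URI, or null if no queue is ready. */
    String next() {
        String key = readyKeys.pollFirst();
        if (key == null) {
            return null;
        }
        Deque<String> q = allQueues.get(key);
        String uri = q.pollFirst();
        if (!q.isEmpty()) {
            readyKeys.addLast(key); // rejoin the back of the rotation
        }
        return uri;
    }
}
```

With two URIs scheduled for one key and one for another, successive next() calls alternate between the keys rather than draining the first queue.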

Author:
Gordon Mohr, Christian Kohlschuetter
See Also:
Serialized Form

Nested Class Summary
 class WorkQueueFrontier.WakeTask
           
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Nested classes/interfaces inherited from interface org.archive.crawler.framework.Frontier
Frontier.FrontierGroup
 
Field Summary
static java.lang.String ALL_NONEMPTY
           
static java.lang.String ALL_QUEUES
           
protected  ObjectIdentityCache<java.lang.String,WorkQueue> allQueues
          All known queues.
protected  UriUniqFilter alreadyIncluded
          those UURIs which are already in-process (or processed), and thus should not be rescheduled
static java.lang.String ATTR_BALANCE_REPLENISH_AMOUNT
          amount to replenish budget on each activation (duty cycle)
static java.lang.String ATTR_COST_POLICY
          cost assignment policy to use (by class name)
static java.lang.String ATTR_ERROR_PENALTY_AMOUNT
          extra cost charged against a queue's budget when a URI results in an error
static java.lang.String ATTR_HOLD_QUEUES
          whether to hold queues INACTIVE until needed for throughput
static java.lang.String ATTR_QUEUE_TOTAL_BUDGET
          total expenditure to allow a queue before 'retiring' it
static java.lang.String ATTR_SNOOZE_DEACTIVATE_MS
          When a snooze target for a queue is longer than this amount, and there are already ready queues, deactivate rather than snooze the current queue -- so other more responsive sites get a chance in active rotation.
static java.lang.String ATTR_TARGET_READY_QUEUES_BACKLOG
          target size of ready queues backlog
(package private)  java.lang.String[] AVAILABLE_COST_POLICIES
          all policies available to be chosen
protected static java.lang.Integer DEFAULT_BALANCE_REPLENISH_AMOUNT
           
protected static java.lang.String DEFAULT_COST_POLICY
           
protected static java.lang.Integer DEFAULT_ERROR_PENALTY_AMOUNT
           
protected static java.lang.Boolean DEFAULT_HOLD_QUEUES
           
protected static java.lang.Long DEFAULT_QUEUE_TOTAL_BUDGET
           
static java.lang.Long DEFAULT_SNOOZE_DEACTIVATE_MS
           
protected static java.lang.Integer DEFAULT_TARGET_READY_QUEUES_BACKLOG
           
protected  java.util.Queue<java.lang.String> inactiveQueues
          All 'inactive' queues, not yet in active rotation.
protected  org.apache.commons.collections.Bag inProcessQueues
          all per-class queues from which a URI is outstanding
protected  WorkQueue longestActiveQueue
           
protected  WorkQueueFrontier.WakeTask nextWake
          Task for next wake
protected  java.util.concurrent.BlockingQueue<java.lang.String> readyClassQueues
          All per-class queues whose first item may be handed out.
protected  java.util.concurrent.Semaphore readyFiller
          single-thread access to ready-filling code
protected static java.lang.String[] REPORTS
           
protected  java.util.Queue<java.lang.String> retiredQueues
          'retired' queues, no longer considered for activation.
protected  java.util.SortedSet<WorkQueue> snoozedClassQueues
          All per-class queues held in snoozed state, sorted by wake time.
static java.lang.String STANDARD_REPORT
           
protected  int targetSizeForReadyQueues
          Target (minimum) size to keep readyClassQueues
protected  java.util.Timer wakeTimer
          Timer for tasks which wake head item of snoozedClassQueues
 
Fields inherited from class org.archive.crawler.frontier.AbstractFrontier
ACCEPTABLE_FORCE_QUEUE, ATTR_DELAY_FACTOR, ATTR_FORCE_QUEUE, ATTR_MAX_DELAY, ATTR_MAX_HOST_BANDWIDTH_USAGE, ATTR_MAX_OVERALL_BANDWIDTH_USAGE, ATTR_MAX_RETRIES, ATTR_MIN_DELAY, ATTR_PAUSE_AT_FINISH, ATTR_PAUSE_AT_START, ATTR_PREFERENCE_EMBED_HOPS, ATTR_QUEUE_ASSIGNMENT_POLICY, ATTR_RECOVERY_ENABLED, ATTR_RESPECT_CRAWL_DELAY_UP_TO_SECS, ATTR_RETRY_DELAY, ATTR_SOURCE_TAG_SEEDS, controller, DEFAULT_ATTR_RECOVERY_ENABLED, DEFAULT_DELAY_FACTOR, DEFAULT_FORCE_QUEUE, DEFAULT_MAX_DELAY, DEFAULT_MAX_HOST_BANDWIDTH_USAGE, DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE, DEFAULT_MAX_RETRIES, DEFAULT_MIN_DELAY, DEFAULT_PAUSE_AT_FINISH, DEFAULT_PAUSE_AT_START, DEFAULT_PREFERENCE_EMBED_HOPS, DEFAULT_RESPECT_CRAWL_DELAY_UP_TO_SECS, DEFAULT_RETRY_DELAY, DEFAULT_SOURCE_TAG_SEEDS, disregardedUriCount, failedFetchCount, IGNORED_SEEDS_FILENAME, lastMaxBandwidthKB, liveDisregardedUriCount, liveFailedFetchCount, liveQueuedUriCount, liveSucceededFetchCount, nextOrdinal, processedBytesAfterLastEmittedURI, queuedUriCount, shouldPause, shouldTerminate, succeededFetchCount, totalProcessedBytes
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Fields inherited from interface org.archive.crawler.framework.Frontier
ATTR_NAME
 
Constructor Summary
WorkQueueFrontier(java.lang.String name, java.lang.String description)
          Create the WorkQueueFrontier
 
Method Summary
protected  void appendQueueReports(java.io.PrintWriter w, java.util.Iterator<?> iterator, int total, int max)
          Append queue report to general Frontier report.
protected  CrawlURI asCrawlUri(CandidateURI caUri)
           
 long averageDepth()
           
protected abstract  void closeQueue()
           
 float congestionRatio()
           
 void considerIncluded(UURI u)
          Notify Frontier that it should consider the given UURI as if already scheduled.
 void crawlEnded(java.lang.String sExitMessage)
          Called when a CrawlController has ended a crawl and is about to exit.
protected abstract  UriUniqFilter createAlreadyIncluded()
          Create a UriUniqFilter that will serve as record of already seen URIs.
 long deepestUri()
           
 void deleted(CrawlURI curi)
          Force logging, etc. of operator-deleted CrawlURIs
 long deleteURIs(java.lang.String uriMatch)
          Delete all scheduled URIs matching the given regex.
 long deleteURIs(java.lang.String uriMatch, java.lang.String queueMatch)
          Delete all scheduled URIs matching the given regex, in queues with names matching the second given regex.
 long discoveredUriCount()
          Return the number of discovered URIs.
 void finished(CrawlURI curi)
          Note that the previously emitted CrawlURI has completed its processing (for now).
 void forceWakeQueues()
          Wake all queues as if we were at the end of time
protected  void forget(CrawlURI curi)
          Forget the given CrawlURI.
 Frontier.FrontierGroup getGroup(CrawlURI curi)
          Get the 'frontier group' (usually queue) for the given CrawlURI.
protected abstract  WorkQueue getQueueFor(CrawlURI curi)
          Return the work queue for the given CrawlURI's classKey.
protected abstract  WorkQueue getQueueFor(java.lang.String classKey)
          Return the work queue for the given classKey, or null if no such queue exists.
 java.lang.String[] getReports()
          Get an array of report names offered by this Reporter.
 void initialize(CrawlController c)
          Initializes the Frontier, given the supplied CrawlController.
protected abstract  void initQueue()
           
protected  void initQueuesOfQueues()
          Set up the various queues-of-queues used by the frontier.
 boolean isEmpty()
          Frontier is empty only if all queues are empty and no URIs are in-process
 void kickUpdate()
          Accommodate any changes in settings.
 CrawlURI next()
          Return the next CrawlURI to be processed (and presumably visited/fetched) by a worker thread.
 void receive(CandidateURI caUri)
          Accept the given CandidateURI for scheduling, as it has passed the alreadyIncluded filter.
 void reportTo(java.lang.String name, java.io.PrintWriter writer)
          This method compiles a human readable report on the status of the frontier at the time of the call.
 void schedule(CandidateURI caUri)
          Arrange for the given CandidateURI to be visited, if it is not already scheduled/completed.
protected  void sendToQueue(CrawlURI curi)
          Send a CrawlURI to the appropriate subqueue.
 java.lang.String singleLineLegend()
          Return a legend for the single-line summary report as a String.
 void singleLineReportTo(java.io.PrintWriter w)
          Make a single-line summary report to the passed-in writer
(package private)  void wakeQueues()
          Wake any queues sitting in the snoozed queue whose time has come.
(package private)  void wakeQueuesAsIfAtTime(long nowish)
          Wake any queues sitting in the snoozed queue whose time has come, treating the given time as the current time.
protected abstract  boolean workQueueDataOnDisk()
          Returns true if the WorkQueue implementation of this Frontier stores its workload on disk instead of relying on serialization mechanisms.
 
Methods inherited from class org.archive.crawler.frontier.AbstractFrontier
applySpecialHandling, canonicalize, canonicalize, crawlCheckpoint, crawlEnding, crawlPaused, crawlPausing, crawlResuming, crawlStarted, decrementQueuedCount, disregardedUriCount, doJournalAdded, doJournalDisregarded, doJournalEmitted, doJournalFinishedFailure, doJournalFinishedSuccess, doJournalRescheduled, failedFetchCount, finishedUriCount, getClassKey, getFrontierJournal, getQueueAssignmentPolicy, getServer, importRecoverLog, incrementDisregardedUriCount, incrementFailedFetchCount, incrementQueuedUriCount, incrementQueuedUriCount, incrementSucceededFetchCount, isDisregarded, loadSeeds, log, logLocalizedErrors, needsRetrying, noteAboutToEmit, overMaxRetries, pause, politenessDelayFor, preNext, queuedUriCount, reportTo, retryDelayFor, saveIgnoredItems, scratchDirFor, singleLineReport, start, succeededFetchCount, tally, terminate, totalBytesWritten, unpause
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface org.archive.crawler.framework.Frontier
finalTasks, getInitialMarker, getURIsList
 

Field Detail

ATTR_SNOOZE_DEACTIVATE_MS

public static final java.lang.String ATTR_SNOOZE_DEACTIVATE_MS
When a snooze target for a queue is longer than this amount, and there are already ready queues, deactivate rather than snooze the current queue -- so other more responsive sites get a chance in active rotation. (As a result, queue's next try may be much further in the future than the snooze target delay.)

See Also:
Constant Field Values
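
The snooze-versus-deactivate rule described for ATTR_SNOOZE_DEACTIVATE_MS can be sketched as a small decision function. This is a minimal illustration only: the class SnoozeDecisionSketch, the Action enum, and the parameter names are hypothetical, and the threshold values used with it are arbitrary rather than Heritrix defaults.

```java
/** Hypothetical sketch of the snooze-or-deactivate decision (not Heritrix code). */
final class SnoozeDecisionSketch {
    enum Action { SNOOZE, DEACTIVATE }

    /**
     * @param snoozeMs           requested snooze delay for the current queue
     * @param snoozeDeactivateMs threshold above which deactivation is preferred
     * @param readyQueueCount    number of queues that are currently ready
     */
    static Action decide(long snoozeMs, long snoozeDeactivateMs, int readyQueueCount) {
        boolean longSnooze = snoozeMs > snoozeDeactivateMs;
        boolean othersReady = readyQueueCount > 0;
        // Deactivate only when the wait is long AND other queues can use the slot.
        return (longSnooze && othersReady) ? Action.DEACTIVATE : Action.SNOOZE;
    }
}
```

A long snooze with no other ready queues still results in SNOOZE, since deactivating would gain nothing for throughput.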

DEFAULT_SNOOZE_DEACTIVATE_MS

public static java.lang.Long DEFAULT_SNOOZE_DEACTIVATE_MS

ATTR_HOLD_QUEUES

public static final java.lang.String ATTR_HOLD_QUEUES
whether to hold queues INACTIVE until needed for throughput

See Also:
Constant Field Values

DEFAULT_HOLD_QUEUES

protected static final java.lang.Boolean DEFAULT_HOLD_QUEUES

ATTR_BALANCE_REPLENISH_AMOUNT

public static final java.lang.String ATTR_BALANCE_REPLENISH_AMOUNT
amount to replenish budget on each activation (duty cycle)

See Also:
Constant Field Values

DEFAULT_BALANCE_REPLENISH_AMOUNT

protected static final java.lang.Integer DEFAULT_BALANCE_REPLENISH_AMOUNT

ATTR_ERROR_PENALTY_AMOUNT

public static final java.lang.String ATTR_ERROR_PENALTY_AMOUNT
extra cost charged against a queue's budget when a URI results in an error

See Also:
Constant Field Values

DEFAULT_ERROR_PENALTY_AMOUNT

protected static final java.lang.Integer DEFAULT_ERROR_PENALTY_AMOUNT

ATTR_QUEUE_TOTAL_BUDGET

public static final java.lang.String ATTR_QUEUE_TOTAL_BUDGET
total expenditure to allow a queue before 'retiring' it

See Also:
Constant Field Values
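
One way to picture how a per-activation replenish amount and a total budget could interact is the sketch below. It is an assumption-laden illustration, not the Heritrix implementation: the class QueueBudgetSketch, its fields, and the rules "deactivate when the balance is used up" and "retire when lifetime spending exceeds the total budget" are hypothetical readings of the attribute descriptions.

```java
/** Hypothetical sketch of per-queue budgeting (not Heritrix code). */
final class QueueBudgetSketch {
    private final int replenishAmount; // added to the balance on each activation
    private final long totalBudget;    // lifetime spending allowed before retirement
    private long sessionBalance = 0;
    private long totalExpenditure = 0;

    QueueBudgetSketch(int replenishAmount, long totalBudget) {
        this.replenishAmount = replenishAmount;
        this.totalBudget = totalBudget;
    }

    /** Called when the queue enters active rotation. */
    void activate() {
        sessionBalance += replenishAmount;
    }

    /** Charge the cost of one URI against the balance and the lifetime total. */
    void charge(int cost) {
        sessionBalance -= cost;
        totalExpenditure += cost;
    }

    /** Assumption: a queue with no balance left yields its place to others. */
    boolean shouldDeactivate() {
        return sessionBalance <= 0;
    }

    /** Assumption: exceeding the lifetime budget retires the queue. */
    boolean shouldRetire() {
        return totalExpenditure > totalBudget;
    }
}
```

Under these assumptions the replenish amount bounds how much a queue may spend per turn in active rotation, while the total budget bounds what it may spend over the whole crawl.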

DEFAULT_QUEUE_TOTAL_BUDGET

protected static final java.lang.Long DEFAULT_QUEUE_TOTAL_BUDGET

ATTR_COST_POLICY

public static final java.lang.String ATTR_COST_POLICY
cost assignment policy to use (by class name)

See Also:
Constant Field Values

DEFAULT_COST_POLICY

protected static final java.lang.String DEFAULT_COST_POLICY

ATTR_TARGET_READY_QUEUES_BACKLOG

public static final java.lang.String ATTR_TARGET_READY_QUEUES_BACKLOG
target size of ready queues backlog

See Also:
Constant Field Values

DEFAULT_TARGET_READY_QUEUES_BACKLOG

protected static final java.lang.Integer DEFAULT_TARGET_READY_QUEUES_BACKLOG

alreadyIncluded

protected transient UriUniqFilter alreadyIncluded
those UURIs which are already in-process (or processed), and thus should not be rescheduled


allQueues

protected transient ObjectIdentityCache<java.lang.String,WorkQueue> allQueues
All known queues.


readyClassQueues

protected java.util.concurrent.BlockingQueue<java.lang.String> readyClassQueues
All per-class queues whose first item may be handed out. Linked-list of keys for the queues.


targetSizeForReadyQueues

protected int targetSizeForReadyQueues
Target (minimum) size to keep readyClassQueues
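
A minimal sketch of keeping a ready backlog at a target size, assuming inactive queues are promoted in FIFO order until the target is reached. The class ReadyBacklogSketch and its fillReady method are hypothetical; only the field names are borrowed from this class for readability.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch of topping up the ready backlog (not Heritrix code). */
final class ReadyBacklogSketch {
    final Queue<String> readyClassQueues = new ArrayDeque<>();
    final Queue<String> inactiveQueues = new ArrayDeque<>();
    final int targetSizeForReadyQueues;

    ReadyBacklogSketch(int targetSizeForReadyQueues) {
        this.targetSizeForReadyQueues = targetSizeForReadyQueues;
    }

    /**
     * Promote inactive queue keys until the ready backlog reaches the target
     * or no inactive queues remain. Returns the number of queues activated.
     */
    int fillReady() {
        int activated = 0;
        while (readyClassQueues.size() < targetSizeForReadyQueues
                && !inactiveQueues.isEmpty()) {
            readyClassQueues.add(inactiveQueues.remove());
            activated++;
        }
        return activated;
    }
}
```

Queues beyond the target stay inactive until a ready slot frees up, which matches the idea of holding queues inactive until they are needed for throughput.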


readyFiller

protected transient java.util.concurrent.Semaphore readyFiller
single-thread access to ready-filling code
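
Single-thread access through a Semaphore is commonly arranged with a one-permit semaphore and a non-blocking tryAcquire, as in the sketch below. Whether the frontier blocks or skips when the permit is taken is not stated here, so the skip behaviour, the class ReadyFillerSketch, and the tryFill method are all assumptions.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical sketch of guarding a fill step with a one-permit semaphore. */
final class ReadyFillerSketch {
    private final Semaphore readyFiller = new Semaphore(1);

    /**
     * Run the fill step only if no other caller currently holds the permit.
     * Returns true if the step ran, false if it was skipped.
     */
    boolean tryFill(Runnable fillStep) {
        if (!readyFiller.tryAcquire()) {
            return false; // someone else is filling; skip rather than block
        }
        try {
            fillStep.run();
            return true;
        } finally {
            readyFiller.release(); // always return the permit
        }
    }
}
```

Unlike a ReentrantLock, a Semaphore is not reentrant, so a nested tryFill call from inside the fill step is also skipped.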


inactiveQueues

protected java.util.Queue<java.lang.String> inactiveQueues
All 'inactive' queues, not yet in active rotation. Linked-list of keys for the queues.


retiredQueues

protected java.util.Queue<java.lang.String> retiredQueues
'retired' queues, no longer considered for activation. Linked-list of keys for queues.


inProcessQueues

protected org.apache.commons.collections.Bag inProcessQueues
all per-class queues from which a URI is outstanding


snoozedClassQueues

protected java.util.SortedSet<WorkQueue> snoozedClassQueues
All per-class queues held in snoozed state, sorted by wake time.


wakeTimer

protected transient java.util.Timer wakeTimer
Timer for tasks which wake head item of snoozedClassQueues


nextWake

protected transient WorkQueueFrontier.WakeTask nextWake
Task for next wake


longestActiveQueue

protected WorkQueue longestActiveQueue

AVAILABLE_COST_POLICIES

java.lang.String[] AVAILABLE_COST_POLICIES
all policies available to be chosen


STANDARD_REPORT

public static java.lang.String STANDARD_REPORT

ALL_NONEMPTY

public static java.lang.String ALL_NONEMPTY

ALL_QUEUES

public static java.lang.String ALL_QUEUES

REPORTS

protected static java.lang.String[] REPORTS
Constructor Detail

WorkQueueFrontier

public WorkQueueFrontier(java.lang.String name,
                         java.lang.String description)
Create the WorkQueueFrontier

Parameters:
name -
description -
Method Detail

initialize

public void initialize(CrawlController c)
                throws FatalConfigurationException,
                       java.io.IOException
Initializes the Frontier, given the supplied CrawlController.

Specified by:
initialize in interface Frontier
Overrides:
initialize in class AbstractFrontier
Parameters:
c - The CrawlController that created the Frontier.
Throws:
FatalConfigurationException - If provided settings are illegal or otherwise unusable.
java.io.IOException - If there is a problem reading settings or seeds file from disk.
See Also:
Frontier.initialize(org.archive.crawler.framework.CrawlController)

initQueuesOfQueues

protected void initQueuesOfQueues()
Set up the various queues-of-queues used by the frontier. Override in implementing subclasses to reduce or eliminate risk of queues growing without bound.


crawlEnded

public void crawlEnded(java.lang.String sExitMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController has ended a crawl and is about to exit.

Specified by:
crawlEnded in interface CrawlStatusListener
Overrides:
crawlEnded in class AbstractFrontier
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlJob

createAlreadyIncluded

protected abstract UriUniqFilter createAlreadyIncluded()
                                                throws java.io.IOException
Create a UriUniqFilter that will serve as record of already seen URIs.

Returns:
A UriUniqFilter that will serve as a record of already seen URIs
Throws:
java.io.IOException

schedule

public void schedule(CandidateURI caUri)
Arrange for the given CandidateURI to be visited, if it is not already scheduled/completed.

Specified by:
schedule in interface Frontier
Parameters:
caUri - The URI to schedule.
See Also:
Frontier.schedule(org.archive.crawler.datamodel.CandidateURI)

receive

public void receive(CandidateURI caUri)
Accept the given CandidateURI for scheduling, as it has passed the alreadyIncluded filter. Choose a per-classKey queue and enqueue it. If this item has made an unready queue ready, place that queue on the readyClassQueues queue.

Specified by:
receive in interface UriUniqFilter.HasUriReceiver
Parameters:
caUri - CandidateURI.

asCrawlUri

protected CrawlURI asCrawlUri(CandidateURI caUri)
Overrides:
asCrawlUri in class AbstractFrontier

sendToQueue

protected void sendToQueue(CrawlURI curi)
Send a CrawlURI to the appropriate subqueue.

Parameters:
curi -

kickUpdate

public void kickUpdate()
Accommodate any changes in settings.

Specified by:
kickUpdate in interface Frontier
Overrides:
kickUpdate in class AbstractFrontier
See Also:
Frontier.kickUpdate()

getQueueFor

protected abstract WorkQueue getQueueFor(CrawlURI curi)
Return the work queue for the given CrawlURI's classKey. URIs are ordered and politeness-delayed within their 'class'. If the requested queue is not found, a new instance is created.

Parameters:
curi - CrawlURI to base queue on
Returns:
the found or created WorkQueue

getQueueFor

protected abstract WorkQueue getQueueFor(java.lang.String classKey)
Return the work queue for the given classKey, or null if no such queue exists.

Parameters:
classKey - key to look for
Returns:
the found WorkQueue
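
The two getQueueFor variants differ in what happens on a miss: one creates the queue, the other returns null. A minimal sketch of that distinction, using a plain HashMap and a hypothetical stand-in Queue type (neither is the Heritrix implementation, and the first variant is keyed by class key directly rather than by CrawlURI):

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of find-or-create versus find-only lookup. */
final class QueueLookupSketch {
    /** Minimal stand-in for a work queue. */
    static final class Queue {
        final String classKey;
        Queue(String classKey) {
            this.classKey = classKey;
        }
    }

    private final Map<String, Queue> allQueues = new HashMap<>();

    /** Find the queue for the key, creating it if it does not exist yet. */
    Queue getOrCreate(String classKey) {
        return allQueues.computeIfAbsent(classKey, Queue::new);
    }

    /** Find the queue for the key, or return null if no such queue exists. */
    Queue find(String classKey) {
        return allQueues.get(classKey);
    }
}
```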

next

public CrawlURI next()
              throws java.lang.InterruptedException,
                     EndedException
Return the next CrawlURI to be processed (and presumably visited/fetched) by a worker thread. Relies on the readyClassQueues having been loaded with any work queues that are eligible to provide a URI.

Specified by:
next in interface Frontier
Returns:
next CrawlURI to be processed. Or null if none is available.
Throws:
java.lang.InterruptedException
EndedException
See Also:
Frontier.next()

wakeQueues

void wakeQueues()
Wake any queues sitting in the snoozed queue whose time has come.


wakeQueuesAsIfAtTime

void wakeQueuesAsIfAtTime(long nowish)
Wake any queues sitting in the snoozed queue whose time has come, treating the given time as the current time.


forceWakeQueues

public void forceWakeQueues()
Wake all queues as if we were at the end of time
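
The three wake methods above can be pictured as one routine applied with different notions of "now". The sketch below is a simplified illustration under stated assumptions: the class WakeSketch and the Snoozed type are hypothetical, the methods return the woken keys (the real methods return nothing), and a queue is assumed to wake when its wake time is at or before the given time.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of waking snoozed queues in wake-time order. */
final class WakeSketch {
    /** Minimal stand-in for a snoozed queue: a key plus its wake time. */
    static final class Snoozed {
        final String key;
        final long wakeTime;
        Snoozed(String key, long wakeTime) {
            this.key = key;
            this.wakeTime = wakeTime;
        }
    }

    // Sorted by wake time; ties broken by key so distinct queues are not merged.
    final TreeSet<Snoozed> snoozed = new TreeSet<>(
            Comparator.comparingLong((Snoozed s) -> s.wakeTime)
                      .thenComparing(s -> s.key));

    /** Remove and return keys of all queues due at or before the given time. */
    List<String> wakeQueuesAsIfAtTime(long nowish) {
        List<String> woken = new ArrayList<>();
        Iterator<Snoozed> it = snoozed.iterator();
        while (it.hasNext()) {
            Snoozed s = it.next();
            if (s.wakeTime > nowish) {
                break; // sorted order: everything after this wakes later
            }
            woken.add(s.key);
            it.remove();
        }
        return woken;
    }

    /** Wake whatever is due right now. */
    List<String> wakeQueues() {
        return wakeQueuesAsIfAtTime(System.currentTimeMillis());
    }

    /** Wake everything, as if at the end of time. */
    List<String> forceWakeQueues() {
        return wakeQueuesAsIfAtTime(Long.MAX_VALUE);
    }
}
```

Because the set is sorted by wake time, the scan can stop at the first queue that is not yet due.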


finished

public void finished(CrawlURI curi)
Note that the previously emitted CrawlURI has completed its processing (for now). The CrawlURI may be scheduled to retry, if appropriate, and other related URIs may become eligible for release via the next next() call, as a result of finished().

Specified by:
finished in interface Frontier
Parameters:
curi - The URI that has finished processing.
See Also:
Frontier.finished(org.archive.crawler.datamodel.CrawlURI)

forget

protected void forget(CrawlURI curi)
Forget the given CrawlURI. This allows a new instance to be created in the future, if it is reencountered under different circumstances.

Parameters:
curi - The CrawlURI to forget

discoveredUriCount

public long discoveredUriCount()
Return the number of discovered URIs.

Specified by:
discoveredUriCount in interface Frontier
Returns:
Number of discovered URIs.
See Also:
Frontier.discoveredUriCount()

deleteURIs

public long deleteURIs(java.lang.String uriMatch)
Delete all scheduled URIs matching the given regex.

Specified by:
deleteURIs in interface Frontier
Parameters:
uriMatch - regex of URIs to delete
Returns:
Number of items deleted.

deleteURIs

public long deleteURIs(java.lang.String uriMatch,
                       java.lang.String queueMatch)
Delete all scheduled URIs matching the given regex, in queues with names matching the second given regex.

Specified by:
deleteURIs in interface Frontier
Parameters:
uriMatch - regex of URIs to delete
queueMatch - regex of queues to affect, or null for all
Returns:
Number of items deleted.
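
A rough sketch of regex-driven deletion across named queues, with a null queue pattern meaning "all queues". The class DeleteUrisSketch and its add method are hypothetical, and the use of full-string matching (Matcher.matches rather than find) is an assumption, since the matching mode is not stated here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch of deleting scheduled URIs by regex (not Heritrix code). */
final class DeleteUrisSketch {
    final Map<String, Deque<String>> queues = new LinkedHashMap<>();

    void add(String queueName, String uri) {
        queues.computeIfAbsent(queueName, k -> new ArrayDeque<>()).addLast(uri);
    }

    /** Delete matching URIs from queues whose names match; null means all queues. */
    long deleteURIs(String uriMatch, String queueMatch) {
        Pattern uriPattern = Pattern.compile(uriMatch);
        Pattern queuePattern = (queueMatch == null) ? null : Pattern.compile(queueMatch);
        long deleted = 0;
        for (Map.Entry<String, Deque<String>> e : queues.entrySet()) {
            if (queuePattern != null && !queuePattern.matcher(e.getKey()).matches()) {
                continue; // queue name does not match; leave it untouched
            }
            Deque<String> q = e.getValue();
            int before = q.size();
            q.removeIf(uri -> uriPattern.matcher(uri).matches());
            deleted += before - q.size();
        }
        return deleted;
    }

    /** Single-argument form: apply the URI regex to every queue. */
    long deleteURIs(String uriMatch) {
        return deleteURIs(uriMatch, null);
    }
}
```

The returned count corresponds to the "Number of items deleted" value documented for both overloads.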

getReports

public java.lang.String[] getReports()
Description copied from interface: Reporter
Get an array of report names offered by this Reporter. A name in brackets indicates that a free-form String, in accordance with the informal description inside the brackets, may yield a useful report.

Specified by:
getReports in interface Reporter
Returns:
String array of report names, empty if there is only one report type

singleLineReportTo

public void singleLineReportTo(java.io.PrintWriter w)
Description copied from interface: Reporter
Make a single-line summary report to the passed-in writer

Specified by:
singleLineReportTo in interface Reporter
Parameters:
w - Where to write to.

singleLineLegend

public java.lang.String singleLineLegend()
Description copied from interface: Reporter
Return a legend for the single-line summary report as a String.

Specified by:
singleLineLegend in interface Reporter
Returns:
String single-line summary legend

reportTo

public void reportTo(java.lang.String name,
                     java.io.PrintWriter writer)
This method compiles a human readable report on the status of the frontier at the time of the call.

Specified by:
reportTo in interface Reporter
Parameters:
name - Name of report.
writer - Where to write to.

appendQueueReports

protected void appendQueueReports(java.io.PrintWriter w,
                                  java.util.Iterator<?> iterator,
                                  int total,
                                  int max)
Append queue report to general Frontier report.

Parameters:
w - PrintWriter to append to.
iterator - An iterator over
total -
max -

deleted

public void deleted(CrawlURI curi)
Force logging, etc. of operator-deleted CrawlURIs

Specified by:
deleted in interface Frontier
Parameters:
curi - Deleted CrawlURI.
See Also:
Frontier.deleted(org.archive.crawler.datamodel.CrawlURI)

considerIncluded

public void considerIncluded(UURI u)
Description copied from interface: Frontier
Notify Frontier that it should consider the given UURI as if already scheduled.

Specified by:
considerIncluded in interface Frontier
Parameters:
u - UURI instance to add to the Already Included set.

initQueue

protected abstract void initQueue()
                           throws java.io.IOException
Throws:
java.io.IOException

closeQueue

protected abstract void closeQueue()
                            throws java.io.IOException
Throws:
java.io.IOException

workQueueDataOnDisk

protected abstract boolean workQueueDataOnDisk()
Returns true if the WorkQueue implementation of this Frontier stores its workload on disk instead of relying on serialization mechanisms. TODO: rename! (this is a very misleading name) or kill (don't see any implementations that return false)

Returns:
a constant boolean value for this class/instance

getGroup

public Frontier.FrontierGroup getGroup(CrawlURI curi)
Description copied from interface: Frontier
Get the 'frontier group' (usually queue) for the given CrawlURI.

Specified by:
getGroup in interface Frontier
Parameters:
curi - CrawlURI to find matching group
Returns:
FrontierGroup for the CrawlURI

averageDepth

public long averageDepth()
Specified by:
averageDepth in interface Frontier

congestionRatio

public float congestionRatio()
Specified by:
congestionRatio in interface Frontier

deepestUri

public long deepestUri()
Specified by:
deepestUri in interface Frontier

isEmpty

public boolean isEmpty()
Description copied from class: AbstractFrontier
Frontier is empty only if all queues are empty and no URIs are in-process

Specified by:
isEmpty in interface Frontier
Overrides:
isEmpty in class AbstractFrontier
Returns:
True if queues are empty.


Copyright © 2003-2011 Internet Archive. All Rights Reserved.