|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Object javax.management.Attribute org.archive.crawler.settings.Type org.archive.crawler.settings.ComplexType org.archive.crawler.settings.ModuleType org.archive.crawler.frontier.AbstractFrontier org.archive.crawler.frontier.WorkQueueFrontier
public abstract class WorkQueueFrontier
A common Frontier base using several queues to hold pending URIs. Uses in-memory map of all known 'queues' inside a single database. Round-robins between all queues.
Nested Class Summary | |
---|---|
class |
WorkQueueFrontier.WakeTask
|
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType |
---|
ComplexType.MBeanAttributeInfoIterator |
Nested classes/interfaces inherited from interface org.archive.crawler.framework.Frontier |
---|
Frontier.FrontierGroup |
Field Summary | |
---|---|
static java.lang.String |
ALL_NONEMPTY
|
static java.lang.String |
ALL_QUEUES
|
protected ObjectIdentityCache<java.lang.String,WorkQueue> |
allQueues
All known queues. |
protected UriUniqFilter |
alreadyIncluded
those UURIs which are already in-process (or processed), and thus should not be rescheduled |
static java.lang.String |
ATTR_BALANCE_REPLENISH_AMOUNT
amount to replenish budget on each activation (duty cycle) |
static java.lang.String |
ATTR_COST_POLICY
cost assignment policy to use (by class name) |
static java.lang.String |
ATTR_ERROR_PENALTY_AMOUNT
whether to hold queues INACTIVE until needed for throughput |
static java.lang.String |
ATTR_HOLD_QUEUES
whether to hold queues INACTIVE until needed for throughput |
static java.lang.String |
ATTR_QUEUE_TOTAL_BUDGET
total expenditure to allow a queue before 'retiring' it |
static java.lang.String |
ATTR_SNOOZE_DEACTIVATE_MS
When a snooze target for a queue is longer than this amount, and there are already ready queues, deactivate rather than snooze the current queue -- so other more responsive sites get a chance in active rotation. |
static java.lang.String |
ATTR_TARGET_READY_QUEUES_BACKLOG
target size of ready queues backlog |
(package private) java.lang.String[] |
AVAILABLE_COST_POLICIES
all policies available to be chosen |
protected static java.lang.Integer |
DEFAULT_BALANCE_REPLENISH_AMOUNT
|
protected static java.lang.String |
DEFAULT_COST_POLICY
|
protected static java.lang.Integer |
DEFAULT_ERROR_PENALTY_AMOUNT
|
protected static java.lang.Boolean |
DEFAULT_HOLD_QUEUES
|
protected static java.lang.Long |
DEFAULT_QUEUE_TOTAL_BUDGET
|
static java.lang.Long |
DEFAULT_SNOOZE_DEACTIVATE_MS
|
protected static java.lang.Integer |
DEFAULT_TARGET_READY_QUEUES_BACKLOG
|
protected java.util.Queue<java.lang.String> |
inactiveQueues
All 'inactive' queues, not yet in active rotation. |
protected org.apache.commons.collections.Bag |
inProcessQueues
all per-class queues from whom a URI is outstanding |
protected WorkQueue |
longestActiveQueue
|
protected WorkQueueFrontier.WakeTask |
nextWake
Task for next wake |
protected java.util.concurrent.BlockingQueue<java.lang.String> |
readyClassQueues
All per-class queues whose first item may be handed out. |
protected java.util.concurrent.Semaphore |
readyFiller
single-thread access to ready-filling code |
protected static java.lang.String[] |
REPORTS
|
protected java.util.Queue<java.lang.String> |
retiredQueues
'retired' queues, no longer considered for activation. |
protected java.util.SortedSet<WorkQueue> |
snoozedClassQueues
All per-class queues held in snoozed state, sorted by wake time. |
static java.lang.String |
STANDARD_REPORT
|
protected int |
targetSizeForReadyQueues
Target (minimum) size to keep readyClassQueues |
protected java.util.Timer |
wakeTimer
Timer for tasks which wake head item of snoozedClassQueues |
Fields inherited from class org.archive.crawler.settings.ComplexType |
---|
definition, definitionMap |
Fields inherited from interface org.archive.crawler.framework.Frontier |
---|
ATTR_NAME |
Constructor Summary | |
---|---|
WorkQueueFrontier(java.lang.String name,
java.lang.String description)
Create the CommonFrontier |
Method Summary | |
---|---|
protected void |
appendQueueReports(java.io.PrintWriter w,
java.util.Iterator<?> iterator,
int total,
int max)
Append queue report to general Frontier report. |
protected CrawlURI |
asCrawlUri(CandidateURI caUri)
|
long |
averageDepth()
|
protected abstract void |
closeQueue()
|
float |
congestionRatio()
|
void |
considerIncluded(UURI u)
Notify Frontier that it should consider the given UURI as if already scheduled. |
void |
crawlEnded(java.lang.String sExitMessage)
Called when a CrawlController has ended a crawl and is about to exit. |
protected abstract UriUniqFilter |
createAlreadyIncluded()
Create a UriUniqFilter that will serve as record of already seen URIs. |
long |
deepestUri()
|
void |
deleted(CrawlURI curi)
Force logging, etc. |
long |
deleteURIs(java.lang.String uriMatch)
Delete all scheduled URIs matching the given regex. |
long |
deleteURIs(java.lang.String uriMatch,
java.lang.String queueMatch)
Delete all scheduled URIs matching the given regex, in queues with names matching the second given regex. |
long |
discoveredUriCount()
(non-Javadoc) |
void |
finished(CrawlURI curi)
Note that the previously emitted CrawlURI has completed its processing (for now). |
void |
forceWakeQueues()
Wake all queues as if we were at the end of time |
protected void |
forget(CrawlURI curi)
Forget the given CrawlURI. |
Frontier.FrontierGroup |
getGroup(CrawlURI curi)
Get the 'frontier group' (usually queue) for the given CrawlURI. |
protected abstract WorkQueue |
getQueueFor(CrawlURI curi)
Return the work queue for the given CrawlURI's classKey. |
protected abstract WorkQueue |
getQueueFor(java.lang.String classKey)
Return the work queue for the given classKey, or null if no such queue exists. |
java.lang.String[] |
getReports()
Get an array of report names offered by this Reporter. |
void |
initialize(CrawlController c)
Initializes the Frontier, given the supplied CrawlController. |
protected abstract void |
initQueue()
|
protected void |
initQueuesOfQueues()
Set up the various queues-of-queues used by the frontier. |
boolean |
isEmpty()
Frontier is empty only if all queues are empty and no URIs are in-process |
void |
kickUpdate()
Accomodate any changes in settings. |
CrawlURI |
next()
Return the next CrawlURI to be processed (and presumably visited/fetched) by a a worker thread. |
void |
receive(CandidateURI caUri)
Accept the given CandidateURI for scheduling, as it has passed the alreadyIncluded filter. |
void |
reportTo(java.lang.String name,
java.io.PrintWriter writer)
This method compiles a human readable report on the status of the frontier at the time of the call. |
void |
schedule(CandidateURI caUri)
Arrange for the given CandidateURI to be visited, if it is not already scheduled/completed. |
protected void |
sendToQueue(CrawlURI curi)
Send a CrawlURI to the appropriate subqueue. |
java.lang.String |
singleLineLegend()
Return a legend for the single-line summary report as a String. |
void |
singleLineReportTo(java.io.PrintWriter w)
Make a single-line summary report to the passed-in writer |
(package private) void |
wakeQueues()
Wake any queues sitting in the snoozed queue whose time has come. |
(package private) void |
wakeQueuesAsIfAtTime(long nowish)
Wake any queues sitting in the snoozed queue whose time has come. |
protected abstract boolean |
workQueueDataOnDisk()
Returns true if the WorkQueue implementation of this
Frontier stores its workload on disk instead of relying
on serialization mechanisms. |
Methods inherited from class org.archive.crawler.settings.ModuleType |
---|
addElement, listUsedFiles |
Methods inherited from class org.archive.crawler.settings.Type |
---|
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
Methods inherited from class javax.management.Attribute |
---|
getName, hashCode |
Methods inherited from class java.lang.Object |
---|
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
Methods inherited from interface org.archive.crawler.framework.Frontier |
---|
finalTasks, getInitialMarker, getURIsList |
Field Detail |
---|
public static final java.lang.String ATTR_SNOOZE_DEACTIVATE_MS
public static java.lang.Long DEFAULT_SNOOZE_DEACTIVATE_MS
public static final java.lang.String ATTR_HOLD_QUEUES
protected static final java.lang.Boolean DEFAULT_HOLD_QUEUES
public static final java.lang.String ATTR_BALANCE_REPLENISH_AMOUNT
protected static final java.lang.Integer DEFAULT_BALANCE_REPLENISH_AMOUNT
public static final java.lang.String ATTR_ERROR_PENALTY_AMOUNT
protected static final java.lang.Integer DEFAULT_ERROR_PENALTY_AMOUNT
public static final java.lang.String ATTR_QUEUE_TOTAL_BUDGET
protected static final java.lang.Long DEFAULT_QUEUE_TOTAL_BUDGET
public static final java.lang.String ATTR_COST_POLICY
protected static final java.lang.String DEFAULT_COST_POLICY
public static final java.lang.String ATTR_TARGET_READY_QUEUES_BACKLOG
protected static final java.lang.Integer DEFAULT_TARGET_READY_QUEUES_BACKLOG
protected transient UriUniqFilter alreadyIncluded
protected transient ObjectIdentityCache<java.lang.String,WorkQueue> allQueues
protected java.util.concurrent.BlockingQueue<java.lang.String> readyClassQueues
protected int targetSizeForReadyQueues
protected transient java.util.concurrent.Semaphore readyFiller
protected java.util.Queue<java.lang.String> inactiveQueues
protected java.util.Queue<java.lang.String> retiredQueues
protected org.apache.commons.collections.Bag inProcessQueues
protected java.util.SortedSet<WorkQueue> snoozedClassQueues
protected transient java.util.Timer wakeTimer
protected transient WorkQueueFrontier.WakeTask nextWake
protected WorkQueue longestActiveQueue
java.lang.String[] AVAILABLE_COST_POLICIES
public static java.lang.String STANDARD_REPORT
public static java.lang.String ALL_NONEMPTY
public static java.lang.String ALL_QUEUES
protected static java.lang.String[] REPORTS
Constructor Detail |
---|
public WorkQueueFrontier(java.lang.String name, java.lang.String description)
name
- description
- Method Detail |
---|
public void initialize(CrawlController c) throws FatalConfigurationException, java.io.IOException
initialize
in interface Frontier
initialize
in class AbstractFrontier
c
- The CrawlController that created the Frontier.
FatalConfigurationException
- If provided settings are illegal or
otherwise unusable.
java.io.IOException
- If there is a problem reading settings or seeds file
from disk.Frontier.initialize(org.archive.crawler.framework.CrawlController)
protected void initQueuesOfQueues()
public void crawlEnded(java.lang.String sExitMessage)
CrawlStatusListener
crawlEnded
in interface CrawlStatusListener
crawlEnded
in class AbstractFrontier
sExitMessage
- Type of exit. Should be one of the STATUS constants
in defined in CrawlJob.CrawlJob
protected abstract UriUniqFilter createAlreadyIncluded() throws java.io.IOException
java.io.IOException
public void schedule(CandidateURI caUri)
schedule
in interface Frontier
caUri
- The URI to schedule.Frontier.schedule(org.archive.crawler.datamodel.CandidateURI)
public void receive(CandidateURI caUri)
receive
in interface UriUniqFilter.HasUriReceiver
caUri
- CandidateURI.protected CrawlURI asCrawlUri(CandidateURI caUri)
asCrawlUri
in class AbstractFrontier
protected void sendToQueue(CrawlURI curi)
curi
- public void kickUpdate()
kickUpdate
in interface Frontier
kickUpdate
in class AbstractFrontier
Frontier.kickUpdate()
protected abstract WorkQueue getQueueFor(CrawlURI curi)
curi
- CrawlURI to base queue on
protected abstract WorkQueue getQueueFor(java.lang.String classKey)
classKey
- key to look for
public CrawlURI next() throws java.lang.InterruptedException, EndedException
next
in interface Frontier
java.lang.InterruptedException
EndedException
Frontier.next()
void wakeQueues()
void wakeQueuesAsIfAtTime(long nowish)
public void forceWakeQueues()
public void finished(CrawlURI curi)
finished
in interface Frontier
curi
- The URI that has finished processing.Frontier.finished(org.archive.crawler.datamodel.CrawlURI)
protected void forget(CrawlURI curi)
curi
- The CrawlURI to forgetpublic long discoveredUriCount()
discoveredUriCount
in interface Frontier
Frontier.discoveredUriCount()
public long deleteURIs(java.lang.String uriMatch)
deleteURIs
in interface Frontier
match
- regex of URIs to delete
public long deleteURIs(java.lang.String uriMatch, java.lang.String queueMatch)
deleteURIs
in interface Frontier
uriMatch
- regex of URIs to deletequeueMatch
- regex of queues to affect, or null for all
public java.lang.String[] getReports()
Reporter
getReports
in interface Reporter
public void singleLineReportTo(java.io.PrintWriter w)
Reporter
singleLineReportTo
in interface Reporter
w
- Where to write to.public java.lang.String singleLineLegend()
Reporter
singleLineLegend
in interface Reporter
public void reportTo(java.lang.String name, java.io.PrintWriter writer)
reportTo
in interface Reporter
name
- Name of report.writer
- Where to write to.protected void appendQueueReports(java.io.PrintWriter w, java.util.Iterator<?> iterator, int total, int max)
w
- StringBuffer to append to.iterator
- An iterator overtotal
- max
- public void deleted(CrawlURI curi)
deleted
in interface Frontier
curi
- Deleted CrawlURI.Frontier.deleted(org.archive.crawler.datamodel.CrawlURI)
public void considerIncluded(UURI u)
Frontier
considerIncluded
in interface Frontier
u
- UURI instance to add to the Already Included set.protected abstract void initQueue() throws java.io.IOException
java.io.IOException
protected abstract void closeQueue() throws java.io.IOException
java.io.IOException
protected abstract boolean workQueueDataOnDisk()
true
if the WorkQueue implementation of this
Frontier stores its workload on disk instead of relying
on serialization mechanisms.
TODO: rename! (this is a very misleading name) or kill (don't
see any implementations that return false)
public Frontier.FrontierGroup getGroup(CrawlURI curi)
Frontier
getGroup
in interface Frontier
curi
- CrawlURI to find matching group
public long averageDepth()
averageDepth
in interface Frontier
public float congestionRatio()
congestionRatio
in interface Frontier
public long deepestUri()
deepestUri
in interface Frontier
public boolean isEmpty()
AbstractFrontier
isEmpty
in interface Frontier
isEmpty
in class AbstractFrontier
|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |