org.archive.crawler.frontier
Class WorkQueue

java.lang.Object
  extended by org.archive.crawler.frontier.WorkQueue
All Implemented Interfaces:
java.io.Serializable, java.lang.Comparable, CrawlSubstats.HasCrawlSubstats, Frontier.FrontierGroup, Reporter
Direct Known Subclasses:
BdbWorkQueue

public abstract class WorkQueue
extends java.lang.Object
implements Frontier.FrontierGroup, java.lang.Comparable, java.io.Serializable, Reporter

A single queue of related URIs to visit, grouped by a classKey (typically "hostname:port" or similar).

Author:
gojomo, Christian Kohlschuetter
See Also:
Serialized Form

Field Summary
protected  java.lang.String classKey
          The classKey
(package private) static long serialVersionUID
           
protected  CrawlSubstats substats
          Substats for all CrawlURIs in this group
 
Constructor Summary
WorkQueue(java.lang.String pClassKey)
           
 
Method Summary
 void clearHeld()
          Clear isHeld to false
 int compareTo(java.lang.Object obj)
           
protected abstract  void deleteItem(WorkQueueFrontier frontier, CrawlURI item)
          Removes the given item from the queue.
 long deleteMatching(WorkQueueFrontier frontier, java.lang.String match)
          Delete URIs matching the given pattern from this queue.
protected abstract  long deleteMatchingFromQueue(WorkQueueFrontier frontier, java.lang.String match)
          Delete URIs matching the given pattern from this queue.
 void dequeue(WorkQueueFrontier frontier)
          Remove the peekItem from the queue and adjust the count.
 void enqueue(WorkQueueFrontier frontier, CrawlURI curi)
          Add the given CrawlURI, noting its addition in running count.
 int expend(int amount)
          Decrease the internal running budget by the given amount.
 java.lang.String getClassKey()
           
 UURI getContextUURI(WorkQueueFrontier wqf)
           
 long getCount()
           
 long getPendingExpenditure()
          Return the tally of all URI costs currently inside this queue
 java.lang.String[] getReports()
          Get an array of report names offered by this Reporter.
 int getSessionBalance()
          Return current session 'activity budget balance'
 CrawlSubstats getSubstats()
           
 long getTotalBudget()
          Retrieve the total expenditure level allowed by this queue.
 long getTotalExpenditure()
          Return the tally of all expenditures from this queue (dequeued items)
 long getWakeTime()
           
 int incrementSessionBalance(int amount)
          Increase the internal running budget to be used before deactivating the queue
protected abstract  void insertItem(WorkQueueFrontier frontier, CrawlURI curi, boolean expectedPresent)
          Insert the given curi, whether it is already present or not.
 boolean isHeld()
          Whether the queue is already in a lifecycle stage -- such as ready, in-progress, snoozed -- and thus should not be redundantly inserted to readyClassQueues
 boolean isOverBudget()
          Check whether queue has temporarily or permanently exceeded its budget.
 boolean isRetired()
           
 void noteError(int penalty)
          Note an error and assess an extra penalty.
 CrawlURI peek(WorkQueueFrontier frontier)
          Return the topmost queue item -- and remember it, such that even later higher-priority inserts don't change it.
protected abstract  CrawlURI peekItem(WorkQueueFrontier frontier)
          Returns first item from queue (does not delete)
 int refund(int amount)
          A URI should not have been charged against the queue (e.g. it was disregarded); return the amount expended
 void reportTo(java.io.PrintWriter writer)
          Make a default report to the passed-in Writer.
 void reportTo(java.lang.String name, java.io.PrintWriter writer)
          Make a report of the given name to the passed-in Writer. If the name is null, give the default report.
protected  void resume(WorkQueueFrontier frontier)
          Resumes this WorkQueue.
 void setActive(WorkQueueFrontier frontier, boolean b)
           
 void setHeld()
          Set isHeld to true
 void setRetired(boolean b)
          Set the retired status of this queue.
 void setSessionBalance(int balance)
          Set the session 'activity budget balance' to the given value
 void setTotalBudget(long budget)
          Set the total expenditure level allowable before queue is considered inherently 'over-budget'.
 void setWakeTime(long l)
           
 java.lang.String singleLineLegend()
          Return a legend for the single-line summary report as a String.
 java.lang.String singleLineReport()
          Return a short single-line summary report as a String.
 void singleLineReportTo(java.io.PrintWriter writer)
          Make a single-line summary report to the passed-in writer
protected  void suspend(WorkQueueFrontier frontier)
          Suspends this WorkQueue.
 void unpeek()
          Forgive the peek, allowing a subsequent peek to return a different item.
 void update(WorkQueueFrontier frontier, CrawlURI curi)
          Update the given CrawlURI, which should already be present.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

serialVersionUID

static final long serialVersionUID
See Also:
Constant Field Values

classKey

protected final java.lang.String classKey
The classKey


substats

protected CrawlSubstats substats
Substats for all CrawlURIs in this group

Constructor Detail

WorkQueue

public WorkQueue(java.lang.String pClassKey)
Method Detail

deleteMatching

public long deleteMatching(WorkQueueFrontier frontier,
                           java.lang.String match)
Delete URIs matching the given pattern from this queue.

Parameters:
frontier - Work queues manager.
match - the pattern to match
Returns:
count of deleted URIs

enqueue

public void enqueue(WorkQueueFrontier frontier,
                    CrawlURI curi)
Add the given CrawlURI, noting its addition in running count. (It should not already be present.)

Parameters:
frontier - Work queues manager.
curi - CrawlURI to insert.

peek

public CrawlURI peek(WorkQueueFrontier frontier)
Return the topmost queue item -- and remember it, such that even later higher-priority inserts don't change it. TODO: evaluate if this is really necessary

Parameters:
frontier - Work queues manager
Returns:
topmost queue item, or null

dequeue

public void dequeue(WorkQueueFrontier frontier)
Remove the peekItem from the queue and adjust the count.

Parameters:
frontier - Work queues manager.
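
The peek, unpeek and dequeue contract described above can be illustrated with a small stand-in class. This is only a sketch, not the Heritrix implementation: PeekDemoQueue, its String items and the use of natural String ordering as the priority are all assumptions made for illustration, and the frontier argument is omitted.

```java
import java.util.PriorityQueue;

/** Hypothetical stand-in for a work queue; not the Heritrix implementation. */
class PeekDemoQueue {
    // Assumption: natural String order stands in for URI priority.
    private final PriorityQueue<String> items = new PriorityQueue<>();
    private String peekItem; // remembered result of the last peek, or null
    private long count;

    /** Add an item and note its addition in the running count. */
    void enqueue(String item) {
        items.add(item);
        count++;
    }

    /** Return the topmost item and remember it until dequeue() or unpeek(). */
    String peek() {
        if (peekItem == null) {
            peekItem = items.peek();
        }
        return peekItem;
    }

    /** Remove the remembered item and adjust the count. */
    void dequeue() {
        if (peekItem == null) {
            throw new IllegalStateException("dequeue() called without a prior peek()");
        }
        items.remove(peekItem);
        peekItem = null;
        count--;
    }

    /** Forget the remembered item so the next peek() may return a different one. */
    void unpeek() {
        peekItem = null;
    }

    long getCount() {
        return count;
    }
}
```

In this sketch, an item enqueued after a peek() does not displace the remembered item, even if it sorts ahead of it. Only unpeek() or dequeue() lets a later peek() return something else.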

setSessionBalance

public void setSessionBalance(int balance)
Set the session 'activity budget balance' to the given value

Parameters:
balance - to use

getSessionBalance

public int getSessionBalance()
Return current session 'activity budget balance'

Returns:
session balance

setTotalBudget

public void setTotalBudget(long budget)
Set the total expenditure level allowable before queue is considered inherently 'over-budget'.

Parameters:
budget - total expenditure level to allow

getTotalBudget

public long getTotalBudget()
Retrieve the total expenditure level allowed by this queue.

Returns:
the queue's total budget

isOverBudget

public boolean isOverBudget()
Check whether queue has temporarily or permanently exceeded its budget.

Returns:
true if queue is over its set budget(s)

getTotalExpenditure

public long getTotalExpenditure()
Return the tally of all expenditures from this queue (dequeued items)

Returns:
total amount expended on this queue

getPendingExpenditure

public long getPendingExpenditure()
Return the tally of all URI costs currently inside this queue

Returns:
total cost of the URIs currently pending in this queue

incrementSessionBalance

public int incrementSessionBalance(int amount)
Increase the internal running budget to be used before deactivating the queue

Parameters:
amount - amount to increment
Returns:
updated budget value

expend

public int expend(int amount)
Decrease the internal running budget by the given amount.

Parameters:
amount - to decrement
Returns:
updated budget value

refund

public int refund(int amount)
A URI should not have been charged against the queue (e.g. it was disregarded); return the amount expended

Parameters:
amount - to return
Returns:
updated budget value

noteError

public void noteError(int penalty)
Note an error and assess an extra penalty.

Parameters:
penalty - additional amount to deduct
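
The budget methods above (incrementSessionBalance, expend, refund, noteError, isOverBudget) can be read as operations on a small ledger. The sketch below is one plausible reading, not the actual Heritrix code. The class name BudgetLedgerSketch is hypothetical, and three points are assumptions: a negative total budget means "no limit", an exhausted session balance is the "temporary" over-budget case, and an error penalty is counted as expenditure.

```java
/** Hypothetical ledger mirroring the documented budget methods; not Heritrix code. */
class BudgetLedgerSketch {
    private int sessionBalance;     // running 'activity budget balance'
    private long totalBudget = -1;  // assumption: negative means "no limit"
    private long totalExpenditure;  // tally of everything charged so far

    /** Increase the running budget; returns the updated balance. */
    int incrementSessionBalance(int amount) {
        sessionBalance += amount;
        return sessionBalance;
    }

    /** Charge a cost against the queue; returns the updated balance. */
    int expend(int amount) {
        sessionBalance -= amount;
        totalExpenditure += amount;
        return sessionBalance;
    }

    /** Undo a charge that should not have been made; returns the updated balance. */
    int refund(int amount) {
        sessionBalance += amount;
        totalExpenditure -= amount;
        return sessionBalance;
    }

    /** Deduct an extra penalty (assumed here to count as expenditure). */
    void noteError(int penalty) {
        sessionBalance -= penalty;
        totalExpenditure += penalty;
    }

    void setTotalBudget(long budget) { totalBudget = budget; }
    long getTotalExpenditure() { return totalExpenditure; }
    int getSessionBalance() { return sessionBalance; }

    /** Over budget if the session balance is used up or the total budget is exceeded. */
    boolean isOverBudget() {
        boolean sessionExhausted = sessionBalance <= 0;  // assumed "temporary" case
        boolean totalExceeded = totalBudget >= 0 && totalExpenditure > totalBudget; // assumed "permanent" case
        return sessionExhausted || totalExceeded;
    }
}
```

Under these assumptions, topping up the session balance clears the temporary condition, while exceeding the total budget persists until the budget itself is raised.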

setWakeTime

public void setWakeTime(long l)
Parameters:
l - the new wakeTime

getWakeTime

public long getWakeTime()
Returns:
wakeTime

getClassKey

public java.lang.String getClassKey()
Returns:
classKey, the 'identifier', for this queue.

clearHeld

public void clearHeld()
Clear isHeld to false


isHeld

public boolean isHeld()
Whether the queue is already in a lifecycle stage -- such as ready, in-progress, snoozed -- and thus should not be redundantly inserted to readyClassQueues

Returns:
isHeld

setHeld

public void setHeld()
Set isHeld to true
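
The isHeld flag exists so that a queue already in some lifecycle stage is not inserted into readyClassQueues a second time. A minimal sketch of that guard follows. HeldDemoQueue, ReadyListDemo and readyIfNotHeld are hypothetical names, and when the flag is set and cleared in the real frontier is an assumption here.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical queue carrying only the held flag. */
class HeldDemoQueue {
    private boolean isHeld;

    boolean isHeld() { return isHeld; }
    void setHeld() { isHeld = true; }
    void clearHeld() { isHeld = false; }
}

/** Hypothetical frontier fragment that owns the list of ready queues. */
class ReadyListDemo {
    private final Deque<HeldDemoQueue> readyClassQueues = new ArrayDeque<>();

    /** Make the queue ready unless it is already held; returns true if it was added. */
    boolean readyIfNotHeld(HeldDemoQueue queue) {
        if (queue.isHeld()) {
            return false; // already ready, in-progress or snoozed
        }
        queue.setHeld();
        readyClassQueues.add(queue);
        return true;
    }

    /** Take the next ready queue; it stays held while it is being worked on. */
    HeldDemoQueue takeReady() {
        return readyClassQueues.poll();
    }

    int readyCount() {
        return readyClassQueues.size();
    }
}
```

In this sketch the flag is cleared only once the queue has left every lifecycle stage, at which point it can be made ready again.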


unpeek

public void unpeek()
Forgive the peek, allowing a subsequent peek to return a different item.


compareTo

public final int compareTo(java.lang.Object obj)
Specified by:
compareTo in interface java.lang.Comparable

update

public void update(WorkQueueFrontier frontier,
                   CrawlURI curi)
Update the given CrawlURI, which should already be present. (This is not checked.) Equivalent to an enqueue without affecting the count.

Parameters:
frontier - Work queues manager.
curi - CrawlURI to update.

getCount

public long getCount()
Returns:
Returns the count.

insertItem

protected abstract void insertItem(WorkQueueFrontier frontier,
                                   CrawlURI curi,
                                   boolean expectedPresent)
                            throws java.io.IOException
Insert the given curi, whether it is already present or not. Hook for subclasses.

Parameters:
frontier - WorkQueueFrontier.
curi - CrawlURI to insert.
Throws:
java.io.IOException - if there was a problem while inserting the item

deleteMatchingFromQueue

protected abstract long deleteMatchingFromQueue(WorkQueueFrontier frontier,
                                                java.lang.String match)
                                         throws java.io.IOException
Delete URIs matching the given pattern from this queue.

Parameters:
frontier - WorkQueues manager.
match - the pattern to match
Returns:
count of deleted URIs
Throws:
java.io.IOException - if there was a problem while deleting

deleteItem

protected abstract void deleteItem(WorkQueueFrontier frontier,
                                   CrawlURI item)
                            throws java.io.IOException
Removes the given item from the queue. This is only used to remove the first item in the queue, so it is not necessary to implement a random-access queue.

Parameters:
frontier - Work queues manager.
Throws:
java.io.IOException - if there was a problem while deleting the item

peekItem

protected abstract CrawlURI peekItem(WorkQueueFrontier frontier)
                              throws java.io.IOException
Returns first item from queue (does not delete)

Returns:
The peeked item, or null
Throws:
java.io.IOException - if there was a problem while peeking
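
The four abstract hooks (insertItem, deleteMatchingFromQueue, deleteItem, peekItem) leave storage to the subclass while the base class keeps the count. The sketch below shows that division of labour with hypothetical classes. SketchWorkQueue is not the real WorkQueue: it drops the frontier argument, stores plain String URIs, and combines peek and dequeue into a removeFirst helper. Treating match as a Java regular expression, and wrapping IOException in an unchecked exception, are assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.regex.Pattern;

/** Hypothetical base class: keeps the count, delegates storage to the hooks. */
abstract class SketchWorkQueue {
    private long count;

    protected abstract void insertItem(String uri, boolean expectedPresent) throws IOException;
    protected abstract long deleteMatchingFromQueue(String match) throws IOException;
    protected abstract void deleteItem(String uri) throws IOException;
    protected abstract String peekItem() throws IOException;

    public void enqueue(String uri) {
        try {
            insertItem(uri, false);
            count++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Like enqueue, but the item is expected to be present and the count is untouched. */
    public void update(String uri) {
        try {
            insertItem(uri, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Peek at the first item, delete it, and adjust the count. */
    public String removeFirst() {
        try {
            String first = peekItem();
            if (first != null) {
                deleteItem(first);
                count--;
            }
            return first;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public long deleteMatching(String match) {
        try {
            long deleted = deleteMatchingFromQueue(match);
            count -= deleted;
            return deleted;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public long getCount() { return count; }
}

/** Hypothetical in-memory subclass; a FIFO deque is enough because only the head is deleted. */
class InMemorySketchQueue extends SketchWorkQueue {
    private final Deque<String> items = new ArrayDeque<>();

    @Override
    protected void insertItem(String uri, boolean expectedPresent) {
        // Strings carry no mutable state, so an item already present needs no rewrite.
        if (expectedPresent && items.contains(uri)) {
            return;
        }
        items.addLast(uri);
    }

    @Override
    protected long deleteMatchingFromQueue(String match) {
        Pattern pattern = Pattern.compile(match); // assumption: match is a regex
        long deleted = 0;
        for (Iterator<String> it = items.iterator(); it.hasNext();) {
            if (pattern.matcher(it.next()).matches()) {
                it.remove();
                deleted++;
            }
        }
        return deleted;
    }

    @Override
    protected void deleteItem(String uri) {
        if (!uri.equals(items.peekFirst())) {
            throw new IllegalArgumentException("only the first item can be deleted");
        }
        items.removeFirst();
    }

    @Override
    protected String peekItem() {
        return items.peekFirst();
    }
}
```

A persistent subclass would replace the deque with its own storage and let the hooks throw IOException; the counting logic in the base class would stay the same.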

suspend

protected void suspend(WorkQueueFrontier frontier)
                throws java.io.IOException
Suspends this WorkQueue. Closes all connections to resources etc.

Parameters:
frontier - Work queues manager.
Throws:
java.io.IOException

resume

protected void resume(WorkQueueFrontier frontier)
               throws java.io.IOException
Resumes this WorkQueue. Eventually opens connections to resources etc.

Parameters:
frontier - Work queues manager.
Throws:
java.io.IOException

setActive

public void setActive(WorkQueueFrontier frontier,
                      boolean b)

getReports

public java.lang.String[] getReports()
Description copied from interface: Reporter
Get an array of report names offered by this Reporter. A name in brackets indicates that a free-form String, in accordance with the informal description inside the brackets, may yield a useful report.

Specified by:
getReports in interface Reporter
Returns:
String array of report names, empty if there is only one report type

reportTo

public void reportTo(java.io.PrintWriter writer)
Description copied from interface: Reporter
Make a default report to the passed-in Writer. Should be equivalent to reportTo(null, writer)

Specified by:
reportTo in interface Reporter
Parameters:
writer - to receive report

singleLineReportTo

public void singleLineReportTo(java.io.PrintWriter writer)
Description copied from interface: Reporter
Make a single-line summary report to the passed-in writer

Specified by:
singleLineReportTo in interface Reporter
Parameters:
writer - to receive report

singleLineLegend

public java.lang.String singleLineLegend()
Description copied from interface: Reporter
Return a legend for the single-line summary report as a String.

Specified by:
singleLineLegend in interface Reporter
Returns:
String single-line summary legend

singleLineReport

public java.lang.String singleLineReport()
Description copied from interface: Reporter
Return a short single-line summary report as a String.

Specified by:
singleLineReport in interface Reporter
Returns:
String single-line summary report

reportTo

public void reportTo(java.lang.String name,
                     java.io.PrintWriter writer)
Description copied from interface: Reporter
Make a report of the given name to the passed-in Writer. If the name is null, give the default report.

Specified by:
reportTo in interface Reporter
Parameters:
name - name of the report to make, or null for the default report
writer - to receive report
Throws:
java.io.IOException
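
The relationships between the Reporter methods documented above can be sketched as follows: reportTo(writer) delegates to reportTo(null, writer), and singleLineReport() is the String form of singleLineReportTo(writer). ReportingDemoQueue is a hypothetical class, and the report layout (a legend line followed by "classKey count") is invented for illustration; it is not the format WorkQueue actually emits.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical class following the documented Reporter method relationships. */
class ReportingDemoQueue {
    private final String classKey;
    private final long count;

    ReportingDemoQueue(String classKey, long count) {
        this.classKey = classKey;
        this.count = count;
    }

    /** Empty array: only one report type is offered. */
    String[] getReports() {
        return new String[0];
    }

    /** Default report; equivalent to reportTo(null, writer). */
    void reportTo(PrintWriter writer) {
        reportTo(null, writer);
    }

    /** Assumption: with a single report type, every name yields the default report. */
    void reportTo(String name, PrintWriter writer) {
        writer.println(singleLineLegend());
        singleLineReportTo(writer);
        writer.println();
    }

    String singleLineLegend() {
        return "queue count"; // invented legend
    }

    void singleLineReportTo(PrintWriter writer) {
        writer.print(classKey + " " + count);
    }

    /** String form of the single-line report. */
    String singleLineReport() {
        StringWriter buffer = new StringWriter();
        PrintWriter writer = new PrintWriter(buffer);
        singleLineReportTo(writer);
        writer.flush();
        return buffer.toString();
    }
}
```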

getSubstats

public CrawlSubstats getSubstats()
Specified by:
getSubstats in interface CrawlSubstats.HasCrawlSubstats

setRetired

public void setRetired(boolean b)
Set the retired status of this queue.

Parameters:
b - new value for retired status

isRetired

public boolean isRetired()

getContextUURI

public UURI getContextUURI(WorkQueueFrontier wqf)


Copyright © 2003-2011 Internet Archive. All Rights Reserved.