org.archive.crawler.frontier
Class AdaptiveRevisitHostQueue

java.lang.Object
  extended by org.archive.crawler.frontier.AdaptiveRevisitHostQueue
All Implemented Interfaces:
CoreAttributeConstants, CrawlSubstats.HasCrawlSubstats, Frontier.FrontierGroup, AdaptiveRevisitAttributeConstants

public class AdaptiveRevisitHostQueue
extends java.lang.Object
implements AdaptiveRevisitAttributeConstants, Frontier.FrontierGroup

A priority-based queue of CrawlURIs. Each queue should represent one host (although this class does not enforce that). Items are ordered by scheduling directive and then by time of next processing, and are also indexed by URI.

The HQ does no calculations on the 'time of next processing.' It always relies on values already set on the CrawlURI.

Note: This class is not thread safe. In a multithreaded environment the caller must ensure that no two threads make overlapping calls.
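
One way a caller might meet this requirement is to route every call through a single lock. The sketch below is illustrative only: SerializedAccessSketch and its in-memory Deque are hypothetical stand-ins and are not part of this package.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical wrapper that serializes all access to a queue that is not thread safe.
final class SerializedAccessSketch {
    // Stand-in for a host queue (assumption, not the real class).
    private final Deque<String> queue = new ArrayDeque<>();
    private final Object lock = new Object();

    void add(String uri) {
        synchronized (lock) {
            queue.addLast(uri);
        }
    }

    /** Returns the head URI, or null if nothing is queued. */
    String next() {
        synchronized (lock) {
            return queue.pollFirst();
        }
    }

    int size() {
        synchronized (lock) {
            return queue.size();
        }
    }
}
```

Because every method takes the same lock, no two calls can overlap, which is the property the note asks the caller to guarantee.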

Any BDB DatabaseException will be converted to an IOException by public methods. The original stack trace is preserved in place of the one created for the IOException, so that the true source of the exception is not lost.
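
A minimal sketch of such a conversion, assuming it is done by copying the stack trace onto the new exception. StandInDatabaseException is a hypothetical substitute for com.sleepycat.je.DatabaseException so that the example compiles without BDB on the classpath; the class's actual conversion code may differ.

```java
import java.io.IOException;

final class ExceptionConversionSketch {
    /** Hypothetical stand-in for com.sleepycat.je.DatabaseException. */
    static class StandInDatabaseException extends Exception {
        StandInDatabaseException(String message) {
            super(message);
        }
    }

    /** Builds an IOException that carries the original message and stack trace. */
    static IOException convert(StandInDatabaseException e) {
        IOException ioe = new IOException(e.getMessage());
        // Replace the trace captured at 'new IOException' with the original one.
        ioe.setStackTrace(e.getStackTrace());
        return ioe;
    }
}
```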

Author:
Kristinn Sigurdsson

Field Summary
protected  com.sleepycat.bind.serial.StoredClassCatalog classCatalog
          For BDB serialization of objects
protected  com.sleepycat.bind.EntryBinding crawlURIBinding
          A binding for the CrawlURIARWrapper object
(package private)  java.lang.String hostName
          Name of the host that this AdaptiveRevisitHostQueue represents
static int HQSTATE_BUSY
          HQ has maximum number of CrawlURI currently being processed.
static int HQSTATE_EMPTY
          HQ contains no queued CrawlURI elements.
static int HQSTATE_READY
          HQ has a CrawlURI ready for processing
static int HQSTATE_SNOOZED
          HQ is in a suspended state until it can be woken back up
(package private)  long inProcessing
          Number of URIs belonging to this queue that are being processed at the moment.
(package private)  long nextReadyTime
          Time (in milliseconds) when the HQ will next be ready to issue a URI for processing.
protected  com.sleepycat.bind.EntryBinding primaryKeyBinding
          A binding for the serialization of the primary key (URI string)
protected  com.sleepycat.je.Database primaryUriDB
          Database containing the URI priority queue, indexed by the URI string.
protected  com.sleepycat.je.Database processingUriDB
          A database containing those URIs that are currently being processed.
protected  com.sleepycat.je.SecondaryDatabase secondaryUriDB
          Secondary index into the primary DB, URIs indexed by the time when they can next be processed again.
(package private)  long size
          Size of queue.
(package private)  int state
          Last known state of HQ -- ALL methods should use getState() to read this value, never read it directly.
protected  CrawlSubstats substats
           
(package private)  int valence
          Number of simultaneous connections permitted to this host.
(package private)  long[] wakeUpTime
          Time (in milliseconds) when each URI 'slot' becomes available again.
 
Fields inherited from interface org.archive.crawler.frontier.AdaptiveRevisitAttributeConstants
A_CONTENT_STATE_KEY, A_DISCARD_REVISIT, A_FETCH_OVERDUE, A_LAST_CONTENT_DIGEST, A_LAST_DATESTAMP, A_LAST_ETAG, A_NUMBER_OF_VERSIONS, A_NUMBER_OF_VISITS, A_TIME_OF_NEXT_PROCESSING, A_WAIT_INTERVAL, A_WAIT_REEVALUATED, CONTENT_CHANGED, CONTENT_UNCHANGED, CONTENT_UNKNOWN
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Constructor Summary
AdaptiveRevisitHostQueue(java.lang.String hostName, com.sleepycat.je.Environment env, com.sleepycat.bind.serial.StoredClassCatalog catalog, int valence)
          Constructor
 
Method Summary
 void add(CrawlURI curi, boolean overrideSetTimeOnDups)
          Add a CrawlURI to this host queue.
protected  void addInProcessing(CrawlURI curi)
          Adds a CrawlURI to the list of CrawlURIs that belong to this HQ and are being processed at the moment.
 void close()
          Cleanup all open Berkeley Database objects.
protected  long countCrawlURIs()
          Count all entries in both primaryUriDB and processingUriDB.
protected  void deleteInProcessing(java.lang.String uri)
          Removes a URI from the list of URIs that belong to this HQ and are currently being processed.
protected  void flushProcessingURIs()
          Flush any CrawlURIs in the processingUriDB into the primaryUriDB.
protected  CrawlURI getCrawlURI(java.lang.String uri)
          Returns the CrawlURI associated with the specified URI (string) or null if no such CrawlURI is queued in this HQ.
 java.lang.String getHostName()
          Returns the HQ's name
 long getNextReadyTime()
          Returns the time when the HQ will next be ready to issue a URI.
 long getSize()
          Returns the size of the HQ.
 int getState()
          Returns the current state of the HQ.
 java.lang.String getStateByName()
          Same as getState() except this method returns a human readable name for the state instead of its constant integer value.
 CrawlSubstats getSubstats()
           
protected  boolean inProcessing(java.lang.String uri)
          Returns true if this HQ has a CrawlURI matching the uri string currently being processed.
 CrawlURI next()
          Returns the 'top' URI in the AdaptiveRevisitHostQueue.
 CrawlURI peek()
          Returns the URI with the earliest time of next processing.
protected  void reorder()
          Method is called whenever something has been done that might have changed the value of the 'published' time of next ready.
 java.lang.String report(int max)
          Returns a report detailing the status of this HQ.
protected  void setNextReadyTime(long newTime)
          Updates nextReadyTime (if smaller) with the supplied value
 void setOwner(AdaptiveRevisitQueueList owner)
          Set the AdaptiveRevisitQueueList object that contains this HQ.
protected  com.sleepycat.je.OperationStatus strictAdd(CrawlURI curi, boolean overrideDuplicates)
          An internal method for adding URIs to the queue.
 void update(CrawlURI curi, boolean needWait, long wakeupTime)
          Update CrawlURI that has completed processing.
 void update(CrawlURI curi, boolean needWait, long wakeupTime, boolean forgetURI)
          Update CrawlURI that has completed processing.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

HQSTATE_EMPTY

public static final int HQSTATE_EMPTY
HQ contains no queued CrawlURI elements. This state only occurs after queue creation and before the first add. After the first item is added the state can never become empty again.

See Also:
Constant Field Values

HQSTATE_READY

public static final int HQSTATE_READY
HQ has a CrawlURI ready for processing

See Also:
Constant Field Values

HQSTATE_BUSY

public static final int HQSTATE_BUSY
HQ has the maximum number of CrawlURIs currently being processed. This number is either equal to the 'valence' (maximum number of simultaneous connections to a host) or (if smaller) the total number of CrawlURIs in the HQ.

See Also:
Constant Field Values
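
The busy condition described above can be sketched as a comparison against the smaller of the valence and the queue size. BusyCheckSketch and isBusy are hypothetical names, and the treatment of a size of zero as "not busy" is an assumption; the class's own state logic may differ.

```java
final class BusyCheckSketch {
    /**
     * Returns true when the number of URIs in processing has reached the
     * limit: the valence or, if smaller, the total number of URIs in the queue.
     */
    static boolean isBusy(long inProcessing, int valence, long size) {
        if (size <= 0) {
            return false; // assumption: a queue with no URIs is not reported as busy
        }
        long limit = Math.min((long) valence, size);
        return inProcessing >= limit;
    }
}
```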

HQSTATE_SNOOZED

public static final int HQSTATE_SNOOZED
HQ is in a suspended state until it can be woken back up

See Also:
Constant Field Values

hostName

final java.lang.String hostName
Name of the host that this AdaptiveRevisitHostQueue represents


state

int state
Last known state of HQ -- ALL methods should use getState() to read this value, never read it directly.


nextReadyTime

long nextReadyTime
Time (in milliseconds) when the HQ will next be ready to issue a URI for processing. When setting this value, methods should use the setter method setNextReadyTime()


wakeUpTime

long[] wakeUpTime
Time (in milliseconds) when each URI 'slot' becomes available again.

Any positive value larger than the current time signifies a taken slot where the URI has completed processing but the politeness wait has not ended.

A zero or positive value smaller than the current time in milliseconds signifies an empty slot.

Any negative value signifies a slot for a URI that is being processed.

Methods should never write directly to this, rather use the updateWakeUpTimeSlot() and useWakeUpTimeSlot() methods as needed.
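
The three cases above can be expressed as a small classifier. This is a sketch, not code from the class: WakeUpSlotSketch, SlotStatus and classify are hypothetical names, and treating a value exactly equal to the current time as an empty slot is an assumption, since the description leaves that case open.

```java
final class WakeUpSlotSketch {
    enum SlotStatus { IN_PROCESSING, WAITING, EMPTY }

    /** Interprets one wakeUpTime entry relative to the current time in milliseconds. */
    static SlotStatus classify(long slotValue, long nowMillis) {
        if (slotValue < 0) {
            return SlotStatus.IN_PROCESSING; // negative: URI is being processed
        }
        if (slotValue > nowMillis) {
            return SlotStatus.WAITING; // future time: politeness wait still running
        }
        return SlotStatus.EMPTY; // zero or a past time (equal to now is an assumption)
    }
}
```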


valence

int valence
Number of simultaneous connections permitted to this host. That is, this many URIs can be issued before the state of the HQ becomes busy, and it stays busy until one of them is returned via the update method.


size

long size
Size of queue. That is, the number of CrawlURIs that have been added to it, including any that are currently being processed.


inProcessing

long inProcessing
Number of URIs belonging to this queue that are being processed at the moment. This number will always be in the range 0 to valence.


substats

protected CrawlSubstats substats

primaryUriDB

protected com.sleepycat.je.Database primaryUriDB
Database containing the URI priority queue, indexed by the URI string.


secondaryUriDB

protected com.sleepycat.je.SecondaryDatabase secondaryUriDB
Secondary index into the primary DB, URIs indexed by the time when they can next be processed again.


processingUriDB

protected com.sleepycat.je.Database processingUriDB
A database containing those URIs that are currently being processed.


classCatalog

protected com.sleepycat.bind.serial.StoredClassCatalog classCatalog
For BDB serialization of objects


primaryKeyBinding

protected com.sleepycat.bind.EntryBinding primaryKeyBinding
A binding for the serialization of the primary key (URI string)


crawlURIBinding

protected com.sleepycat.bind.EntryBinding crawlURIBinding
A binding for the CrawlURIARWrapper object

Constructor Detail

AdaptiveRevisitHostQueue

public AdaptiveRevisitHostQueue(java.lang.String hostName,
                                com.sleepycat.je.Environment env,
                                com.sleepycat.bind.serial.StoredClassCatalog catalog,
                                int valence)
                         throws java.io.IOException
Constructor

Parameters:
hostName - Name of the host this queue represents. This name must be unique for all HQs in the same Environment.
env - Berkeley DB Environment. All BDB databases created will use it.
catalog - Db for bdb class serialization.
valence - The total number of simultaneous URIs that the HQ can issue for processing. Once this many URIs have been issued for processing, the HQ will go into busy state until at least one of the URIs is updated. Value should be larger than zero. Zero and negative values are treated the same as 1.
Throws:
java.io.IOException - if an error occurs opening/creating the database
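
The handling of the valence argument can be sketched as a clamp to a minimum of 1. ValenceSketch and its methods are hypothetical, and sizing the slot array to one entry per permitted connection is an assumption based on the wakeUpTime description.

```java
final class ValenceSketch {
    /** Zero and negative requests are treated the same as 1. */
    static int effectiveValence(int requested) {
        return Math.max(1, requested);
    }

    /** Assumption: one wake-up slot per permitted simultaneous URI, all initially empty (0). */
    static long[] newSlots(int requested) {
        return new long[effectiveValence(requested)];
    }
}
```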
Method Detail

getHostName

public java.lang.String getHostName()
Returns the HQ's name

Returns:
the HQ's name

add

public void add(CrawlURI curi,
                boolean overrideSetTimeOnDups)
         throws java.io.IOException
Add a CrawlURI to this host queue.

Callers can optionally choose to have the time of next processing override the existing value for the URI if the existing value is later than the new one.

Parameters:
curi - The CrawlURI to add.
overrideSetTimeOnDups - If true then the time of next processing for the supplied URI will override any existing time for it already stored in the HQ. If false, then no changes will be made to any existing values of the URI. Note: Will never override with a later time.
Throws:
java.io.IOException - When an error occurs accessing the database
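
The duplicate handling described for add can be modelled with a plain map from URI string to time of next processing. AddOverrideSketch is a hypothetical in-memory model and says nothing about how the real method stores CrawlURIs in BDB.

```java
import java.util.HashMap;
import java.util.Map;

final class AddOverrideSketch {
    // URI string -> time of next processing in milliseconds (illustrative model only).
    private final Map<String, Long> nextProcessingTime = new HashMap<>();

    void add(String uri, long timeOfNextProcessing, boolean overrideSetTimeOnDups) {
        Long existing = nextProcessingTime.get(uri);
        if (existing == null) {
            // New URI: always stored.
            nextProcessingTime.put(uri, timeOfNextProcessing);
        } else if (overrideSetTimeOnDups && timeOfNextProcessing < existing) {
            // Duplicate: only an earlier time may replace the stored one.
            nextProcessingTime.put(uri, timeOfNextProcessing);
        }
        // Otherwise the existing entry is left unchanged.
    }

    /** Returns the stored time, or null if the URI is unknown. */
    Long timeFor(String uri) {
        return nextProcessingTime.get(uri);
    }
}
```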

strictAdd

protected com.sleepycat.je.OperationStatus strictAdd(CrawlURI curi,
                                                     boolean overrideDuplicates)
                                              throws com.sleepycat.je.DatabaseException
An internal method for adding URIs to the queue.

Parameters:
curi - The CrawlURI to add
overrideDuplicates - If true then any existing CrawlURI in the DB will be overwritten. If false insert into the queue is only performed if the key doesn't already exist.
Returns:
The OperationStatus object returned by the put method.
Throws:
com.sleepycat.je.DatabaseException

flushProcessingURIs

protected void flushProcessingURIs()
                            throws com.sleepycat.je.DatabaseException,
                                   java.io.IOException
Flush any CrawlURIs in the processingUriDB into the primaryUriDB. URIs flushed will have their 'time of next fetch' maintained and the nextReadyTime will be updated if needed.

No change is made to the list of available slots.

Throws:
com.sleepycat.je.DatabaseException - if one occurs while flushing
java.io.IOException

countCrawlURIs

protected long countCrawlURIs()
                       throws com.sleepycat.je.DatabaseException
Count all entries in both primaryUriDB and processingUriDB.

This method is needed since BDB does not provide a simple way of counting entries.

Note: This is an expensive operation, requires a loop through the entire queue!

Returns:
the number of distinct CrawlURIs in the HQ.
Throws:
com.sleepycat.je.DatabaseException

inProcessing

protected boolean inProcessing(java.lang.String uri)
                        throws com.sleepycat.je.DatabaseException
Returns true if this HQ has a CrawlURI matching the uri string currently being processed. False otherwise.

Parameters:
uri - Uri to check
Returns:
true if this HQ has a CrawlURI matching the uri string currently being processed. False otherwise.
Throws:
com.sleepycat.je.DatabaseException

deleteInProcessing

protected void deleteInProcessing(java.lang.String uri)
                           throws com.sleepycat.je.DatabaseException,
                                  java.io.IOException
Removes a URI from the list of URIs that belong to this HQ and are currently being processed.

This method does not return a value. If the URI is not on the list, an IllegalStateException is thrown.

Parameters:
uri - The URI string of the CrawlURI to delete.
Throws:
com.sleepycat.je.DatabaseException
java.lang.IllegalStateException - if the URI was not on the list
java.io.IOException

addInProcessing

protected void addInProcessing(CrawlURI curi)
                        throws com.sleepycat.je.DatabaseException,
                               java.lang.IllegalStateException
Adds a CrawlURI to the list of CrawlURIs that belong to this HQ and are being processed at the moment.

Parameters:
curi - The CrawlURI to add to the list
Throws:
com.sleepycat.je.DatabaseException
java.lang.IllegalStateException - if the CrawlURI is already in the list of URIs being processed.

getCrawlURI

protected CrawlURI getCrawlURI(java.lang.String uri)
                        throws com.sleepycat.je.DatabaseException
Returns the CrawlURI associated with the specified URI (string) or null if no such CrawlURI is queued in this HQ. If CrawlURI is being processed it is not considered to be queued and this method will return null for any such URIs.

Parameters:
uri - A string representing the URI
Returns:
the CrawlURI associated with the specified URI (string) or null if no such CrawlURI is queued in this HQ.
Throws:
com.sleepycat.je.DatabaseException - if an error occurs reading the database

update

public void update(CrawlURI curi,
                   boolean needWait,
                   long wakeupTime)
            throws java.lang.IllegalStateException,
                   java.io.IOException
Update CrawlURI that has completed processing.

Parameters:
curi - The CrawlURI. This must be a CrawlURI issued by this HQ's next() method.
needWait - If true then the URI was processed successfully, requiring a period of suspended action on that host. If valence is > 1 then separate times are maintained for each slot.
wakeupTime - If new state is snoozed then this parameter should contain the time (in milliseconds) when it will be safe to wake the HQ up again. Otherwise this parameter will be ignored.
Throws:
java.lang.IllegalStateException - if the CrawlURI does not match a CrawlURI issued for crawling by this HQ's next().
java.io.IOException - if an error occurs accessing the database

update

public void update(CrawlURI curi,
                   boolean needWait,
                   long wakeupTime,
                   boolean forgetURI)
            throws java.lang.IllegalStateException,
                   java.io.IOException
Update CrawlURI that has completed processing.

Parameters:
curi - The CrawlURI. This must be a CrawlURI issued by this HQ's next() method.
needWait - If true then the URI was processed successfully, requiring a period of suspended action on that host. If valence is > 1 then separate times are maintained for each slot.
wakeupTime - If new state is snoozed then this parameter should contain the time (in milliseconds) when it will be safe to wake the HQ up again. Otherwise this parameter will be ignored.
forgetURI - If true, the URI will be deleted from the queue.
Throws:
java.lang.IllegalStateException - if the CrawlURI does not match a CrawlURI issued for crawling by this HQ's next().
java.io.IOException - if an error occurs accessing the database

next

public CrawlURI next()
              throws java.lang.IllegalStateException,
                     java.io.IOException
Returns the 'top' URI in the AdaptiveRevisitHostQueue.

HQ state will be set to busy if this method returns normally.

Returns:
a CrawlURI ready for processing
Throws:
java.lang.IllegalStateException - if the HostQueue's current state is not ready
java.io.IOException - if an error occurs reading from the database

peek

public CrawlURI peek()
              throws java.lang.IllegalStateException,
                     java.io.IOException
Returns the URI with the earliest time of next processing. I.e. the URI at the head of this host based priority queue.

Note: This method will return the head CrawlURI regardless of whether it is safe to start processing it. The CrawlURI will remain in the queue. The returned CrawlURI should only be used for queue inspection; it cannot be updated and returned to the queue. To get URIs ready for processing use next().

Returns:
the URI with the earliest time of next processing or null if the queue is empty or all URIs are currently being processed.
Throws:
java.lang.IllegalStateException
java.io.IOException - if an error occurs reading from the database

getState

public int getState()
Returns the current state of the HQ.

Returns:
the current state of the HQ.
See Also:
HQSTATE_BUSY, HQSTATE_EMPTY, HQSTATE_READY, HQSTATE_SNOOZED

getNextReadyTime

public long getNextReadyTime()
Returns the time when the HQ will next be ready to issue a URI.

If the queue is in a snoozed state then this time will be in the future and reflects either the time when the HQ will again be able to issue URIs for processing because politeness constraints have ended, or the time when a URI next becomes available for a visit, whichever is larger.

If the queue is in a ready state this time will be in the past and reflects the earliest time when the HQ had a URI ready for processing, taking time spent snoozed for politeness concerns into account.

If the HQ is in any other state then the return value of this method is equal to Long.MAX_VALUE.

This value may change each time a URI is added, issued or updated.

Returns:
the time when the HQ will next be ready to issue a URI
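
The "whichever is larger" rule can be sketched as the maximum of the earliest free slot time and the earliest URI time. NextReadyTimeSketch is hypothetical; it assumes the wakeUpTime convention (negative means in processing) and assumes that a queue with every slot in processing reports Long.MAX_VALUE. The real computation may take more into account.

```java
final class NextReadyTimeSketch {
    /**
     * @param wakeUpTime      per-slot times; negative entries are URIs in processing
     * @param earliestUriTime time of next processing of the head URI, or
     *                        Long.MAX_VALUE if no URI is queued (assumption)
     */
    static long nextReadyTime(long[] wakeUpTime, long earliestUriTime) {
        long earliestSlot = Long.MAX_VALUE;
        for (long slot : wakeUpTime) {
            if (slot >= 0 && slot < earliestSlot) {
                earliestSlot = slot; // earliest time at which some slot is free
            }
        }
        if (earliestSlot == Long.MAX_VALUE) {
            return Long.MAX_VALUE; // every slot is in processing
        }
        // Ready only once a slot is free and a URI is due.
        return Math.max(earliestSlot, earliestUriTime);
    }
}
```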

setNextReadyTime

protected void setNextReadyTime(long newTime)
Updates nextReadyTime (if smaller) with the supplied value

Parameters:
newTime - the new value of nextReadyTime

reorder

protected void reorder()
Method is called whenever something has been done that might have changed the value of the 'published' time of next ready. If an owner has been specified it will be notified that the value may have changed.


getStateByName

public java.lang.String getStateByName()
Same as getState() except this method returns a human readable name for the state instead of its constant integer value.

Should only be used for reports, error messages and other strings intended for human eyes.

Returns:
the human readable name of the current state

getSize

public long getSize()
Returns the size of the HQ. That is, the number of URIs queued, including any that are currently being processed.

Returns:
the size of the HQ.

setOwner

public void setOwner(AdaptiveRevisitQueueList owner)
Set the AdaptiveRevisitQueueList object that contains this HQ. Will cause that object to be notified (via reorder()) when the value used for sorting the list of HQs changes.

Parameters:
owner - the AdaptiveRevisitQueueList object that contains this HQ.

close

public void close()
           throws java.io.IOException
Cleanup all open Berkeley Database objects.

Does not close the Environment.

Throws:
java.io.IOException - if an error occurs closing a database object

report

public java.lang.String report(int max)
Returns a report detailing the status of this HQ.

Parameters:
max - Maximum number of URIs to show. 0 equals no limit.
Returns:
a report detailing the status of this HQ.

getSubstats

public CrawlSubstats getSubstats()
Specified by:
getSubstats in interface CrawlSubstats.HasCrawlSubstats


Copyright © 2003-2011 Internet Archive. All Rights Reserved.