java.lang.Object org.archive.crawler.frontier.AdaptiveRevisitHostQueue
public class AdaptiveRevisitHostQueue
A priority based queue of CrawlURIs. Each queue should represent one host (although this is not enforced in this class). Items are ordered by the scheduling directive and time of next processing (in that order) and also indexed by the URI.
The HQ does no calculations on the 'time of next processing.' It always relies on values already set on the CrawlURI.
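The ordering rule described above can be illustrated with a plain comparator. This is a minimal sketch, not Heritrix code: `QueuedItem`, its fields, and the assumption that a numerically smaller directive value means higher priority are hypothetical stand-ins for the real CrawlURI.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a CrawlURI; not the Heritrix class. */
final class QueuedItem {
    final String uri;
    final int schedulingDirective;    // assumption: smaller value = more urgent
    final long timeOfNextProcessing;  // milliseconds, already set by the caller

    QueuedItem(String uri, int schedulingDirective, long timeOfNextProcessing) {
        this.uri = uri;
        this.schedulingDirective = schedulingDirective;
        this.timeOfNextProcessing = timeOfNextProcessing;
    }
}

final class HostQueueOrdering {
    /** Orders by scheduling directive first, then by time of next processing. */
    static final Comparator<QueuedItem> ORDER =
        Comparator.<QueuedItem>comparingInt(i -> i.schedulingDirective)
                  .thenComparingLong(i -> i.timeOfNextProcessing);

    /** Returns the URI that would sit at the head of the queue, or null if empty. */
    static String head(QueuedItem... items) {
        PriorityQueue<QueuedItem> queue = new PriorityQueue<>(ORDER);
        for (QueuedItem item : items) {
            queue.add(item);
        }
        return queue.isEmpty() ? null : queue.peek().uri;
    }
}
```

Note that the comparator only reads the time already stored on each item, matching the statement that the HQ performs no calculations on it.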
Note: This class is not thread safe. In a multithreaded environment the caller must ensure that two threads do not make overlapping calls.
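One way a caller might prevent overlapping calls is to guard every access with a shared lock. The sketch below assumes a hypothetical `UnsafeQueue` in place of the real host queue and uses the queue object itself as the monitor; other locking schemes would work equally well.

```java
final class CallerSideLocking {
    /** Hypothetical non-thread-safe queue: size++ is not atomic. */
    static final class UnsafeQueue {
        private long size;
        void add() { size++; }
        long getSize() { return size; }
    }

    /** Runs concurrent adds, serializing each call on the queue's monitor. */
    static long addFromThreads(int threads, int addsPerThread) {
        final UnsafeQueue queue = new UnsafeQueue();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < addsPerThread; i++) {
                    synchronized (queue) {   // the caller, not the queue, provides exclusion
                        queue.add();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        synchronized (queue) {
            return queue.getSize();
        }
    }
}
```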
Any BDB DatabaseException will be converted to an IOException by public methods. This includes preserving the original stacktrace, in favor of the one created for the IOException, so that the true source of the exception is not lost.
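The conversion described above could look roughly like the following. This is a sketch under stated assumptions: `StorageException` is a hypothetical stand-in for the BDB DatabaseException so that the example has no external dependencies, and the helper names are illustrative only.

```java
import java.io.IOException;

/** Hypothetical stand-in for com.sleepycat.je.DatabaseException. */
class StorageException extends Exception {
    StorageException(String message) { super(message); }
}

final class ExceptionConversion {
    /** Builds an IOException that carries the original message and stack trace. */
    static IOException toIOException(StorageException cause) {
        IOException converted = new IOException(cause.getMessage());
        // Replace the trace created for the IOException with the original one.
        converted.setStackTrace(cause.getStackTrace());
        return converted;
    }

    /** Simulates a low-level storage failure. */
    static void failingStorageCall() throws StorageException {
        throw new StorageException("simulated database failure");
    }

    /** Shape of a public method: storage errors surface as IOException. */
    static void publicOperation() throws IOException {
        try {
            failingStorageCall();
        } catch (StorageException e) {
            throw toIOException(e);
        }
    }
}
```

Because the stack trace is copied, the top frame of the resulting IOException still points at the method where the storage error was raised.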
Field Summary | |
---|---|
protected com.sleepycat.bind.serial.StoredClassCatalog |
classCatalog
For BDB serialization of objects |
protected com.sleepycat.bind.EntryBinding |
crawlURIBinding
A binding for the CrawlURIARWrapper object |
(package private) java.lang.String |
hostName
Name of the host that this AdaptiveRevisitHostQueue represents |
static int |
HQSTATE_BUSY
HQ has the maximum number of CrawlURIs currently being processed. |
static int |
HQSTATE_EMPTY
HQ contains no queued CrawlURIs. |
static int |
HQSTATE_READY
HQ has a CrawlURI ready for processing |
static int |
HQSTATE_SNOOZED
HQ is in a suspended state until it can be woken back up |
(package private) long |
inProcessing
Number of URIs belonging to this queue that are being processed at the moment. |
(package private) long |
nextReadyTime
Time (in milliseconds) when the HQ will next be ready to issue a URI for processing. |
protected com.sleepycat.bind.EntryBinding |
primaryKeyBinding
A binding for the serialization of the primary key (URI string) |
protected com.sleepycat.je.Database |
primaryUriDB
Database containing the URI priority queue, indexed by the URI string. |
protected com.sleepycat.je.Database |
processingUriDB
A database containing those URIs that are currently being processed. |
protected com.sleepycat.je.SecondaryDatabase |
secondaryUriDB
Secondary index into the primary DB, URIs indexed by the time when they can next be processed again. |
(package private) long |
size
Size of queue. |
(package private) int |
state
Last known state of HQ -- ALL methods should use getState() to read this value, never read it directly. |
protected CrawlSubstats |
substats
|
(package private) int |
valence
Number of simultaneous connections permitted to this host. |
(package private) long[] |
wakeUpTime
Time (in milliseconds) when each URI 'slot' becomes available again. |
Fields inherited from interface org.archive.crawler.frontier.AdaptiveRevisitAttributeConstants |
---|
A_CONTENT_STATE_KEY, A_DISCARD_REVISIT, A_FETCH_OVERDUE, A_LAST_CONTENT_DIGEST, A_LAST_DATESTAMP, A_LAST_ETAG, A_NUMBER_OF_VERSIONS, A_NUMBER_OF_VISITS, A_TIME_OF_NEXT_PROCESSING, A_WAIT_INTERVAL, A_WAIT_REEVALUATED, CONTENT_CHANGED, CONTENT_UNCHANGED, CONTENT_UNKNOWN |
Constructor Summary | |
---|---|
AdaptiveRevisitHostQueue(java.lang.String hostName,
com.sleepycat.je.Environment env,
com.sleepycat.bind.serial.StoredClassCatalog catalog,
int valence)
Constructor |
Method Summary | |
---|---|
void |
add(CrawlURI curi,
boolean overrideSetTimeOnDups)
Add a CrawlURI to this host queue. |
protected void |
addInProcessing(CrawlURI curi)
Adds a CrawlURI to the list of CrawlURIs that belong to this HQ and are being processed at the moment. |
void |
close()
Cleanup all open Berkeley Database objects. |
protected long |
countCrawlURIs()
Count all entries in both primaryUriDB and processingUriDB. |
protected void |
deleteInProcessing(java.lang.String uri)
Removes a URI from the list of URIs that belong to this HQ and are currently being processed. |
protected void |
flushProcessingURIs()
Flush any CrawlURIs in the processingUriDB into the primaryUriDB. |
protected CrawlURI |
getCrawlURI(java.lang.String uri)
Returns the CrawlURI associated with the specified URI (string) or null if no such CrawlURI is queued in this HQ. |
java.lang.String |
getHostName()
Returns the HQ's name |
long |
getNextReadyTime()
Returns the time when the HQ will next be ready to issue a URI. |
long |
getSize()
Returns the size of the HQ. |
int |
getState()
Returns the current state of the HQ. |
java.lang.String |
getStateByName()
Same as getState() except this method returns a human-readable name for the state instead of its constant integer value. |
CrawlSubstats |
getSubstats()
|
protected boolean |
inProcessing(java.lang.String uri)
Returns true if this HQ has a CrawlURI matching the uri string currently being processed. |
CrawlURI |
next()
Returns the 'top' URI in the AdaptiveRevisitHostQueue. |
CrawlURI |
peek()
Returns the URI with the earliest time of next processing. |
protected void |
reorder()
Method is called whenever something has been done that might have changed the value of the 'published' time of next ready. |
java.lang.String |
report(int max)
Returns a report detailing the status of this HQ. |
protected void |
setNextReadyTime(long newTime)
Updates nextReadyTime (if smaller) with the supplied value |
void |
setOwner(AdaptiveRevisitQueueList owner)
Set the AdaptiveRevisitQueueList object that contains this HQ. |
protected com.sleepycat.je.OperationStatus |
strictAdd(CrawlURI curi,
boolean overrideDuplicates)
An internal method for adding URIs to the queue. |
void |
update(CrawlURI curi,
boolean needWait,
long wakeupTime)
Update CrawlURI that has completed processing. |
void |
update(CrawlURI curi,
boolean needWait,
long wakeupTime,
boolean forgetURI)
Update CrawlURI that has completed processing. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int HQSTATE_EMPTY
public static final int HQSTATE_READY
public static final int HQSTATE_BUSY
public static final int HQSTATE_SNOOZED
final java.lang.String hostName
int state
long nextReadyTime
setNextReadyTime()
long[] wakeUpTime
Any positive value larger than the current time signifies a taken slot where the URI has completed processing but the politeness wait has not ended.
A zero or positive value smaller than the current time in milliseconds signifies an empty slot.
Any negative value signifies a slot for a URI that is being processed.
Methods should never write directly to this array; use the updateWakeUpTimeSlot() and useWakeUpTimeSlot() methods as needed.
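The three slot conditions for wakeUpTime can be expressed as a small classifier. This is a sketch, not the class's actual code: the `WakeUpSlots` class and `SlotStatus` enum are hypothetical, and a value exactly equal to the current time is treated as empty, which the description leaves unspecified.

```java
final class WakeUpSlots {
    /** Hypothetical labels for the three slot conditions. */
    enum SlotStatus { IN_PROCESSING, EMPTY, POLITENESS_WAIT }

    /** Classifies one wakeUpTime slot value relative to the current time. */
    static SlotStatus classify(long slotValue, long nowMillis) {
        if (slotValue < 0) {
            return SlotStatus.IN_PROCESSING;    // URI currently being processed
        }
        if (slotValue > nowMillis) {
            return SlotStatus.POLITENESS_WAIT;  // processed, wait not yet over
        }
        return SlotStatus.EMPTY;                // zero or a time in the past
    }

    /** Counts the slots that could accept a new URI right now. */
    static int countEmpty(long[] wakeUpTime, long nowMillis) {
        int empty = 0;
        for (long slot : wakeUpTime) {
            if (classify(slot, nowMillis) == SlotStatus.EMPTY) {
                empty++;
            }
        }
        return empty;
    }
}
```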
int valence
long size
long inProcessing
protected CrawlSubstats substats
protected com.sleepycat.je.Database primaryUriDB
protected com.sleepycat.je.SecondaryDatabase secondaryUriDB
Secondary index into the primary DB, URIs indexed by the time when they can next be processed again.
protected com.sleepycat.je.Database processingUriDB
protected com.sleepycat.bind.serial.StoredClassCatalog classCatalog
protected com.sleepycat.bind.EntryBinding primaryKeyBinding
protected com.sleepycat.bind.EntryBinding crawlURIBinding
Constructor Detail |
---|
public AdaptiveRevisitHostQueue(java.lang.String hostName, com.sleepycat.je.Environment env, com.sleepycat.bind.serial.StoredClassCatalog catalog, int valence) throws java.io.IOException
hostName - Name of the host this queue represents. This name must be unique for all HQs in the same Environment.
env - Berkeley DB Environment. All BDB databases created will use it.
catalog - DB for BDB class serialization.
valence - The total number of simultaneous URIs that the HQ can issue for processing. Once this many URIs have been issued for processing, the HQ will go into the busy state until at least one of the URIs is updated. The value should be larger than zero. Zero and negative values will be treated the same as 1.
java.io.IOException - if an error occurs opening/creating the database
Method Detail |
---|
public java.lang.String getHostName()
public void add(CrawlURI curi, boolean overrideSetTimeOnDups) throws java.io.IOException
Callers can optionally choose to have the new time of next processing override the existing value for the URI if the existing value is later than the new one.
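The duplicate-handling rule for add() can be sketched with an in-memory map. This is an illustration only: `DuplicateTimePolicy` is a hypothetical class, URIs are reduced to strings, and the real method stores CrawlURIs in Berkeley DB rather than a HashMap.

```java
import java.util.HashMap;
import java.util.Map;

final class DuplicateTimePolicy {
    // URI string -> time of next processing (milliseconds)
    private final Map<String, Long> timeOfNextProcessing = new HashMap<>();

    /**
     * Adds a URI, or updates its time if it is already queued.
     * Returns the time stored for the URI after the call.
     */
    long add(String uri, long newTime, boolean overrideSetTimeOnDups) {
        Long existing = timeOfNextProcessing.get(uri);
        if (existing == null) {
            timeOfNextProcessing.put(uri, newTime);  // not a duplicate
            return newTime;
        }
        if (overrideSetTimeOnDups && newTime < existing) {
            timeOfNextProcessing.put(uri, newTime);  // only ever move earlier
            return newTime;
        }
        return existing;                             // existing value is kept
    }
}
```

With this policy a duplicate add can bring a URI's processing time forward but can never push it back.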
curi - The CrawlURI to add.
overrideSetTimeOnDups - If true, the time of next processing for the supplied URI will override any existing time for it already stored in the HQ. If false, no changes will be made to any existing values of the URI. Note: Will never override with a later time.
java.io.IOException - When an error occurs accessing the database
protected com.sleepycat.je.OperationStatus strictAdd(CrawlURI curi, boolean overrideDuplicates) throws com.sleepycat.je.DatabaseException
curi - The CrawlURI to add
overrideDuplicates - If true then any existing CrawlURI in the DB will be overwritten. If false, the insert into the queue is only performed if the key doesn't already exist.
com.sleepycat.je.DatabaseException
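The overrideDuplicates behaviour of strictAdd amounts to choosing between an unconditional put and an insert-if-absent. The sketch below assumes a HashMap in place of the primary database and a hypothetical `Status` enum in place of the returned OperationStatus; it does not show how the real method talks to Berkeley DB.

```java
import java.util.HashMap;
import java.util.Map;

final class StrictAddSketch {
    /** Hypothetical result codes standing in for an operation status. */
    enum Status { SUCCESS, KEY_EXISTS }

    // URI string -> stored payload (stand-in for a serialized CrawlURI)
    private final Map<String, String> primary = new HashMap<>();

    /** Inserts the entry, overwriting an existing key only when asked to. */
    Status strictAdd(String uri, String payload, boolean overrideDuplicates) {
        if (overrideDuplicates) {
            primary.put(uri, payload);  // overwrite whatever is there
            return Status.SUCCESS;
        }
        // Insert only when the key is not already present.
        return primary.putIfAbsent(uri, payload) == null
                ? Status.SUCCESS
                : Status.KEY_EXISTS;
    }

    /** Returns the stored payload, or null if the URI is not queued. */
    String get(String uri) {
        return primary.get(uri);
    }
}
```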
protected void flushProcessingURIs() throws com.sleepycat.je.DatabaseException, java.io.IOException
No change is made to the list of available slots.
com.sleepycat.je.DatabaseException - if one occurs while flushing
java.io.IOException
protected long countCrawlURIs() throws com.sleepycat.je.DatabaseException
This method is needed since BDB does not provide a simple way of counting entries.
Note: This is an expensive operation that requires a loop through the entire queue!
com.sleepycat.je.DatabaseException
protected boolean inProcessing(java.lang.String uri) throws com.sleepycat.je.DatabaseException
uri - URI to check
com.sleepycat.je.DatabaseException
protected void deleteInProcessing(java.lang.String uri) throws com.sleepycat.je.DatabaseException, java.io.IOException
The method is declared void, so a URI that is not found is signalled by an IllegalStateException rather than by a return value.
uri - The URI string of the CrawlURI to delete.
com.sleepycat.je.DatabaseException
java.lang.IllegalStateException - if the URI was not on the list
java.io.IOException
protected void addInProcessing(CrawlURI curi) throws com.sleepycat.je.DatabaseException, java.lang.IllegalStateException
curi - The CrawlURI to add to the list
com.sleepycat.je.DatabaseException
java.lang.IllegalStateException - if the CrawlURI is already in the list of URIs being processed.
protected CrawlURI getCrawlURI(java.lang.String uri) throws com.sleepycat.je.DatabaseException
uri - A string representing the URI
com.sleepycat.je.DatabaseException - if an error occurs reading the database
public void update(CrawlURI curi, boolean needWait, long wakeupTime) throws java.lang.IllegalStateException, java.io.IOException
curi - The CrawlURI. This must be a CrawlURI issued by this HQ's next() method.
needWait - If true then the URI was processed successfully, requiring a period of suspended action on that host. If valence is > 1 then separate times are maintained for each slot.
wakeupTime - If the new state is snoozed then this parameter should contain the time (in milliseconds) when it will be safe to wake the HQ up again. Otherwise this parameter will be ignored.
java.lang.IllegalStateException - if the CrawlURI does not match a CrawlURI issued for crawling by this HQ's next().
java.io.IOException - if an error occurs accessing the database
public void update(CrawlURI curi, boolean needWait, long wakeupTime, boolean forgetURI) throws java.lang.IllegalStateException, java.io.IOException
curi - The CrawlURI. This must be a CrawlURI issued by this HQ's next() method.
needWait - If true then the URI was processed successfully, requiring a period of suspended action on that host. If valence is > 1 then separate times are maintained for each slot.
wakeupTime - If the new state is snoozed then this parameter should contain the time (in milliseconds) when it will be safe to wake the HQ up again. Otherwise this parameter will be ignored.
forgetURI - If true, the URI will be deleted from the queue.
java.lang.IllegalStateException - if the CrawlURI does not match a CrawlURI issued for crawling by this HQ's next().
java.io.IOException - if an error occurs accessing the database
public CrawlURI next() throws java.lang.IllegalStateException, java.io.IOException
HQ state will be set to busy if this method returns normally.
java.lang.IllegalStateException - if the HostQueue's current state is not ready
java.io.IOException - if an error occurs reading from the database
public CrawlURI peek() throws java.lang.IllegalStateException, java.io.IOException
Note: This method will return the head CrawlURI regardless of whether it is safe to start processing it or not. The CrawlURI will remain in the queue. The returned CrawlURI should only be used for queue inspection; it cannot be updated and returned to the queue. To get URIs ready for processing use next().
java.lang.IllegalStateException
java.io.IOException - if an error occurs reading from the database
public int getState()
HQSTATE_BUSY, HQSTATE_EMPTY, HQSTATE_READY, HQSTATE_SNOOZED
public long getNextReadyTime()
If the queue is in a snoozed state then this time will be in the future and reflects either the time when the HQ will again be able to issue URIs for processing because politeness constraints have ended, or when a URI next becomes available for visit, whichever is larger.
If the queue is in a ready state this time will be in the past and reflect the earliest time when the HQ had a URI ready for processing, taking time spent snoozed for politeness concerns into account.
If the HQ is in any other state then the return value of this method is equal to Long.MAX_VALUE.
This value may change each time a URI is added, issued or updated.
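The state-dependent rules for the next ready time can be summarised in a few lines. This is one reading of the description, not the class's actual code: the state constant values, the class name, and the two input parameters are hypothetical, and the ready case is assumed to use the same "whichever is larger" rule as the snoozed case.

```java
final class NextReadyTimeSketch {
    // Hypothetical constant values; the real values are not specified here.
    static final int EMPTY = 0;
    static final int READY = 1;
    static final int BUSY = 2;
    static final int SNOOZED = 3;

    /**
     * @param politenessEnds  time (ms) when the politeness wait ends
     * @param earliestUriTime time (ms) when the first URI becomes due
     */
    static long nextReadyTime(int state, long politenessEnds, long earliestUriTime) {
        switch (state) {
            case SNOOZED:  // result lies in the future
            case READY:    // result lies in the past
                return Math.max(politenessEnds, earliestUriTime);
            default:       // busy or empty: nothing can be issued
                return Long.MAX_VALUE;
        }
    }
}
```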
protected void setNextReadyTime(long newTime)
newTime - the new value of nextReadyTime
protected void reorder()
public java.lang.String getStateByName()
Same as getState() except this method returns a human-readable name for the state instead of its constant integer value.
Should only be used for reports, error messages and other strings intended for human eyes.
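A mapping from state constants to readable names might be written as follows. This is a sketch only: the integer values assigned to the constants and the exact strings returned are assumptions, since neither is given in this documentation.

```java
final class StateNames {
    // Hypothetical values for the HQSTATE_* constants.
    static final int HQSTATE_EMPTY = 0;
    static final int HQSTATE_READY = 1;
    static final int HQSTATE_BUSY = 2;
    static final int HQSTATE_SNOOZED = 3;

    /** Returns a label for reports and error messages, not for program logic. */
    static String stateName(int state) {
        switch (state) {
            case HQSTATE_EMPTY:   return "empty";
            case HQSTATE_READY:   return "ready";
            case HQSTATE_BUSY:    return "busy";
            case HQSTATE_SNOOZED: return "snoozed";
            default:              return "undefined";
        }
    }
}
```

Program logic should keep comparing the integer returned by getState(); the string form is meant for human readers only.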
public long getSize()
public void setOwner(AdaptiveRevisitQueueList owner)
The owner is notified via reorder() when the value used for sorting the list of HQs changes.
owner - the AdaptiveRevisitQueueList object that contains this HQ.
public void close() throws java.io.IOException
Does not close the Environment.
java.io.IOException - if an error occurs closing a database object
public java.lang.String report(int max)
max - Maximum number of URIs to show. 0 equals no limit.
public CrawlSubstats getSubstats()
Specified by: getSubstats in interface CrawlSubstats.HasCrawlSubstats