java.lang.Object
  javax.management.Attribute
    org.archive.crawler.settings.Type
      org.archive.crawler.settings.ComplexType
        org.archive.crawler.settings.ModuleType
          org.archive.crawler.frontier.AbstractFrontier
            org.archive.crawler.frontier.WorkQueueFrontier
              org.archive.crawler.frontier.BdbFrontier
public class BdbFrontier
A Frontier using several BerkeleyDB JE Databases to hold its record of known hosts (queues), and pending URIs.
Nested Class Summary |
---|
Nested classes/interfaces inherited from class org.archive.crawler.frontier.WorkQueueFrontier |
---|
WorkQueueFrontier.WakeTask |
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType |
---|
ComplexType.MBeanAttributeInfoIterator |
Nested classes/interfaces inherited from interface org.archive.crawler.framework.Frontier |
---|
Frontier.FrontierGroup |
Field Summary | |
---|---|
static java.lang.String |
ATTR_DUMP_PENDING_AT_CLOSE
Whether to dump all pending URIs to the log when the frontier closes |
static java.lang.String |
ATTR_INCLUDED
URI-already-included to use (by class name) |
protected BdbMultipleWorkQueues |
pendingUris
all URIs scheduled to be crawled |
Fields inherited from class org.archive.crawler.settings.ComplexType |
---|
definition, definitionMap |
Fields inherited from interface org.archive.crawler.framework.Frontier |
---|
ATTR_NAME |
Constructor Summary | |
---|---|
BdbFrontier(java.lang.String name)
Constructor. |
|
BdbFrontier(java.lang.String name,
java.lang.String description)
Create the BdbFrontier |
Method Summary | |
---|---|
protected void |
closeQueue()
|
void |
crawlCheckpoint(java.io.File checkpointDir)
Called by CrawlController when checkpointing. |
void |
crawlEnded(java.lang.String sExitMessage)
Called when a CrawlController has ended a crawl and is about to exit. |
protected UriUniqFilter |
createAlreadyIncluded()
Create a UriUniqFilter that will serve as record of already seen URIs. |
protected UriUniqFilter |
deserializeAlreadySeen(java.lang.Class<? extends UriUniqFilter> cls,
java.io.File dir)
|
void |
dumpAllPendingToLog()
Dump all still-enqueued URIs to the crawl.log -- without actually dequeuing. |
void |
finalTasks()
Perform any final tasks *before* notification that the crawl has reached 'FINISHED' status. |
FrontierMarker |
getInitialMarker(java.lang.String regexpr,
boolean inCacheOnly)
Get a URIFrontierMarker initialized with the given
regular expression at the 'start' of the Frontier. |
protected WorkQueue |
getQueueFor(CrawlURI curi)
Return the work queue for the given CrawlURI's classKey. |
protected WorkQueue |
getQueueFor(java.lang.String classKey)
Return the work queue for the given classKey, or null if no such queue exists. |
java.util.ArrayList<java.lang.String> |
getURIsList(FrontierMarker marker,
int numberOfMatches,
boolean verbose)
Return a list of URLs. |
protected BdbMultipleWorkQueues |
getWorkQueues()
|
void |
initialize(CrawlController c)
Initializes the Frontier, given the supplied CrawlController. |
protected void |
initQueue()
|
protected void |
initQueuesOfQueues()
Set up the various queues-of-queues used by the frontier. |
protected java.util.Queue<java.lang.String> |
reinit(java.util.Queue<java.lang.String> q,
java.lang.String name)
|
protected boolean |
workQueueDataOnDisk()
Returns true if the WorkQueue implementation of this
Frontier stores its workload on disk instead of relying
on serialization mechanisms. |
Methods inherited from class org.archive.crawler.frontier.WorkQueueFrontier |
---|
appendQueueReports, asCrawlUri, averageDepth, congestionRatio, considerIncluded, deepestUri, deleted, deleteURIs, deleteURIs, discoveredUriCount, finished, forceWakeQueues, forget, getGroup, getReports, isEmpty, kickUpdate, next, receive, reportTo, schedule, sendToQueue, singleLineLegend, singleLineReportTo, wakeQueues, wakeQueuesAsIfAtTime |
Methods inherited from class org.archive.crawler.settings.ModuleType |
---|
addElement, listUsedFiles |
Methods inherited from class org.archive.crawler.settings.Type |
---|
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
Methods inherited from class javax.management.Attribute |
---|
getName, hashCode |
Methods inherited from class java.lang.Object |
---|
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
protected transient BdbMultipleWorkQueues pendingUris
public static final java.lang.String ATTR_INCLUDED
public static final java.lang.String ATTR_DUMP_PENDING_AT_CLOSE
Constructor Detail |
---|
public BdbFrontier(java.lang.String name)
Parameters:
name - Name of this Frontier.
public BdbFrontier(java.lang.String name, java.lang.String description)
Parameters:
name -
description -
Method Detail |
---|
protected void initQueuesOfQueues()
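Description copied from class: WorkQueueFrontier
Set up the various queues-of-queues used by the frontier.
Overrides: initQueuesOfQueues in class WorkQueueFrontier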
protected java.util.Queue<java.lang.String> reinit(java.util.Queue<java.lang.String> q, java.lang.String name)
protected UriUniqFilter createAlreadyIncluded() throws java.io.IOException
Overrides: createAlreadyIncluded in class WorkQueueFrontier
Throws:
java.io.IOException
protected UriUniqFilter deserializeAlreadySeen(java.lang.Class<? extends UriUniqFilter> cls, java.io.File dir) throws java.io.FileNotFoundException, java.io.IOException
Throws:
java.io.FileNotFoundException
java.io.IOException
protected WorkQueue getQueueFor(CrawlURI curi)
Overrides: getQueueFor in class WorkQueueFrontier
Parameters:
curi - CrawlURI to base queue on
protected WorkQueue getQueueFor(java.lang.String classKey)
Overrides: getQueueFor in class WorkQueueFrontier
Parameters:
classKey - key to look for
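The two getQueueFor overloads can be pictured as lookups in a table of queues indexed by class key. The following is only a minimal sketch under stated assumptions, not the BdbFrontier implementation: QueueLookupSketch, classKeyOf, queueForUri and queueForKey are hypothetical names, the class key is assumed to be the URI's host name, queues are plain in-memory deques rather than BerkeleyDB-backed ones, and the URI-based lookup is assumed to create a missing queue (the documentation above does not say so). Only the key-based lookup returning null for an unknown key comes from the method summary.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a frontier's table of per-key work queues.
// Not Heritrix code: real queues are WorkQueue objects backed by BerkeleyDB JE.
final class QueueLookupSketch {
    private final Map<String, Deque<String>> queues = new HashMap<>();

    // Assumption: the class key is the host part of the URI.
    static String classKeyOf(String uri) {
        String host = URI.create(uri).getHost();
        return host == null ? "" : host;
    }

    // URI-based lookup. Assumption: a missing queue is created on first use.
    Deque<String> queueForUri(String uri) {
        return queues.computeIfAbsent(classKeyOf(uri), k -> new ArrayDeque<>());
    }

    // Key-based lookup: returns null if no queue exists for the key.
    Deque<String> queueForKey(String classKey) {
        return queues.get(classKey);
    }

    public static void main(String[] args) {
        QueueLookupSketch frontier = new QueueLookupSketch();
        frontier.queueForUri("http://example.org/a").add("http://example.org/a");
        frontier.queueForUri("http://example.org/b").add("http://example.org/b");

        // Both URIs share one host, so they land in the same queue.
        System.out.println(frontier.queueForKey("example.org").size());
        // No URI for this host was scheduled, so the lookup yields null.
        System.out.println(frontier.queueForKey("example.com"));
    }
}
```

Callers of a null-returning lookup like queueForKey need an explicit null check before using the result.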
public FrontierMarker getInitialMarker(java.lang.String regexpr, boolean inCacheOnly)
Description copied from interface: Frontier
Get a URIFrontierMarker initialized with the given regular expression at the 'start' of the Frontier.
Specified by: getInitialMarker in interface Frontier
Parameters:
regexpr - The regular expression that URIs within the frontier must match to be considered within the scope of this marker.
inCacheOnly - If set to true, only those URIs within the frontier that are stored in cache (usually this means in memory rather than on disk, but that is an implementation detail) will be considered. Others will be entirely ignored, as if they don't exist. This is useful for quick peeks at the top of the URI list.
public java.util.ArrayList<java.lang.String> getURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose)
Specified by: getURIsList in interface Frontier
Parameters:
marker -
numberOfMatches -
verbose -
See Also: FrontierMarker, Frontier.getInitialMarker(String, boolean)
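The pairing of getInitialMarker and getURIsList suggests a paging pattern: a marker holds a regular expression and a resume position, and each call returns up to numberOfMatches matching URIs from that position. The following is a rough sketch of that idea only, assuming an in-memory list of pending URIs. MarkerPagingSketch, Marker, nextIndex and urisList are hypothetical names, not Heritrix classes or methods, and the inCacheOnly and verbose options are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of marker-based paging over pending URIs.
// Not the Heritrix implementation; the real frontier iterates BerkeleyDB data.
final class MarkerPagingSketch {

    // Stand-in for a frontier marker: a compiled pattern plus a resume position.
    static final class Marker {
        final Pattern pattern;
        int nextIndex = 0; // position of the next URI to examine

        Marker(String regexpr) {
            this.pattern = Pattern.compile(regexpr);
        }
    }

    // Return up to numberOfMatches URIs that fully match the marker's pattern,
    // advancing the marker so the next call continues where this one stopped.
    static ArrayList<String> urisList(List<String> pending, Marker marker, int numberOfMatches) {
        ArrayList<String> matches = new ArrayList<>();
        while (marker.nextIndex < pending.size() && matches.size() < numberOfMatches) {
            String uri = pending.get(marker.nextIndex++);
            if (marker.pattern.matcher(uri).matches()) {
                matches.add(uri);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> pending = List.of(
                "http://a.example/1",
                "http://b.example/1",
                "http://a.example/2",
                "http://a.example/3");
        Marker marker = new Marker(".*a\\.example.*");

        System.out.println(urisList(pending, marker, 2)); // first page
        System.out.println(urisList(pending, marker, 2)); // second page
        System.out.println(urisList(pending, marker, 2)); // empty once exhausted
    }
}
```

Keeping the position inside the marker lets a caller fetch successive pages without rescanning URIs that earlier calls already examined.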
protected void initQueue() throws java.io.IOException
Overrides: initQueue in class WorkQueueFrontier
Throws:
java.io.IOException
public void finalTasks()
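Description copied from interface: Frontier
Perform any final tasks *before* notification that the crawl has reached 'FINISHED' status.
Specified by: finalTasks in interface Frontier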
protected void closeQueue()
Overrides: closeQueue in class WorkQueueFrontier
protected BdbMultipleWorkQueues getWorkQueues()
protected boolean workQueueDataOnDisk()
Description copied from class: WorkQueueFrontier
Returns true if the WorkQueue implementation of this Frontier stores its workload on disk instead of relying on serialization mechanisms.
TODO: rename! (this is a very misleading name) or kill (don't see any implementations that return false)
Overrides: workQueueDataOnDisk in class WorkQueueFrontier
public void initialize(CrawlController c) throws FatalConfigurationException, java.io.IOException
Description copied from class: WorkQueueFrontier
Initializes the Frontier, given the supplied CrawlController.
Specified by: initialize in interface Frontier
Overrides: initialize in class WorkQueueFrontier
Parameters:
c - The CrawlController that created the Frontier.
Throws:
FatalConfigurationException - If provided settings are illegal or otherwise unusable.
java.io.IOException - If there is a problem reading settings or seeds file from disk.
See Also: Frontier.initialize(org.archive.crawler.framework.CrawlController)
public void crawlEnded(java.lang.String sExitMessage)
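Description copied from interface: CrawlStatusListener
Called when a CrawlController has ended a crawl and is about to exit.
Specified by: crawlEnded in interface CrawlStatusListener
Overrides: crawlEnded in class WorkQueueFrontier
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also: CrawlJob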
public void crawlCheckpoint(java.io.File checkpointDir) throws java.lang.Exception
Description copied from interface: CrawlStatusListener
Called by CrawlController when checkpointing.
Specified by: crawlCheckpoint in interface CrawlStatusListener
Overrides: crawlCheckpoint in class AbstractFrontier
Parameters:
checkpointDir - Checkpoint dir. Write checkpoint state here.
Throws:
java.lang.Exception - A fatal exception. Any exceptions that are let out of this checkpoint are assumed fatal and terminate further checkpoint processing.
public void dumpAllPendingToLog() throws com.sleepycat.je.DatabaseException
Throws:
com.sleepycat.je.DatabaseException