org.archive.crawler.frontier
Class BdbFrontier

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.frontier.AbstractFrontier
                      extended by org.archive.crawler.frontier.WorkQueueFrontier
                          extended by org.archive.crawler.frontier.BdbFrontier
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants, FetchStatusCodes, UriUniqFilter.HasUriReceiver, CrawlStatusListener, Frontier, Reporter
Direct Known Subclasses:
DomainSensitiveFrontier

public class BdbFrontier
extends WorkQueueFrontier
implements java.io.Serializable

A Frontier using several BerkeleyDB JE Databases to hold its record of known hosts (queues), and pending URIs.

Author:
Gordon Mohr
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.frontier.WorkQueueFrontier
WorkQueueFrontier.WakeTask
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Nested classes/interfaces inherited from interface org.archive.crawler.framework.Frontier
Frontier.FrontierGroup
 
Field Summary
static java.lang.String ATTR_DUMP_PENDING_AT_CLOSE
          Whether to dump pending URIs to crawl.log at close
static java.lang.String ATTR_INCLUDED
          Which UriUniqFilter (record of already-included URIs) to use, by class name
protected  BdbMultipleWorkQueues pendingUris
          all URIs scheduled to be crawled
 
Fields inherited from class org.archive.crawler.frontier.WorkQueueFrontier
ALL_NONEMPTY, ALL_QUEUES, allQueues, alreadyIncluded, ATTR_BALANCE_REPLENISH_AMOUNT, ATTR_COST_POLICY, ATTR_ERROR_PENALTY_AMOUNT, ATTR_HOLD_QUEUES, ATTR_QUEUE_TOTAL_BUDGET, ATTR_SNOOZE_DEACTIVATE_MS, ATTR_TARGET_READY_QUEUES_BACKLOG, AVAILABLE_COST_POLICIES, DEFAULT_BALANCE_REPLENISH_AMOUNT, DEFAULT_COST_POLICY, DEFAULT_ERROR_PENALTY_AMOUNT, DEFAULT_HOLD_QUEUES, DEFAULT_QUEUE_TOTAL_BUDGET, DEFAULT_SNOOZE_DEACTIVATE_MS, DEFAULT_TARGET_READY_QUEUES_BACKLOG, inactiveQueues, inProcessQueues, longestActiveQueue, nextWake, readyClassQueues, readyFiller, REPORTS, retiredQueues, snoozedClassQueues, STANDARD_REPORT, targetSizeForReadyQueues, wakeTimer
 
Fields inherited from class org.archive.crawler.frontier.AbstractFrontier
ACCEPTABLE_FORCE_QUEUE, ATTR_DELAY_FACTOR, ATTR_FORCE_QUEUE, ATTR_MAX_DELAY, ATTR_MAX_HOST_BANDWIDTH_USAGE, ATTR_MAX_OVERALL_BANDWIDTH_USAGE, ATTR_MAX_RETRIES, ATTR_MIN_DELAY, ATTR_PAUSE_AT_FINISH, ATTR_PAUSE_AT_START, ATTR_PREFERENCE_EMBED_HOPS, ATTR_QUEUE_ASSIGNMENT_POLICY, ATTR_RECOVERY_ENABLED, ATTR_RESPECT_CRAWL_DELAY_UP_TO_SECS, ATTR_RETRY_DELAY, ATTR_SOURCE_TAG_SEEDS, controller, DEFAULT_ATTR_RECOVERY_ENABLED, DEFAULT_DELAY_FACTOR, DEFAULT_FORCE_QUEUE, DEFAULT_MAX_DELAY, DEFAULT_MAX_HOST_BANDWIDTH_USAGE, DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE, DEFAULT_MAX_RETRIES, DEFAULT_MIN_DELAY, DEFAULT_PAUSE_AT_FINISH, DEFAULT_PAUSE_AT_START, DEFAULT_PREFERENCE_EMBED_HOPS, DEFAULT_RESPECT_CRAWL_DELAY_UP_TO_SECS, DEFAULT_RETRY_DELAY, DEFAULT_SOURCE_TAG_SEEDS, disregardedUriCount, failedFetchCount, IGNORED_SEEDS_FILENAME, lastMaxBandwidthKB, liveDisregardedUriCount, liveFailedFetchCount, liveQueuedUriCount, liveSucceededFetchCount, nextOrdinal, processedBytesAfterLastEmittedURI, queuedUriCount, shouldPause, shouldTerminate, succeededFetchCount, totalProcessedBytes
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Fields inherited from interface org.archive.crawler.framework.Frontier
ATTR_NAME
 
Constructor Summary
BdbFrontier(java.lang.String name)
          Constructor.
BdbFrontier(java.lang.String name, java.lang.String description)
          Create the BdbFrontier
 
Method Summary
protected  void closeQueue()
           
 void crawlCheckpoint(java.io.File checkpointDir)
          Called by CrawlController when checkpointing.
 void crawlEnded(java.lang.String sExitMessage)
          Called when a CrawlController has ended a crawl and is about to exit.
protected  UriUniqFilter createAlreadyIncluded()
          Create a UriUniqFilter that will serve as a record of already seen URIs.
protected  UriUniqFilter deserializeAlreadySeen(java.lang.Class<? extends UriUniqFilter> cls, java.io.File dir)
           
 void dumpAllPendingToLog()
          Dump all still-enqueued URIs to the crawl.log -- without actually dequeuing.
 void finalTasks()
          Perform any final tasks before notification that the crawl has reached 'FINISHED' status.
 FrontierMarker getInitialMarker(java.lang.String regexpr, boolean inCacheOnly)
          Get a FrontierMarker initialized with the given regular expression at the 'start' of the Frontier.
protected  WorkQueue getQueueFor(CrawlURI curi)
          Return the work queue for the given CrawlURI's classKey.
protected  WorkQueue getQueueFor(java.lang.String classKey)
          Return the work queue for the given classKey, or null if no such queue exists.
 java.util.ArrayList<java.lang.String> getURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose)
          Return a list of URIs.
protected  BdbMultipleWorkQueues getWorkQueues()
           
 void initialize(CrawlController c)
          Initializes the Frontier, given the supplied CrawlController.
protected  void initQueue()
           
protected  void initQueuesOfQueues()
          Set up the various queues-of-queues used by the frontier.
protected  java.util.Queue<java.lang.String> reinit(java.util.Queue<java.lang.String> q, java.lang.String name)
           
protected  boolean workQueueDataOnDisk()
          Returns true if the WorkQueue implementation of this Frontier stores its workload on disk instead of relying on serialization mechanisms.
 
Methods inherited from class org.archive.crawler.frontier.WorkQueueFrontier
appendQueueReports, asCrawlUri, averageDepth, congestionRatio, considerIncluded, deepestUri, deleted, deleteURIs, deleteURIs, discoveredUriCount, finished, forceWakeQueues, forget, getGroup, getReports, isEmpty, kickUpdate, next, receive, reportTo, schedule, sendToQueue, singleLineLegend, singleLineReportTo, wakeQueues, wakeQueuesAsIfAtTime
 
Methods inherited from class org.archive.crawler.frontier.AbstractFrontier
applySpecialHandling, canonicalize, canonicalize, crawlEnding, crawlPaused, crawlPausing, crawlResuming, crawlStarted, decrementQueuedCount, disregardedUriCount, doJournalAdded, doJournalDisregarded, doJournalEmitted, doJournalFinishedFailure, doJournalFinishedSuccess, doJournalRescheduled, failedFetchCount, finishedUriCount, getClassKey, getFrontierJournal, getQueueAssignmentPolicy, getServer, importRecoverLog, incrementDisregardedUriCount, incrementFailedFetchCount, incrementQueuedUriCount, incrementQueuedUriCount, incrementSucceededFetchCount, isDisregarded, loadSeeds, log, logLocalizedErrors, needsRetrying, noteAboutToEmit, overMaxRetries, pause, politenessDelayFor, preNext, queuedUriCount, reportTo, retryDelayFor, saveIgnoredItems, scratchDirFor, singleLineReport, start, succeededFetchCount, tally, terminate, totalBytesWritten, unpause
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

pendingUris

protected transient BdbMultipleWorkQueues pendingUris
all URIs scheduled to be crawled


ATTR_INCLUDED

public static final java.lang.String ATTR_INCLUDED
Which UriUniqFilter (record of already-included URIs) to use, by class name

See Also:
Constant Field Values

ATTR_DUMP_PENDING_AT_CLOSE

public static final java.lang.String ATTR_DUMP_PENDING_AT_CLOSE
Whether to dump pending URIs to crawl.log at close

See Also:
Constant Field Values
Constructor Detail

BdbFrontier

public BdbFrontier(java.lang.String name)
Constructor.

Parameters:
name - Name of this Frontier.

BdbFrontier

public BdbFrontier(java.lang.String name,
                   java.lang.String description)
Create the BdbFrontier

Parameters:
name - Name of this Frontier.
description - Description of this Frontier.
Method Detail

initQueuesOfQueues

protected void initQueuesOfQueues()
Description copied from class: WorkQueueFrontier
Set up the various queues-of-queues used by the frontier. Override in implementing subclasses to reduce or eliminate risk of queues growing without bound.

Overrides:
initQueuesOfQueues in class WorkQueueFrontier

reinit

protected java.util.Queue<java.lang.String> reinit(java.util.Queue<java.lang.String> q,
                                                   java.lang.String name)

createAlreadyIncluded

protected UriUniqFilter createAlreadyIncluded()
                                       throws java.io.IOException
Create a UriUniqFilter that will serve as a record of already seen URIs.

Specified by:
createAlreadyIncluded in class WorkQueueFrontier
Returns:
A UriUniqFilter that will serve as a record of already seen URIs
Throws:
java.io.IOException
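
A minimal sketch of what "a record of already seen URIs" means in practice, assuming a plain in-memory set rather than the BerkeleyDB-backed or Bloom-filter structures a real UriUniqFilter may use. The class SeenFilter and its methods are hypothetical and are not part of Heritrix; the sketch only shows the idea that a URI is handed on to a receiver the first time it is offered and silently dropped afterwards.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical stand-in for an already-seen filter; not Heritrix code. */
final class SeenFilter {
    private final Set<String> seen = new HashSet<>();
    private final Consumer<String> receiver;

    /** The receiver is whatever should get each novel URI (e.g. a scheduler). */
    SeenFilter(Consumer<String> receiver) {
        this.receiver = receiver;
    }

    /** Forwards the URI to the receiver only the first time it is offered. */
    boolean offer(String uri) {
        boolean isNew = seen.add(uri); // Set.add returns false for duplicates
        if (isNew) {
            receiver.accept(uri);
        }
        return isNew;
    }

    /** Number of distinct URIs recorded so far. */
    int count() {
        return seen.size();
    }
}
```

An in-memory set is only workable for small crawls; the point of choosing the structure by class name is that a disk-backed or probabilistic implementation can be swapped in.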

deserializeAlreadySeen

protected UriUniqFilter deserializeAlreadySeen(java.lang.Class<? extends UriUniqFilter> cls,
                                               java.io.File dir)
                                        throws java.io.FileNotFoundException,
                                               java.io.IOException
Throws:
java.io.FileNotFoundException
java.io.IOException

getQueueFor

protected WorkQueue getQueueFor(CrawlURI curi)
Return the work queue for the given CrawlURI's classKey. URIs are ordered and politeness-delayed within their 'class'.

Specified by:
getQueueFor in class WorkQueueFrontier
Parameters:
curi - CrawlURI to base queue on
Returns:
the found or created BdbWorkQueue

getQueueFor

protected WorkQueue getQueueFor(java.lang.String classKey)
Return the work queue for the given classKey, or null if no such queue exists.

Specified by:
getQueueFor in class WorkQueueFrontier
Parameters:
classKey - key to look for
Returns:
the found WorkQueue, or null if no such queue exists
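
The two getQueueFor overloads differ in what happens when no queue exists yet: the CrawlURI variant returns "the found or created" queue, while the String variant returns null. A minimal sketch of that contract, assuming an in-memory map in place of the BerkeleyDB-backed store; QueueRegistry, findOrCreate, and find are hypothetical names, not Heritrix API.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical stand-in for a frontier's queue lookup; not Heritrix code. */
final class QueueRegistry {
    // classKey -> FIFO of pending URI strings for that class
    private final Map<String, Queue<String>> queues = new HashMap<>();

    /** Find-or-create, analogous to the getQueueFor(CrawlURI) contract. */
    Queue<String> findOrCreate(String classKey) {
        return queues.computeIfAbsent(classKey, k -> new ArrayDeque<>());
    }

    /** Find-only, analogous to getQueueFor(String): null when absent. */
    Queue<String> find(String classKey) {
        return queues.get(classKey);
    }
}
```

Callers of the find-only form need a null check; callers of the find-or-create form do not, but they may cause a new queue to appear as a side effect.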

getInitialMarker

public FrontierMarker getInitialMarker(java.lang.String regexpr,
                                       boolean inCacheOnly)
Description copied from interface: Frontier
Get a FrontierMarker initialized with the given regular expression at the 'start' of the Frontier.

Specified by:
getInitialMarker in interface Frontier
Parameters:
regexpr - The regular expression that URIs within the frontier must match to be considered within the scope of this marker
inCacheOnly - If set to true, only those URIs within the frontier that are stored in cache (usually this means in memory rather than on disk, but that is an implementation detail) will be considered. Others will be entirely ignored, as if they don't exist. This is useful for quick peeks at the top of the URI list.
Returns:
A FrontierMarker that is set for the 'start' of the frontier's URI list.

getURIsList

public java.util.ArrayList<java.lang.String> getURIsList(FrontierMarker marker,
                                                         int numberOfMatches,
                                                         boolean verbose)
Return a list of URIs.

Specified by:
getURIsList in interface Frontier
Parameters:
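marker - FrontierMarker giving the position in the frontier at which to start listing
numberOfMatches - maximum number of matching URIs to return
verbose - if true, include additional detail about each URI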
Returns:
List of URIs (strings).
See Also:
FrontierMarker, Frontier.getInitialMarker(String, boolean)
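
A minimal sketch of how a marker and a bounded listing call can work together, assuming the marker remembers both the regular expression and the position at which the next listing should resume (this page does not state how the real FrontierMarker tracks position). MarkerPaging, Marker, and list are hypothetical names, and the pending URIs are held in a plain list rather than in BerkeleyDB.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of marker-based listing; not Heritrix code. */
final class MarkerPaging {

    /** Remembers the filter and where the next listing should resume. */
    static final class Marker {
        final Pattern pattern;
        int nextIndex = 0; // assumption: the marker advances as results are returned

        Marker(String regex) {
            this.pattern = Pattern.compile(regex);
        }
    }

    /** Returns up to max URIs fully matching the marker's pattern, advancing the marker. */
    static ArrayList<String> list(List<String> pending, Marker marker, int max) {
        ArrayList<String> out = new ArrayList<>();
        while (marker.nextIndex < pending.size() && out.size() < max) {
            String uri = pending.get(marker.nextIndex++);
            if (marker.pattern.matcher(uri).matches()) {
                out.add(uri);
            }
        }
        return out;
    }
}
```

Calling list repeatedly with the same marker pages through the matches; an empty result means the end of the pending list has been reached.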

initQueue

protected void initQueue()
                  throws java.io.IOException
Specified by:
initQueue in class WorkQueueFrontier
Throws:
java.io.IOException

finalTasks

public void finalTasks()
Description copied from interface: Frontier
Perform any final tasks before notification that the crawl has reached 'FINISHED' status. (For example, anything that needs to dump final data to disk/logs.)

Specified by:
finalTasks in interface Frontier

closeQueue

protected void closeQueue()
Specified by:
closeQueue in class WorkQueueFrontier

getWorkQueues

protected BdbMultipleWorkQueues getWorkQueues()

workQueueDataOnDisk

protected boolean workQueueDataOnDisk()
Description copied from class: WorkQueueFrontier
Returns true if the WorkQueue implementation of this Frontier stores its workload on disk instead of relying on serialization mechanisms. TODO: rename! (this is a very misleading name) or kill (don't see any implementations that return false)

Specified by:
workQueueDataOnDisk in class WorkQueueFrontier
Returns:
a constant boolean value for this class/instance

initialize

public void initialize(CrawlController c)
                throws FatalConfigurationException,
                       java.io.IOException
Description copied from class: WorkQueueFrontier
Initializes the Frontier, given the supplied CrawlController.

Specified by:
initialize in interface Frontier
Overrides:
initialize in class WorkQueueFrontier
Parameters:
c - The CrawlController that created the Frontier.
Throws:
FatalConfigurationException - If provided settings are illegal or otherwise unusable.
java.io.IOException - If there is a problem reading settings or seeds file from disk.
See Also:
Frontier.initialize(org.archive.crawler.framework.CrawlController)

crawlEnded

public void crawlEnded(java.lang.String sExitMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController has ended a crawl and is about to exit.

Specified by:
crawlEnded in interface CrawlStatusListener
Overrides:
crawlEnded in class WorkQueueFrontier
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlJob

crawlCheckpoint

public void crawlCheckpoint(java.io.File checkpointDir)
                     throws java.lang.Exception
Description copied from interface: CrawlStatusListener
Called by CrawlController when checkpointing.

Specified by:
crawlCheckpoint in interface CrawlStatusListener
Overrides:
crawlCheckpoint in class AbstractFrontier
Parameters:
checkpointDir - Checkpoint dir. Write checkpoint state here.
Throws:
java.lang.Exception - A fatal exception. Any exceptions that are let out of this checkpoint are assumed fatal and terminate further checkpoint processing.

dumpAllPendingToLog

public void dumpAllPendingToLog()
                         throws com.sleepycat.je.DatabaseException
Dump all still-enqueued URIs to the crawl.log -- without actually dequeuing. Useful for understanding what was remaining in a crawl that was ended early, for example at a time limit.

Throws:
com.sleepycat.je.DatabaseException
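
A minimal sketch of "dump without dequeuing", assuming an in-memory queue in place of the BerkeleyDB store and a returned list in place of crawl.log. PendingDump and dump are hypothetical names, not Heritrix API. The point is only that the dump iterates over the pending entries and never removes them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical sketch of a non-destructive dump; not Heritrix code. */
final class PendingDump {

    /** Copies every pending URI into a log list, leaving the queue untouched. */
    static List<String> dump(Queue<String> pending) {
        List<String> log = new ArrayList<>();
        for (String uri : pending) { // iteration only reads; nothing is polled
            log.add(uri);
        }
        return log;
    }
}
```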


Copyright © 2003-2011 Internet Archive. All Rights Reserved.