org.archive.crawler.frontier
Class DomainSensitiveFrontier

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.frontier.AbstractFrontier
                      extended by org.archive.crawler.frontier.WorkQueueFrontier
                          extended by org.archive.crawler.frontier.BdbFrontier
                              extended by org.archive.crawler.frontier.DomainSensitiveFrontier
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants, FetchStatusCodes, UriUniqFilter.HasUriReceiver, CrawlStatusListener, CrawlURIDispositionListener, Frontier, Reporter

Deprecated. As of release 1.10.0. Replaced by BdbFrontier and QuotaEnforcer.

public class DomainSensitiveFrontier
extends BdbFrontier
implements CrawlURIDispositionListener

Behaves like BdbFrontier (i.e., a basic, mostly breadth-first frontier), but additionally lets you set the number of documents to download on a per-site basis. Useful for frequent revisits of a site that changes often.

Choose the number of docs you want to download and specify the count in max-docs. If count-per-host is true (the default), the crawler will download max-docs per host. If you create an override, the overridden max-docs count applies instead, whether it is higher or lower.

If count-per-host is false, then max-docs acts like the crawl order's max-docs, and the crawler will download only this total number of docs. Overrides will download max-docs total in the overridden domain.
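
The counting rule described above can be sketched in isolation. The following is a minimal, self-contained illustration and not the Heritrix implementation: the class MaxDocsQuotaSketch, its methods, and the use of plain host strings are all hypothetical. It also assumes one reading of the override rule, namely that an overridden host is counted in its own bucket and does not draw on the crawl-wide total.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in (not a Heritrix class) modelling the max-docs rule. */
final class MaxDocsQuotaSketch {
    // Bucket key for the crawl-wide counter; the empty string is never a host name.
    private static final String GLOBAL = "";

    private final long maxDocs;          // plays the role of the max-docs setting
    private final boolean countPerHost;  // plays the role of count-per-host
    private final Map<String, Long> overrides = new HashMap<>();
    private final Map<String, Long> counts = new HashMap<>();

    MaxDocsQuotaSketch(long maxDocs, boolean countPerHost) {
        this.maxDocs = maxDocs;
        this.countPerHost = countPerHost;
    }

    /** Registers an overridden limit for one host (higher or lower than maxDocs). */
    void override(String host, long limit) {
        overrides.put(host, limit);
    }

    /** Counts one document for the host and returns true, or returns false if the quota is used up. */
    boolean tryCount(String host) {
        Long override = overrides.get(host);
        // Assumption: overridden hosts always get their own counter.
        String bucket = (countPerHost || override != null) ? host : GLOBAL;
        long limit = (override != null) ? override : maxDocs;
        long seen = counts.getOrDefault(bucket, 0L);
        if (seen >= limit) {
            return false;
        }
        counts.put(bucket, seen + 1);
        return true;
    }

    public static void main(String[] args) {
        MaxDocsQuotaSketch quota = new MaxDocsQuotaSketch(2, true);
        quota.override("big.example.org", 3);
        for (int i = 0; i < 4; i++) {
            System.out.println("a.example.org: " + quota.tryCount("a.example.org")
                    + ", big.example.org: " + quota.tryCount("big.example.org"));
        }
    }
}
```

Keying both modes on a single counter map keeps the two cases symmetrical: the only difference between per-host and crawl-wide counting is which bucket a document is charged to.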

Author:
Oskar Grenholm
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.frontier.WorkQueueFrontier
WorkQueueFrontier.WakeTask
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Nested classes/interfaces inherited from interface org.archive.crawler.framework.Frontier
Frontier.FrontierGroup
 
Field Summary
static java.lang.String[] ATTR_AVAILABLE_MODES
          Deprecated.  
static java.lang.String ATTR_COUNTER_MODE
          Deprecated.  
static java.lang.String ATTR_MAX_DOCS
          Deprecated.  
static java.lang.String COUNT_DOMAIN
          Deprecated.  
static java.lang.String COUNT_HOST
          Deprecated.  
static java.lang.String COUNT_OVERRIDE
          Deprecated.  
static java.lang.String DEFAULT_MODE
          Deprecated.  
 
Fields inherited from class org.archive.crawler.frontier.BdbFrontier
ATTR_DUMP_PENDING_AT_CLOSE, ATTR_INCLUDED, pendingUris
 
Fields inherited from class org.archive.crawler.frontier.WorkQueueFrontier
ALL_NONEMPTY, ALL_QUEUES, allQueues, alreadyIncluded, ATTR_BALANCE_REPLENISH_AMOUNT, ATTR_COST_POLICY, ATTR_ERROR_PENALTY_AMOUNT, ATTR_HOLD_QUEUES, ATTR_QUEUE_TOTAL_BUDGET, ATTR_SNOOZE_DEACTIVATE_MS, ATTR_TARGET_READY_QUEUES_BACKLOG, AVAILABLE_COST_POLICIES, DEFAULT_BALANCE_REPLENISH_AMOUNT, DEFAULT_COST_POLICY, DEFAULT_ERROR_PENALTY_AMOUNT, DEFAULT_HOLD_QUEUES, DEFAULT_QUEUE_TOTAL_BUDGET, DEFAULT_SNOOZE_DEACTIVATE_MS, DEFAULT_TARGET_READY_QUEUES_BACKLOG, inactiveQueues, inProcessQueues, longestActiveQueue, nextWake, readyClassQueues, readyFiller, REPORTS, retiredQueues, snoozedClassQueues, STANDARD_REPORT, targetSizeForReadyQueues, wakeTimer
 
Fields inherited from class org.archive.crawler.frontier.AbstractFrontier
ACCEPTABLE_FORCE_QUEUE, ATTR_DELAY_FACTOR, ATTR_FORCE_QUEUE, ATTR_MAX_DELAY, ATTR_MAX_HOST_BANDWIDTH_USAGE, ATTR_MAX_OVERALL_BANDWIDTH_USAGE, ATTR_MAX_RETRIES, ATTR_MIN_DELAY, ATTR_PAUSE_AT_FINISH, ATTR_PAUSE_AT_START, ATTR_PREFERENCE_EMBED_HOPS, ATTR_QUEUE_ASSIGNMENT_POLICY, ATTR_RECOVERY_ENABLED, ATTR_RESPECT_CRAWL_DELAY_UP_TO_SECS, ATTR_RETRY_DELAY, ATTR_SOURCE_TAG_SEEDS, controller, DEFAULT_ATTR_RECOVERY_ENABLED, DEFAULT_DELAY_FACTOR, DEFAULT_FORCE_QUEUE, DEFAULT_MAX_DELAY, DEFAULT_MAX_HOST_BANDWIDTH_USAGE, DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE, DEFAULT_MAX_RETRIES, DEFAULT_MIN_DELAY, DEFAULT_PAUSE_AT_FINISH, DEFAULT_PAUSE_AT_START, DEFAULT_PREFERENCE_EMBED_HOPS, DEFAULT_RESPECT_CRAWL_DELAY_UP_TO_SECS, DEFAULT_RETRY_DELAY, DEFAULT_SOURCE_TAG_SEEDS, disregardedUriCount, failedFetchCount, IGNORED_SEEDS_FILENAME, lastMaxBandwidthKB, liveDisregardedUriCount, liveFailedFetchCount, liveQueuedUriCount, liveSucceededFetchCount, nextOrdinal, processedBytesAfterLastEmittedURI, queuedUriCount, shouldPause, shouldTerminate, succeededFetchCount, totalProcessedBytes
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Fields inherited from interface org.archive.crawler.framework.Frontier
ATTR_NAME
 
Constructor Summary
DomainSensitiveFrontier(java.lang.String name)
          Deprecated.  
 
Method Summary
 void crawledURIDisregard(CrawlURI curi)
          Deprecated. Notification of a crawled URI that is to be disregarded.
 void crawledURIFailure(CrawlURI curi)
          Deprecated. Notification of a failed crawling of a URI.
 void crawledURINeedRetry(CrawlURI curi)
          Deprecated. Notification of a failed crawl of a URI that will be retried (failure due to possible transient problems).
 void crawledURISuccessful(CrawlURI curi)
          Deprecated. Notification of a successfully crawled URI.
protected  void incrementHostCounters(CrawlURI curi)
          Deprecated.  
 void initialize(CrawlController c)
          Deprecated. Initializes the Frontier, given the supplied CrawlController.
 
Methods inherited from class org.archive.crawler.frontier.BdbFrontier
closeQueue, crawlCheckpoint, crawlEnded, createAlreadyIncluded, deserializeAlreadySeen, dumpAllPendingToLog, finalTasks, getInitialMarker, getQueueFor, getQueueFor, getURIsList, getWorkQueues, initQueue, initQueuesOfQueues, reinit, workQueueDataOnDisk
 
Methods inherited from class org.archive.crawler.frontier.WorkQueueFrontier
appendQueueReports, asCrawlUri, averageDepth, congestionRatio, considerIncluded, deepestUri, deleted, deleteURIs, deleteURIs, discoveredUriCount, finished, forceWakeQueues, forget, getGroup, getReports, isEmpty, kickUpdate, next, receive, reportTo, schedule, sendToQueue, singleLineLegend, singleLineReportTo, wakeQueues, wakeQueuesAsIfAtTime
 
Methods inherited from class org.archive.crawler.frontier.AbstractFrontier
applySpecialHandling, canonicalize, canonicalize, crawlEnding, crawlPaused, crawlPausing, crawlResuming, crawlStarted, decrementQueuedCount, disregardedUriCount, doJournalAdded, doJournalDisregarded, doJournalEmitted, doJournalFinishedFailure, doJournalFinishedSuccess, doJournalRescheduled, failedFetchCount, finishedUriCount, getClassKey, getFrontierJournal, getQueueAssignmentPolicy, getServer, importRecoverLog, incrementDisregardedUriCount, incrementFailedFetchCount, incrementQueuedUriCount, incrementQueuedUriCount, incrementSucceededFetchCount, isDisregarded, loadSeeds, log, logLocalizedErrors, needsRetrying, noteAboutToEmit, overMaxRetries, pause, politenessDelayFor, preNext, queuedUriCount, reportTo, retryDelayFor, saveIgnoredItems, scratchDirFor, singleLineReport, start, succeededFetchCount, tally, terminate, totalBytesWritten, unpause
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

ATTR_MAX_DOCS

public static final java.lang.String ATTR_MAX_DOCS
Deprecated. 
See Also:
Constant Field Values

ATTR_COUNTER_MODE

public static final java.lang.String ATTR_COUNTER_MODE
Deprecated. 
See Also:
Constant Field Values

COUNT_OVERRIDE

public static final java.lang.String COUNT_OVERRIDE
Deprecated. 
See Also:
Constant Field Values

COUNT_HOST

public static final java.lang.String COUNT_HOST
Deprecated. 
See Also:
Constant Field Values

COUNT_DOMAIN

public static final java.lang.String COUNT_DOMAIN
Deprecated. 
See Also:
Constant Field Values

ATTR_AVAILABLE_MODES

public static final java.lang.String[] ATTR_AVAILABLE_MODES
Deprecated. 

DEFAULT_MODE

public static final java.lang.String DEFAULT_MODE
Deprecated. 
See Also:
Constant Field Values
Constructor Detail

DomainSensitiveFrontier

public DomainSensitiveFrontier(java.lang.String name)
Deprecated. 
Method Detail

initialize

public void initialize(CrawlController c)
                throws FatalConfigurationException,
                       java.io.IOException
Deprecated. 
Description copied from class: WorkQueueFrontier
Initializes the Frontier, given the supplied CrawlController.

Specified by:
initialize in interface Frontier
Overrides:
initialize in class BdbFrontier
Parameters:
c - The CrawlController that created the Frontier.
Throws:
FatalConfigurationException - If provided settings are illegal or otherwise unusable.
java.io.IOException - If there is a problem reading settings or seeds file from disk.
See Also:
Frontier.initialize(org.archive.crawler.framework.CrawlController)

incrementHostCounters

protected void incrementHostCounters(CrawlURI curi)
Deprecated. 

crawledURISuccessful

public void crawledURISuccessful(CrawlURI curi)
Deprecated. 
Description copied from interface: CrawlURIDispositionListener
Notification of a successfully crawled URI.

Specified by:
crawledURISuccessful in interface CrawlURIDispositionListener
Parameters:
curi - The relevant CrawlURI

crawledURINeedRetry

public void crawledURINeedRetry(CrawlURI curi)
Deprecated. 
Description copied from interface: CrawlURIDispositionListener
Notification of a failed crawl of a URI that will be retried (failure due to possible transient problems).

Specified by:
crawledURINeedRetry in interface CrawlURIDispositionListener
Parameters:
curi - The relevant CrawlURI

crawledURIDisregard

public void crawledURIDisregard(CrawlURI curi)
Deprecated. 
Description copied from interface: CrawlURIDispositionListener
Notification of a crawled URI that is to be disregarded. Usually this means that the robots.txt file for the relevant site forbids this from being crawled and we are therefore not going to keep it. Other reasons may apply. In all cases this means that it was successfully downloaded but will not be stored.

Specified by:
crawledURIDisregard in interface CrawlURIDispositionListener
Parameters:
curi - The relevant CrawlURI

crawledURIFailure

public void crawledURIFailure(CrawlURI curi)
Deprecated. 
Description copied from interface: CrawlURIDispositionListener
Notification of a failed crawling of a URI. The failure is of a type that precludes retries (either by its very nature or because it has been retried too many times).

Specified by:
crawledURIFailure in interface CrawlURIDispositionListener
Parameters:
curi - The relevant CrawlURI
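
The four disposition callbacks above each report one distinct outcome for a URI. As a rough illustration of how a listener might keep them apart, here is a minimal sketch that tallies each outcome. It is not Heritrix code: DispositionListenerSketch and DispositionTallySketch are hypothetical stand-ins, and the callbacks take a plain String where the real interface passes a CrawlURI.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical stand-in mirroring the four callback names; String replaces CrawlURI. */
interface DispositionListenerSketch {
    void crawledURISuccessful(String uri);
    void crawledURINeedRetry(String uri);
    void crawledURIDisregard(String uri);
    void crawledURIFailure(String uri);
}

/** Counts how often each disposition is reported. */
final class DispositionTallySketch implements DispositionListenerSketch {
    enum Outcome { SUCCESSFUL, NEED_RETRY, DISREGARD, FAILURE }

    private final Map<Outcome, Integer> tally = new EnumMap<>(Outcome.class);

    @Override public void crawledURISuccessful(String uri) { bump(Outcome.SUCCESSFUL); }
    @Override public void crawledURINeedRetry(String uri)  { bump(Outcome.NEED_RETRY); }
    @Override public void crawledURIDisregard(String uri)  { bump(Outcome.DISREGARD); }
    @Override public void crawledURIFailure(String uri)    { bump(Outcome.FAILURE); }

    /** Returns the number of notifications seen for the given outcome. */
    int count(Outcome outcome) {
        return tally.getOrDefault(outcome, 0);
    }

    private void bump(Outcome outcome) {
        tally.merge(outcome, 1, Integer::sum);
    }
}
```

In this sketch a retry is tallied separately from a failure, matching the distinction drawn above between possibly transient problems and failures that preclude further retries.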


Copyright © 2003-2011 Internet Archive. All Rights Reserved.