java.lang.Object
  javax.management.Attribute
    org.archive.crawler.settings.Type
      org.archive.crawler.settings.ComplexType
        org.archive.crawler.settings.ModuleType
          org.archive.crawler.frontier.AbstractFrontier
            org.archive.crawler.frontier.WorkQueueFrontier
              org.archive.crawler.frontier.BdbFrontier
                org.archive.crawler.frontier.DomainSensitiveFrontier
Deprecated. Replaced by BdbFrontier and QuotaEnforcer.
public class DomainSensitiveFrontier
Behaves like BdbFrontier (i.e., a basic, mostly breadth-first frontier), with the addition that you can set the number of documents to download on a per-site basis.

This is useful for frequent revisits of a site that changes frequently. Choose the number of documents you want to download and specify that count in max-docs. If count-per-host is true (the default), the crawler downloads max-docs documents per host. If you create an override, the overridden max-docs count is downloaded instead, whether it is higher or lower.

If count-per-host is false, max-docs acts like the crawl order max-docs, and the crawler downloads only that total number of documents. Overrides download max-docs documents in total within the overridden domain.
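The counting rules described above can be illustrated with a small, self-contained sketch. It is not Heritrix code: the class DocQuotaSketch, its methods, and the choice to key overrides by exact host name are assumptions made purely for illustration, and the real frontier may resolve overrides and counters differently.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the max-docs / count-per-host rules; not part of Heritrix. */
class DocQuotaSketch {
    private static final String GLOBAL = "";          // counter key for the crawl-wide total
    private final long maxDocs;                       // stands in for the max-docs setting
    private final boolean countPerHost;               // stands in for the count-per-host setting
    private final Map<String, Long> overrides = new HashMap<>();   // host -> overridden max-docs
    private final Map<String, Long> downloaded = new HashMap<>();  // counter key -> docs so far

    DocQuotaSketch(long maxDocs, boolean countPerHost) {
        this.maxDocs = maxDocs;
        this.countPerHost = countPerHost;
    }

    /** Registers an overridden max-docs value (assumption: keyed by exact host name). */
    void override(String host, long overriddenMaxDocs) {
        overrides.put(host, overriddenMaxDocs);
    }

    /** Returns true and counts the document if the applicable quota still has room. */
    boolean tryDownload(String host) {
        Long override = overrides.get(host);
        String key;
        long limit;
        if (override != null) {            // an override always gets its own counter and limit
            key = host;
            limit = override;
        } else if (countPerHost) {         // max-docs applies to each host separately
            key = host;
            limit = maxDocs;
        } else {                           // max-docs is a single total for the whole crawl
            key = GLOBAL;
            limit = maxDocs;
        }
        long soFar = downloaded.getOrDefault(key, 0L);
        if (soFar >= limit) {
            return false;
        }
        downloaded.put(key, soFar + 1);
        return true;
    }
}
```

With count-per-host set to true and max-docs set to 2, each host in this sketch accepts two documents and rejects the third, while a host overridden to 1 accepts only one. With count-per-host set to false, the same limit of 2 is shared by all hosts that have no override.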
Nested Class Summary |
---|
Nested classes/interfaces inherited from class org.archive.crawler.frontier.WorkQueueFrontier |
---|
WorkQueueFrontier.WakeTask |
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType |
---|
ComplexType.MBeanAttributeInfoIterator |
Nested classes/interfaces inherited from interface org.archive.crawler.framework.Frontier |
---|
Frontier.FrontierGroup |
Field Summary | |
---|---|
static java.lang.String[] | ATTR_AVAILABLE_MODES Deprecated. |
static java.lang.String | ATTR_COUNTER_MODE Deprecated. |
static java.lang.String | ATTR_MAX_DOCS Deprecated. |
static java.lang.String | COUNT_DOMAIN Deprecated. |
static java.lang.String | COUNT_HOST Deprecated. |
static java.lang.String | COUNT_OVERRIDE Deprecated. |
static java.lang.String | DEFAULT_MODE Deprecated. |
Fields inherited from class org.archive.crawler.frontier.BdbFrontier |
---|
ATTR_DUMP_PENDING_AT_CLOSE, ATTR_INCLUDED, pendingUris |
Fields inherited from class org.archive.crawler.settings.ComplexType |
---|
definition, definitionMap |
Fields inherited from interface org.archive.crawler.framework.Frontier |
---|
ATTR_NAME |
Constructor Summary | |
---|---|
DomainSensitiveFrontier(java.lang.String name) Deprecated. |
Method Summary | |
---|---|
void | crawledURIDisregard(CrawlURI curi) Deprecated. Notification of a crawled URI that is to be disregarded. |
void | crawledURIFailure(CrawlURI curi) Deprecated. Notification of a failed crawling of a URI. |
void | crawledURINeedRetry(CrawlURI curi) Deprecated. Notification of a failed crawl of a URI that will be retried (failure due to possible transient problems). |
void | crawledURISuccessful(CrawlURI curi) Deprecated. Notification of a successfully crawled URI. |
protected void | incrementHostCounters(CrawlURI curi) Deprecated. |
void | initialize(CrawlController c) Deprecated. Initializes the Frontier, given the supplied CrawlController. |
Methods inherited from class org.archive.crawler.frontier.BdbFrontier |
---|
closeQueue, crawlCheckpoint, crawlEnded, createAlreadyIncluded, deserializeAlreadySeen, dumpAllPendingToLog, finalTasks, getInitialMarker, getQueueFor, getQueueFor, getURIsList, getWorkQueues, initQueue, initQueuesOfQueues, reinit, workQueueDataOnDisk |
Methods inherited from class org.archive.crawler.frontier.WorkQueueFrontier |
---|
appendQueueReports, asCrawlUri, averageDepth, congestionRatio, considerIncluded, deepestUri, deleted, deleteURIs, deleteURIs, discoveredUriCount, finished, forceWakeQueues, forget, getGroup, getReports, isEmpty, kickUpdate, next, receive, reportTo, schedule, sendToQueue, singleLineLegend, singleLineReportTo, wakeQueues, wakeQueuesAsIfAtTime |
Methods inherited from class org.archive.crawler.settings.ModuleType |
---|
addElement, listUsedFiles |
Methods inherited from class org.archive.crawler.settings.Type |
---|
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
Methods inherited from class javax.management.Attribute |
---|
getName, hashCode |
Methods inherited from class java.lang.Object |
---|
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
public static final java.lang.String ATTR_MAX_DOCS
public static final java.lang.String ATTR_COUNTER_MODE
public static final java.lang.String COUNT_OVERRIDE
public static final java.lang.String COUNT_HOST
public static final java.lang.String COUNT_DOMAIN
public static final java.lang.String[] ATTR_AVAILABLE_MODES
public static final java.lang.String DEFAULT_MODE
Constructor Detail |
---|
public DomainSensitiveFrontier(java.lang.String name)
Method Detail |
---|
public void initialize(CrawlController c) throws FatalConfigurationException, java.io.IOException
Description copied from class: WorkQueueFrontier
Initializes the Frontier, given the supplied CrawlController.
Specified by: initialize in interface Frontier
Overrides: initialize in class BdbFrontier
Parameters: c - The CrawlController that created the Frontier.
Throws: FatalConfigurationException - If provided settings are illegal or otherwise unusable.
Throws: java.io.IOException - If there is a problem reading settings or seeds file from disk.
See Also: Frontier.initialize(org.archive.crawler.framework.CrawlController)

protected void incrementHostCounters(CrawlURI curi)
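
public void crawledURISuccessful(CrawlURI curi)
Description copied from interface: CrawlURIDispositionListener
Notification of a successfully crawled URI.
Specified by: crawledURISuccessful in interface CrawlURIDispositionListener
Parameters: curi - The relevant CrawlURI

public void crawledURINeedRetry(CrawlURI curi)
Description copied from interface: CrawlURIDispositionListener
Notification of a failed crawl of a URI that will be retried (failure due to possible transient problems).
Specified by: crawledURINeedRetry in interface CrawlURIDispositionListener
Parameters: curi - The relevant CrawlURI

public void crawledURIDisregard(CrawlURI curi)
Description copied from interface: CrawlURIDispositionListener
Notification of a crawled URI that is to be disregarded.
Specified by: crawledURIDisregard in interface CrawlURIDispositionListener
Parameters: curi - The relevant CrawlURI

public void crawledURIFailure(CrawlURI curi)
Description copied from interface: CrawlURIDispositionListener
Notification of a failed crawling of a URI.
Specified by: crawledURIFailure in interface CrawlURIDispositionListener
Parameters: curi - The relevant CrawlURI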
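
The four disposition callbacks and incrementHostCounters suggest a listener that updates per-host counters as crawl results arrive. The sketch below shows one plausible arrangement, assuming that only successful fetches count toward the quota; this page does not confirm that. DispositionListenerSketch and HostCounterSketch are hypothetical stand-ins that take plain URI strings instead of CrawlURI, so the example runs without Heritrix on the classpath.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in mirroring the four callbacks of CrawlURIDispositionListener. */
interface DispositionListenerSketch {
    void crawledURISuccessful(String uri);
    void crawledURINeedRetry(String uri);
    void crawledURIDisregard(String uri);
    void crawledURIFailure(String uri);
}

/** Counts documents per host; assumption: only successful crawls are counted. */
class HostCounterSketch implements DispositionListenerSketch {
    private final Map<String, Long> docsPerHost = new HashMap<>();

    @Override
    public void crawledURISuccessful(String uri) {
        incrementHostCounters(uri);
    }

    @Override
    public void crawledURINeedRetry(String uri) {
        // Retries are not counted in this sketch.
    }

    @Override
    public void crawledURIDisregard(String uri) {
        // Disregarded URIs are not counted in this sketch.
    }

    @Override
    public void crawledURIFailure(String uri) {
        // Failures are not counted in this sketch.
    }

    /** Adds one to the counter of the URI's host; URIs without a host are ignored. */
    protected void incrementHostCounters(String uri) {
        String host = URI.create(uri).getHost();
        if (host != null) {
            docsPerHost.merge(host, 1L, Long::sum);
        }
    }

    /** Returns how many documents have been counted for the given host. */
    long count(String host) {
        return docsPerHost.getOrDefault(host, 0L);
    }
}
```

Keeping the counting logic in a single protected method, as the real class appears to do with incrementHostCounters, lets a subclass change what is counted without touching the callbacks.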