java.lang.Object
  javax.management.Attribute
    org.archive.crawler.settings.Type
      org.archive.crawler.settings.ComplexType
        org.archive.crawler.settings.ModuleType
          org.archive.crawler.framework.Filter
            org.archive.crawler.framework.CrawlScope
public class CrawlScope
A CrawlScope instance defines which URIs are "in" a particular crawl. It is essentially a Filter which determines, looking at the totality of information available about a CandidateURI/CrawlURI instance, if that URI should be scheduled for crawling. Dynamic information inherent in the discovery of the URI -- such as the path by which it was discovered -- may be considered. Dynamic information which requires the consultation of external and potentially volatile information -- such as current robots.txt requests and the history of attempts to crawl the same URI -- should NOT be considered. Those potentially high-latency decisions should be made at another step.
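As an illustration of the distinction drawn above, the following is a minimal sketch of a scope decision that uses only information carried by the candidate itself (its URI and the path by which it was discovered) and consults nothing external. `Candidate`, `SketchScope`, the hop-path string, and the `maxHops` limit are all hypothetical stand-ins invented for this example; they are not the Heritrix classes or their API.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for a candidate URI plus its discovery path. */
final class Candidate {
    final URI uri;
    final String pathFromSeed; // assumed encoding: one character per hop, e.g. "LL"

    Candidate(URI uri, String pathFromSeed) {
        this.uri = uri;
        this.pathFromSeed = pathFromSeed;
    }
}

/** Hypothetical scope: accepts URIs on a seed host within a hop limit. */
final class SketchScope {
    private final Set<String> seedHosts = new HashSet<>();
    private final int maxHops;

    SketchScope(int maxHops) {
        this.maxHops = maxHops;
    }

    /** Records the seed's host; affects later accepts() decisions only. */
    void addSeed(URI seed) {
        if (seed.getHost() != null) {
            seedHosts.add(seed.getHost().toLowerCase(Locale.ROOT));
        }
    }

    /** Decides from the candidate alone: no network, robots.txt, or crawl history. */
    boolean accepts(Candidate c) {
        String host = c.uri.getHost();
        if (host == null) {
            return false;
        }
        return seedHosts.contains(host.toLowerCase(Locale.ROOT))
                && c.pathFromSeed.length() <= maxHops;
    }
}
```

Checks that need volatile external state, such as robots.txt rules or retry history, are deliberately absent from `accepts`; per the class description they belong to a later processing step.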
Nested Class Summary |
---|
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType |
---|
ComplexType.MBeanAttributeInfoIterator |
Field Summary | |
---|---|
static java.lang.String | ATTR_NAME |
static java.lang.String | ATTR_REREAD_SEEDS_ON_CONFIG: Whether every configuration change should trigger a rereading of the original seeds spec/file. |
static java.lang.String | ATTR_SEEDS |
static java.lang.Boolean | DEFAULT_REREAD_SEEDS_ON_CONFIG |
protected java.util.Set<SeedListener> | seedListeners |
Fields inherited from class org.archive.crawler.framework.Filter |
---|
ATTR_ENABLED |
Fields inherited from class org.archive.crawler.settings.ComplexType |
---|
definition, definitionMap |
Constructor Summary | |
---|---|
CrawlScope() | Default constructor. |
CrawlScope(java.lang.String name) | Constructs a new CrawlScope. |
Method Summary | |
---|---|
boolean | addSeed(CandidateURI curi): Add a new seed to scope. |
void | addSeedListener(SeedListener sl) |
protected void | checkClose(java.util.Iterator iter): Convenience method to close SeedFileIterator, if appropriate. |
java.io.File | getSeedfile() |
void | initialize(CrawlController controller): Initialize is called just before the crawler starts to run. |
protected boolean | isSameHost(UURI a, UURI b) |
protected boolean | isSeed(java.lang.Object o): Check if a URI is in the seeds. |
void | kickUpdate(): Take note of a situation (such as settings edit) where involved reconfiguration (such as reading from external files) may be necessary. |
void | listUsedFiles(java.util.List<java.lang.String> list): Those Modules that use files on disk should list them all when this method is called. |
void | refreshSeeds(): Refresh seeds. |
java.util.Iterator<UURI> | seedsIterator(): Gets an iterator over all configured seeds. |
java.util.Iterator<UURI> | seedsIterator(java.io.Writer ignoredItemWriter): Gets an iterator over all configured seeds. |
java.lang.String | toString() |
Methods inherited from class org.archive.crawler.framework.Filter |
---|
accepts, getFilterOffPosition, innerAccepts, returnTrueIfMatches |
Methods inherited from class org.archive.crawler.settings.ModuleType |
---|
addElement |
Methods inherited from class org.archive.crawler.settings.Type |
---|
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
Methods inherited from class javax.management.Attribute |
---|
getName, hashCode |
Methods inherited from class java.lang.Object |
---|
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
public static final java.lang.String ATTR_NAME
public static final java.lang.String ATTR_SEEDS
public static final java.lang.String ATTR_REREAD_SEEDS_ON_CONFIG
public static final java.lang.Boolean DEFAULT_REREAD_SEEDS_ON_CONFIG
protected java.util.Set<SeedListener> seedListeners
Constructor Detail |
---|
public CrawlScope(java.lang.String name)
Constructs a new CrawlScope.
name - the name is ignored since it always has to be the value of the constant ATTR_NAME.
public CrawlScope()
Default constructor.
Method Detail |
---|
public void initialize(CrawlController controller)
Initialize is called just before the crawler starts to run.
See also: ComplexType.earlyInitialize(CrawlerSettings).
controller - Controller object.
public java.lang.String toString()
Overrides: toString in class Filter
public void refreshSeeds()
public java.io.File getSeedfile()
protected boolean isSeed(java.lang.Object o)
o - the URI to check.
protected boolean isSameHost(UURI a, UURI b)
a - First UURI to compare.
b - Second UURI to compare.
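public void listUsedFiles(java.util.List<java.lang.String> list)
Description copied from class: ModuleType
Those Modules that use files on disk should list them all when this method is called.
Each file (as a string name with full path) should be added to the provided list.
Modules that do not use any files can safely ignore this method.
Overrides: listUsedFiles in class ModuleType
list - The list to add files to.
public void kickUpdate()
Take note of a situation (such as settings edit) where involved reconfiguration (such as reading from external files) may be necessary.
Overrides: kickUpdate in class Filter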
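The summary describes kickUpdate() as the hook for settings edits, and ATTR_REREAD_SEEDS_ON_CONFIG as the switch that decides whether a configuration change rereads the seeds. One plausible way the two could fit together is sketched below. This is an assumption about the interaction, not the documented implementation: `KickUpdateSketch`, its in-memory `seedSource` list (standing in for the seeds file), and the `refreshCount` counter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a scope; not the Heritrix class. */
final class KickUpdateSketch {
    private final boolean rereadSeedsOnConfig; // plays the role of the reread setting
    private final List<String> seedSource;     // stands in for the external seeds file
    private List<String> cachedSeeds;
    private int refreshCount = 0;

    KickUpdateSketch(boolean rereadSeedsOnConfig, List<String> seedSource) {
        this.rereadSeedsOnConfig = rereadSeedsOnConfig;
        this.seedSource = seedSource;
        this.cachedSeeds = new ArrayList<>(seedSource); // initial read
    }

    /** Rereads the seed source into the cached seed record. */
    void refreshSeeds() {
        cachedSeeds = new ArrayList<>(seedSource);
        refreshCount++;
    }

    /** Called after a settings edit; rereads seeds only if configured to. */
    void kickUpdate() {
        if (rereadSeedsOnConfig) {
            refreshSeeds();
        }
    }

    List<String> seeds() {
        return cachedSeeds;
    }

    int refreshCount() {
        return refreshCount;
    }
}
```

With the flag off, a settings edit leaves the cached seeds untouched, which avoids a potentially expensive reread of a large seeds file on every edit.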
public java.util.Iterator<UURI> seedsIterator()
public java.util.Iterator<UURI> seedsIterator(java.io.Writer ignoredItemWriter)
ignoredItemWriter - optional writer that receives a report of ignored seed items.
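To make the role of the optional writer concrete, here is a minimal sketch of seed parsing that reports skipped items. The seed-spec format assumed here (one item per line, blank lines and `#` comments skipped silently, anything that is not an absolute URI with a host reported as ignored) is an assumption for illustration, as are the `SeedParsingSketch` class and its use of `java.net.URI` in place of UURI.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical seed parser; not the Heritrix SeedFileIterator. */
final class SeedParsingSketch {

    /**
     * Returns an iterator over usable seeds. Items that cannot be used are
     * written to ignoredItemWriter, one per line, when the writer is non-null.
     */
    static Iterator<URI> seedsIterator(Reader seedSpec, Writer ignoredItemWriter) {
        List<URI> seeds = new ArrayList<>();
        try {
            BufferedReader in = new BufferedReader(seedSpec);
            String line;
            while ((line = in.readLine()) != null) {
                String item = line.trim();
                if (item.isEmpty() || item.startsWith("#")) {
                    continue; // blank or comment: skipped without a report
                }
                URI parsed = parseOrNull(item);
                if (parsed != null && parsed.isAbsolute() && parsed.getHost() != null) {
                    seeds.add(parsed);
                } else if (ignoredItemWriter != null) {
                    ignoredItemWriter.write(item + "\n");
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seeds.iterator();
    }

    /** Parses the item as a URI, or returns null if it is not valid. */
    private static URI parseOrNull(String item) {
        try {
            return new URI(item);
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

Passing `null` for the writer corresponds to the no-argument variant in spirit: the same seeds come back, and rejected items are simply dropped.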
protected void checkClose(java.util.Iterator iter)
Convenience method to close SeedFileIterator, if appropriate.
iter - Iterator to check; if it is a SeedFileIterator, it may need closing.
public boolean addSeed(CandidateURI curi)
Add a new seed to scope. This method is *not* sufficient to get the new seed scheduled in the Frontier for crawling -- it only affects the Scope's seed record (and decisions which flow from seeds).
curi - CandidateURI to add.
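The caveat on addSeed can be pictured as two independent records that a caller must update separately. The sketch below uses hypothetical stand-ins (`ScopeSeedsSketch`, `FrontierSketch`, plain strings for URIs) and does not reproduce the Heritrix Frontier API. The meaning given to the boolean result here (true when the seed was not already recorded) is also an assumption, since the source does not document the return value.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical seed record kept by a scope. */
final class ScopeSeedsSketch {
    private final Set<String> seeds = new LinkedHashSet<>();

    /** Records the seed; assumed to return true if it was newly added. */
    boolean addSeed(String uri) {
        return seeds.add(uri);
    }

    boolean isSeed(String uri) {
        return seeds.contains(uri);
    }
}

/** Hypothetical frontier: the queue of URIs actually waiting to be crawled. */
final class FrontierSketch {
    private final Queue<String> pending = new ArrayDeque<>();

    void schedule(String uri) {
        pending.add(uri);
    }

    int pendingCount() {
        return pending.size();
    }
}
```

In this model, calling `addSeed` alone changes what `isSeed` reports but leaves `pendingCount()` at zero; the URI is only queued once the caller also hands it to the frontier with `schedule`.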
public void addSeedListener(SeedListener sl)