org.archive.crawler.framework
Class CrawlScope

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Filter
                      extended by org.archive.crawler.framework.CrawlScope
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean
Direct Known Subclasses:
ClassicScope, DecidingScope

public class CrawlScope
extends Filter

A CrawlScope instance defines which URIs are "in" a particular crawl. It is essentially a Filter that decides, from the totality of information available about a CandidateURI/CrawlURI instance, whether that URI should be scheduled for crawling. Dynamic information inherent in the discovery of the URI -- such as the path by which it was discovered -- may be considered. Dynamic information that requires consulting external and potentially volatile sources -- such as current robots.txt requests and the history of attempts to crawl the same URI -- should NOT be considered. Those potentially high-latency decisions should be made at another step.
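
The rule above (decide only from information carried by the URI itself) can be illustrated with a minimal, self-contained sketch. It does not use the Heritrix API: `ScopeSketch`, `Candidate`, `pathFromSeed`, and the one-character-per-hop encoding are hypothetical names and assumptions made for illustration only.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class ScopeSketch {
    /** Hypothetical stand-in for a CandidateURI: a URI plus the hop path that led to it. */
    static final class Candidate {
        final URI uri;
        final String pathFromSeed; // one character per hop, e.g. "LL" (assumed encoding)

        Candidate(String uri, String pathFromSeed) {
            this.uri = URI.create(uri);
            this.pathFromSeed = pathFromSeed;
        }
    }

    private final Set<String> seedHosts = new HashSet<>();
    private final int maxHops;

    ScopeSketch(Set<String> seedHosts, int maxHops) {
        for (String h : seedHosts) {
            this.seedHosts.add(h.toLowerCase(Locale.ROOT));
        }
        this.maxHops = maxHops;
    }

    /** Decides from URI-inherent data only: no robots.txt fetch, no crawl-history lookup. */
    boolean accepts(Candidate c) {
        String host = c.uri.getHost();
        if (host == null) {
            return false; // no host to compare against the seed hosts
        }
        return seedHosts.contains(host.toLowerCase(Locale.ROOT))
                && c.pathFromSeed.length() <= maxHops;
    }
}
```

Both inputs to `accepts` are available the moment the URI is discovered, so the decision never blocks on external state.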

Author:
gojomo
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_NAME
           
static java.lang.String ATTR_REREAD_SEEDS_ON_CONFIG
          Whether every configuration change should trigger a rereading of the original seeds spec/file.
static java.lang.String ATTR_SEEDS
           
static java.lang.Boolean DEFAULT_REREAD_SEEDS_ON_CONFIG
           
protected  java.util.Set<SeedListener> seedListeners
           
 
Fields inherited from class org.archive.crawler.framework.Filter
ATTR_ENABLED
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Constructor Summary
CrawlScope()
          Default constructor.
CrawlScope(java.lang.String name)
          Constructs a new CrawlScope.
 
Method Summary
 boolean addSeed(CandidateURI curi)
          Add a new seed to scope.
 void addSeedListener(SeedListener sl)
           
protected  void checkClose(java.util.Iterator iter)
          Convenience method to close SeedFileIterator, if appropriate.
 java.io.File getSeedfile()
           
 void initialize(CrawlController controller)
          Initialize is called just before the crawler starts to run.
protected  boolean isSameHost(UURI a, UURI b)
           
protected  boolean isSeed(java.lang.Object o)
          Check if a URI is in the seeds.
 void kickUpdate()
          Take note of a situation (such as settings edit) where involved reconfiguration (such as reading from external files) may be necessary.
 void listUsedFiles(java.util.List<java.lang.String> list)
          Those Modules that use files on disk should list them all when this method is called.
 void refreshSeeds()
          Refresh seeds.
 java.util.Iterator<UURI> seedsIterator()
          Gets an iterator over all configured seeds.
 java.util.Iterator<UURI> seedsIterator(java.io.Writer ignoredItemWriter)
          Gets an iterator over all configured seeds.
 java.lang.String toString()
           
 
Methods inherited from class org.archive.crawler.framework.Filter
accepts, getFilterOffPosition, innerAccepts, returnTrueIfMatches
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

ATTR_NAME

public static final java.lang.String ATTR_NAME
See Also:
Constant Field Values

ATTR_SEEDS

public static final java.lang.String ATTR_SEEDS
See Also:
Constant Field Values

ATTR_REREAD_SEEDS_ON_CONFIG

public static final java.lang.String ATTR_REREAD_SEEDS_ON_CONFIG
Whether every configuration change should trigger a rereading of the original seeds spec/file.

See Also:
Constant Field Values

DEFAULT_REREAD_SEEDS_ON_CONFIG

public static final java.lang.Boolean DEFAULT_REREAD_SEEDS_ON_CONFIG

seedListeners

protected java.util.Set<SeedListener> seedListeners

Constructor Detail

CrawlScope

public CrawlScope(java.lang.String name)
Constructs a new CrawlScope.

Parameters:
name - the name is ignored, since it always has to be the value of the constant ATTR_NAME.

CrawlScope

public CrawlScope()
Default constructor.

Method Detail

initialize

public void initialize(CrawlController controller)
Initialize is called just before the crawler starts to run. The settings system is up and initialized, so it can be used. This initialization happens after ComplexType.earlyInitialize(CrawlerSettings).

Parameters:
controller - Controller object.

toString

public java.lang.String toString()
Overrides:
toString in class Filter

refreshSeeds

public void refreshSeeds()
Refresh seeds.


getSeedfile

public java.io.File getSeedfile()
Returns:
Seed list file, or null if there was a problem getting the settings file.

isSeed

protected boolean isSeed(java.lang.Object o)
Check if a URI is in the seeds.

Parameters:
o - the URI to check.
Returns:
true if URI is a seed.

isSameHost

protected boolean isSameHost(UURI a,
                             UURI b)
Parameters:
a - First UURI to compare.
b - Second UURI to compare.
Returns:
True if UURIs are of same host.
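
As a rough illustration of a same-host check, the following sketch uses `java.net.URI` in place of Heritrix's `UURI`. The null handling and the case-insensitive comparison are assumptions, not documented behavior of `isSameHost`.

```java
import java.net.URI;

final class SameHostSketch {
    /** Returns true only when both URIs carry a host and the hosts match, ignoring case. */
    static boolean isSameHost(URI a, URI b) {
        if (a == null || b == null) {
            return false;
        }
        String hostA = a.getHost();
        String hostB = b.getHost();
        // Scheme, port, and path are deliberately left out of the comparison.
        return hostA != null && hostB != null && hostA.equalsIgnoreCase(hostB);
    }
}
```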

listUsedFiles

public void listUsedFiles(java.util.List<java.lang.String> list)
Description copied from class: ModuleType
Those Modules that use files on disk should list them all when this method is called.

Each file (as a string name with full path) should be added to the provided list.

Modules that do not use any files can safely ignore this method.

Overrides:
listUsedFiles in class ModuleType
Parameters:
list - The list to add files to.

kickUpdate

public void kickUpdate()
Take note of a situation (such as settings edit) where involved reconfiguration (such as reading from external files) may be necessary.

Overrides:
kickUpdate in class Filter

seedsIterator

public java.util.Iterator<UURI> seedsIterator()
Gets an iterator over all configured seeds. Subclasses which cache seeds in memory can override with more efficient implementation.

Returns:
Iterator, perhaps over a disk file, of seeds

seedsIterator

public java.util.Iterator<UURI> seedsIterator(java.io.Writer ignoredItemWriter)
Gets an iterator over all configured seeds. Subclasses which cache seeds in memory can override with more efficient implementation.

Parameters:
ignoredItemWriter - optional writer to get ignored seed items report
Returns:
Iterator, perhaps over a disk file, of seeds
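
The role of the optional ignored-items writer can be sketched as follows. This is not the Heritrix implementation: `SeedReaderSketch`, the comment and blank-line rules, and the "absolute URI with a host" acceptance test are all assumptions, and the sketch reads its input eagerly rather than iterating lazily over a disk file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class SeedReaderSketch {
    /** Reads one seed per line; unusable lines go to ignoredItemWriter when it is non-null. */
    static Iterator<URI> seedsIterator(Reader seeds, Writer ignoredItemWriter) {
        List<URI> accepted = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(seeds)) {
            String line;
            while ((line = in.readLine()) != null) {
                String item = line.trim();
                if (item.isEmpty() || item.startsWith("#")) {
                    continue; // blank lines and comments are skipped silently (assumed rule)
                }
                URI parsed = parseSeed(item);
                if (parsed != null) {
                    accepted.add(parsed);
                } else if (ignoredItemWriter != null) {
                    ignoredItemWriter.write(item + "\n"); // report the rejected item
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return accepted.iterator();
    }

    /** Returns the parsed URI, or null when the item is not an absolute URI with a host. */
    private static URI parseSeed(String item) {
        try {
            URI u = new URI(item);
            return (u.isAbsolute() && u.getHost() != null) ? u : null;
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

Passing `null` as the writer mirrors the "optional" wording of the parameter: rejected items are then dropped without a report.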

checkClose

protected void checkClose(java.util.Iterator iter)
Convenience method to close SeedFileIterator, if appropriate.

Parameters:
iter - Iterator to check; it is closed if it is a SeedFileIterator that needs closing.
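
A close-if-needed helper of this kind might look like the sketch below. It tests for `java.io.Closeable` instead of Heritrix's `SeedFileIterator`, and it returns a boolean purely so the effect is observable; both choices, and the `ClosableIterator` class, are assumptions for illustration.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Iterator;

final class CheckCloseSketch {
    /** Hypothetical iterator that holds a resource, standing in for SeedFileIterator. */
    static final class ClosableIterator implements Iterator<String>, Closeable {
        private final Iterator<String> delegate;
        boolean closed = false;

        ClosableIterator(String... items) {
            this.delegate = Arrays.asList(items).iterator();
        }

        @Override public boolean hasNext() { return delegate.hasNext(); }
        @Override public String next() { return delegate.next(); }
        @Override public void close() { closed = true; }
    }

    /** Closes iter when it owns a resource; returns whether a close happened. */
    static boolean checkClose(Iterator<?> iter) {
        if (iter instanceof Closeable) {
            try {
                ((Closeable) iter).close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return true;
        }
        return false; // plain in-memory iterators need no cleanup
    }
}
```

A caller would typically iterate inside `try` and invoke the helper in `finally`, so a file-backed iterator is released even if iteration fails partway.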

addSeed

public boolean addSeed(CandidateURI curi)
Add a new seed to scope. By default, simply appends to seeds file, though subclasses may handle differently.

This method is *not* sufficient to get the new seed scheduled in the Frontier for crawling -- it only affects the Scope's seed record (and decisions which flow from seeds).

Parameters:
curi - CandidateURI to add
Returns:
true if successful, false if add failed for any reason
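
The distinction drawn above, between updating the scope's seed record and scheduling the seed in the Frontier, is sketched below with plain collections. Everything here is hypothetical: `AddSeedSketch`, its `Listener` interface, the in-memory `seedRecord` (the documented default appends to the seeds file instead), and the assumption that registered listeners are told about each added seed, which this page does not document.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class AddSeedSketch {
    /** Hypothetical listener; notification on add is an assumption. */
    interface Listener {
        void addedSeed(String uri);
    }

    private final List<String> seedRecord = new ArrayList<>();
    private final Set<Listener> seedListeners = new LinkedHashSet<>();

    /** Stand-in for the Frontier's queue; addSeed never touches it. */
    final Deque<String> frontierQueue = new ArrayDeque<>();

    void addSeedListener(Listener listener) {
        seedListeners.add(listener);
    }

    /** Records the seed and informs listeners; returns false if the add fails. */
    boolean addSeed(String uri) {
        if (uri == null || uri.trim().isEmpty()) {
            return false;
        }
        String seed = uri.trim();
        seedRecord.add(seed);
        for (Listener listener : seedListeners) {
            listener.addedSeed(seed);
        }
        return true;
    }

    boolean isSeed(String uri) {
        return seedRecord.contains(uri);
    }
}
```

After `addSeed` returns true, `isSeed` reports the URI as a seed while `frontierQueue` is still empty; scheduling the URI for crawling would be a separate step.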

addSeedListener

public void addSeedListener(SeedListener sl)


Copyright © 2003-2011 Internet Archive. All Rights Reserved.