Uses of Class
org.archive.crawler.settings.ModuleType

Packages that use ModuleType
org.archive.crawler.admin Contains classes that the web UI uses to monitor and control crawls. 
org.archive.crawler.datamodel   
org.archive.crawler.datamodel.credential Contains html form login and basic and digest credentials used by Heritrix logging into sites. 
org.archive.crawler.deciderules Provides classes for a simple decision rules framework. 
org.archive.crawler.deciderules.recrawl   
org.archive.crawler.extractor   
org.archive.crawler.fetcher   
org.archive.crawler.filter   
org.archive.crawler.framework   
org.archive.crawler.frontier   
org.archive.crawler.postprocessor   
org.archive.crawler.prefetch   
org.archive.crawler.processor   
org.archive.crawler.processor.recrawl   
org.archive.crawler.scope   
org.archive.crawler.settings Provides classes for the settings framework. 
org.archive.crawler.url.canonicalize   
org.archive.crawler.writer   
 

Uses of ModuleType in org.archive.crawler.admin
 

Subclasses of ModuleType in org.archive.crawler.admin
 class StatisticsTracker
          This is an implementation of the AbstractTracker.
 

Uses of ModuleType in org.archive.crawler.datamodel
 

Subclasses of ModuleType in org.archive.crawler.datamodel
 class CrawlOrder
          Represents the 'root' of the settings hierarchy.
 class CredentialStore
          Front door to the credential store.
 class RobotsHonoringPolicy
          RobotsHonoringPolicy represents the strategy used by the crawler for determining how robots.txt files will be honored.
 

Uses of ModuleType in org.archive.crawler.datamodel.credential
 

Subclasses of ModuleType in org.archive.crawler.datamodel.credential
 class Credential
          Credential type.
 class HtmlFormCredential
          Credential that holds everything needed to do a GET/POST to an HTML form.
 class Rfc2617Credential
          A Basic/Digest auth RFC2617 credential.
 

Uses of ModuleType in org.archive.crawler.deciderules
 

Subclasses of ModuleType in org.archive.crawler.deciderules
 class AcceptDecideRule
          Rule which responds ACCEPT to anything passed in.
 class AddRedirectFromRootServerToScope
           
 class BeanShellDecideRule
          Rule which runs a BeanShell script to make its decision.
 class ClassKeyMatchesRegExpDecideRule
          Rule applies configured decision to any CrawlURI class key -- i.e.
 class ConfiguredDecideRule
          Rule which can be configured to ACCEPT or REJECT at operator's option.
 class ContentTypeMatchesRegExpDecideRule
          DecideRule whose decision is applied if the URI's content-type is present and matches the supplied regular expression.
 class ContentTypeNotMatchesRegExpDecideRule
          DecideRule whose decision is applied if the URI's content-type is present and does not match the supplied regular expression.
 class DecideRule
          Interface for rules which, given an object to evaluate, respond with a decision: DecideRule.ACCEPT, DecideRule.REJECT, or DecideRule.PASS.
 class DecideRuleSequence
          RuleSequence represents a series of Rules, which are applied in turn to give the final result.
 class DecidingFilter
          DecidingFilter: a classic Filter which makes its accept/reject decision based on whatever DecideRules have been set up inside it.
 class DecidingScope
          DecidingScope: a Scope which makes its accept/reject decision based on whatever DecideRules have been set up inside it.
 class ExceedsDocumentLengthTresholdDecideRule
           
 class ExternalGeoLocationDecideRule
          A rule that can be configured to take alternate implementations of the ExternalGeoLocationInterface.
 class ExternalImplDecideRule
          A rule that can be configured to take alternate implementations of the ExternalImplInterface.
 class FetchStatusDecideRule
          Rule applies the configured decision for any URI which has a fetch status equal to the 'target-status' setting.
 class FetchStatusMatchesRegExpDecideRule
           
 class FetchStatusNotMatchesRegExpDecideRule
           
 class FilterDecideRule
          FilterDecideRule wraps a legacy Filter for use in DecideRule contexts.
 class HasViaDecideRule
          Rule applies the configured decision for any URI which has a 'via' (essentially, any URI that was a seed or some kinds of mid-crawl adds).
 class HopsPathMatchesRegExpDecideRule
          Rule applies configured decision to any CrawlURIs whose 'hops-path' (string like "LLXE" etc.) matches the supplied regexp.
 class IsCrossTopmostAssignedSurtHopDecideRule
          Applies its decision if the current URI differs in that portion of its hostname/domain that is assigned/sold by registrars (AKA its 'topmost assigned SURT' or 'public suffix'.)
 class MatchesFilePatternDecideRule
          Compares suffix of a passed CrawlURI, UURI, or String against a regular expression pattern, applying its configured decision to all matches.
 class MatchesListRegExpDecideRule
          Rule applies configured decision to any CrawlURIs whose String URI matches the supplied regexps.
 class MatchesRegExpDecideRule
          Rule applies configured decision to any CrawlURIs whose String URI matches the supplied regexp.
 class NotExceedsDocumentLengthTresholdDecideRule
           
 class NotMatchesFilePatternDecideRule
          Rule applies configured decision to any URIs which do *not* match the supplied (file-pattern) regexp.
 class NotMatchesListRegExpDecideRule
          Rule applies configured decision to any URIs which do *not* match the supplied regexp.
 class NotMatchesRegExpDecideRule
          Rule applies configured decision to any URIs which do *not* match the supplied regexp.
 class NotOnDomainsDecideRule
          Rule applies configured decision to any URIs that are *not* in one of the domains in the configured set of domains, filled from the seed set.
 class NotOnHostsDecideRule
          Rule applies configured decision to any URIs that are *not* on one of the hosts in the configured set of hosts, filled from the seed set.
 class NotSurtPrefixedDecideRule
          Rule applies configured decision to any URIs that, when expressed in SURT form, do *not* begin with one of the prefixes in the configured set.
 class OnDomainsDecideRule
          Rule applies configured decision to any URIs that are on one of the domains in the configured set of domains, filled from the seed set.
 class OnHostsDecideRule
          Rule applies configured decision to any URIs that are on one of the hosts in the configured set of hosts, filled from the seed set.
 class PathologicalPathDecideRule
          Rule REJECTs any URI which contains an excessive number of identical, consecutive path-segments (e.g., http://example.com/a/a/a/boo.html == 3 '/a' segments).
 class PredicatedDecideRule
          Rule which applies the configured decision only if a test evaluates to true.
 class PrerequisiteAcceptDecideRule
          Rule which ACCEPTs all 'prerequisite' URIs (those with a 'P' in the last hopsPath position).
 class QueueOverbudgetDecideRule
          Applies configured decision to every candidate URI that would overbudget its queue.
 class RejectDecideRule
          Rule which answers REJECT to everything evaluated.
 class ScopePlusOneDecideRule
          Rule allows one level of discovery beyond configured scope (e.g.
 class SeedAcceptDecideRule
          Rule which ACCEPTs all 'seed' URIs (those for which isSeed is true).
 class SurtPrefixedDecideRule
          Rule applies configured decision to any URIs that, when expressed in SURT form, begin with one of the prefixes in the configured set.
 class TooManyHopsDecideRule
          Rule REJECTs any CrawlURIs whose total number of hops (length of the hopsPath string, traversed links of any type) is over a threshold.
 class TooManyPathSegmentsDecideRule
          Rule REJECTs any CrawlURIs whose total number of path-segments (as indicated by the count of '/' characters not including the first '//') is over a given threshold.
 class TransclusionDecideRule
          Rule ACCEPTs any CrawlURIs whose path-from-seed ('hopsPath' -- see CandidateURI.getPathFromSeed()) ends with at least one, but not more than, the given number of non-navlink ('L') hops.
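
The ACCEPT/REJECT/PASS model described for DecideRule and DecideRuleSequence above can be illustrated with a minimal, self-contained sketch. It does not use the Heritrix classes: `Decision`, `Rule`, `decide` and `pathSegments` are hypothetical stand-ins, and the policy that the last non-PASS answer wins is an assumption about how "applied in turn" resolves conflicts.

```java
import java.util.ArrayList;
import java.util.List;

final class DecideSequenceSketch {
    /** The three outcomes named in the DecideRule description. */
    enum Decision { ACCEPT, REJECT, PASS }

    /** Hypothetical stand-in for a rule; not the Heritrix DecideRule class. */
    interface Rule {
        Decision decisionFor(String uri);
    }

    /** Applies the rules in order; assumes the last non-PASS answer wins. */
    static Decision decide(List<Rule> rules, String uri) {
        Decision running = Decision.PASS;
        for (Rule rule : rules) {
            Decision answer = rule.decisionFor(uri);
            if (answer != Decision.PASS) {
                running = answer;
            }
        }
        return running;
    }

    /** Counts '/' characters after the first "//", as TooManyPathSegmentsDecideRule describes. */
    static int pathSegments(String uri) {
        int start = uri.indexOf("//");
        int from = start >= 0 ? start + 2 : 0;
        int count = 0;
        for (int i = from; i < uri.length(); i++) {
            if (uri.charAt(i) == '/') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<>();
        rules.add(uri -> Decision.REJECT);                      // start by rejecting everything
        rules.add(uri -> uri.startsWith("http://example.com/")  // accept one site
                ? Decision.ACCEPT : Decision.PASS);
        rules.add(uri -> pathSegments(uri) > 3                  // reject overly deep paths
                ? Decision.REJECT : Decision.PASS);
        System.out.println(decide(rules, "http://example.com/a/b"));        // prints ACCEPT
        System.out.println(decide(rules, "http://example.com/a/b/c/d/e"));  // prints REJECT
    }
}
```

Starting from a blanket REJECT and letting later rules override it keeps each rule small, since a rule only needs to speak up for the cases it cares about and can PASS otherwise.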
 

Uses of ModuleType in org.archive.crawler.deciderules.recrawl
 

Subclasses of ModuleType in org.archive.crawler.deciderules.recrawl
 class IdenticalDigestDecideRule
          Rule applies configured decision to any CrawlURIs whose prior-history content-digest matches the latest fetch.
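
As a rough illustration of the digest comparison this rule relies on, the sketch below hashes the latest content and compares it with a stored digest. It is not the Heritrix implementation: the class and method names are hypothetical, and SHA-1 is chosen here only as an assumed digest algorithm.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DigestCompareSketch {
    /** Computes a SHA-1 digest of the content (the algorithm choice is an assumption). */
    static byte[] sha1(byte[] content) {
        try {
            return MessageDigest.getInstance("SHA-1").digest(content);
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /** Returns true when a prior digest exists and matches the digest of the latest content. */
    static boolean isUnchanged(byte[] priorDigest, byte[] latestContent) {
        return priorDigest != null && MessageDigest.isEqual(priorDigest, sha1(latestContent));
    }
}
```

A missing prior digest is treated as "changed", so a URI with no history is never skipped by mistake.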
 

Uses of ModuleType in org.archive.crawler.extractor
 

Subclasses of ModuleType in org.archive.crawler.extractor
 class AggressiveExtractorHTML
          Extended version of ExtractorHTML with more aggressive javascript link extraction where javascript code is parsed first with general HTML tags regexp, and then by javascript speculative link regexp.
 class ChangeEvaluator
          This processor compares the CrawlURI's current content digest with the one from a previous crawl.
 class Extractor
          Convenience shared superclass for Extractor Processors.
 class ExtractorCSS
          This extractor parses URIs from CSS type files.
 class ExtractorDOC
          This class allows the caller to extract href style links from word97-format word documents.
 class ExtractorHTML
          Basic link-extraction, from an HTML content-body, using regular expressions.
 class ExtractorHTTP
          Extracts URIs from HTTP response headers.
 class ExtractorImpliedURI
          An extractor for finding 'implied' URIs inside other URIs.
 class ExtractorJS
          Processes Javascript files for strings that are likely to be crawlable URIs.
 class ExtractorPDF
          Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs.
 class ExtractorSWF
          Process SWF (flash/shockwave) files for strings that are likely to be crawlable URIs.
 class ExtractorUniversal
          A last ditch extractor that will look at the raw byte code and try to extract anything that looks like a link.
 class ExtractorURI
          An extractor for finding URIs inside other URIs.
 class ExtractorXML
          A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents)
 class HTTPContentDigest
          A processor for calculating custom HTTP content digests in place of the default (if any) computed by the HTTP fetcher processors.
 class JerichoExtractorHTML
          Improved link-extraction from an HTML content-body using jericho-html parser.
 class TrapSuppressExtractor
          Pseudo-extractor that suppresses link-extraction of likely trap pages, by noticing when content's digest is identical to that of its 'via'.
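
To give a feel for regular-expression link extraction of the kind ExtractorHTML is described as doing, here is a deliberately small sketch. The pattern and the `RegexLinkSketch` class are hypothetical and far simpler than anything a production extractor would use; they only pick up quoted `href` and `src` attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexLinkSketch {
    // href= or src=, then a quoted value; group 1 is the quote, group 2 the link text.
    private static final Pattern LINK = Pattern.compile(
            "\\b(?:href|src)\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Returns the quoted href/src values found in the given HTML, in document order. */
    static List<String> extractLinks(CharSequence html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }
}
```

For example, `extractLinks("<a HREF=\"http://example.com/a\">x</a> <img src='pic.png'>")` yields the two values `http://example.com/a` and `pic.png`. Unquoted attributes, relative-URI resolution and script content are all out of scope for this sketch.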
 

Uses of ModuleType in org.archive.crawler.fetcher
 

Subclasses of ModuleType in org.archive.crawler.fetcher
 class FetchDNS
          Processor to resolve 'dns:' URIs.
 class FetchFTP
          Fetches documents and directory listings using FTP.
 class FetchHTTP
          HTTP fetcher that uses Apache Jakarta Commons HttpClient library.
 

Uses of ModuleType in org.archive.crawler.filter
 

Subclasses of ModuleType in org.archive.crawler.filter
 class ContentTypeRegExpFilter
          Deprecated. As of release 1.10.0. To be replaced by an equivalent DecideRule.
 class FilePatternFilter
          Deprecated. As of release 1.10.0. Replaced by MatchesFilePatternDecideRule.
 class HopsFilter
          Deprecated. As of release 1.10.0. Replaced by DecidingFilter and equivalent DecideRule.
 class HTTPMidFetchUnchangedFilter
          A mid fetch filter for HTTP fetcher processors.
 class OrFilter
          Deprecated. As of release 1.10.0. Replaced by DecidingFilter and DecideRule.
 class PathDepthFilter
          Deprecated. As of release 1.10.0. Replaced by DecidingFilter and equivalent DecideRule.
 class PathologicalPathFilter
          Deprecated. As of release 1.10.0. Replaced by DecidingFilter and equivalent DecideRule.
 class SurtPrefixFilter
          Deprecated. As of release 1.10.0. Replaced by DecidingFilter and equivalent DecideRule.
 class TransclusionFilter
          Deprecated. As of release 1.10.0. Replaced by DecidingFilter and equivalent DecideRule.
 class URIListRegExpFilter
          Deprecated. As of release 1.10.0. Replaced by DecidingFilter and equivalent DecideRule.
 class URIRegExpFilter
          Deprecated. As of release 1.10.0. Replaced by DecidingFilter and equivalent DecideRule.
 

Uses of ModuleType in org.archive.crawler.framework
 

Subclasses of ModuleType in org.archive.crawler.framework
 class AbstractTracker
          A partial implementation of the StatisticsTracking interface.
 class CrawlScope
          A CrawlScope instance defines which URIs are "in" a particular crawl.
 class Filter
          Base class for filter classes.
 class Processor
          Base class for URI processing classes.
 class Scoper
          Base class for Scopers.
 class WriterPoolProcessor
          Abstract implementation of a file pool processor.
 

Uses of ModuleType in org.archive.crawler.frontier
 

Subclasses of ModuleType in org.archive.crawler.frontier
 class AbstractFrontier
          Shared facilities for Frontier implementations.
 class AdaptiveRevisitFrontier
          A Frontier that will repeatedly visit all encountered URIs.
 class BdbFrontier
          A Frontier using several BerkeleyDB JE Databases to hold its record of known hosts (queues), and pending URIs.
 class DomainSensitiveFrontier
          Deprecated. As of release 1.10.0. Replaced by BdbFrontier and QuotaEnforcer.
 class WorkQueueFrontier
          A common Frontier base using several queues to hold pending URIs.
 

Uses of ModuleType in org.archive.crawler.postprocessor
 

Subclasses of ModuleType in org.archive.crawler.postprocessor
 class AcceptRevisitProcessor
          Set a URI to be revisited by the ARFrontier.
 class ContentBasedWaitEvaluator
          A WaitEvaluator that compares the CrawlURI's content type to a configurable regular expression.
 class CrawlStateUpdater
          A step, late in the processing of a CrawlURI, for updating the per-host information that may have been affected by the fetch.
 class FrontierScheduler
          'Schedule' with the Frontier CandidateURIs being carried by the passed CrawlURI.
 class ImageWaitEvaluator
          A specialized ContentBasedWaitEvaluator.
 class LinksScoper
          Determine which extracted links are within scope.
 class LowDiskPauseProcessor
          Processor module which uses 'df -k', where available and with the expected output format (on Linux), to monitor available disk space and pause the crawl if free space on monitored filesystems falls below certain thresholds.
 class RejectRevisitProcessor
          Set a URI to not be revisited by the ARFrontier.
 class SupplementaryLinksScoper
          Run CandidateURI links carried in the passed CrawlURI through a filter and 'handle' rejections.
 class TextWaitEvaluator
          A specialized ContentBasedWaitEvaluator.
 class WaitEvaluator
          A processor that determines when a URI should be revisited next.
 

Uses of ModuleType in org.archive.crawler.prefetch
 

Subclasses of ModuleType in org.archive.crawler.prefetch
 class PreconditionEnforcer
          Ensures the preconditions for a fetch -- such as DNS lookup or acquiring and respecting a robots.txt policy -- are satisfied before a URI is passed to subsequent stages.
 class Preselector
          If set to recheck the crawl's scope, gives a yes/no on whether a CrawlURI should be processed at all.
 class QuotaEnforcer
          A simple quota enforcer.
 class RuntimeLimitEnforcer
          A processor to enforce runtime limits on crawls.
 

Uses of ModuleType in org.archive.crawler.processor
 

Subclasses of ModuleType in org.archive.crawler.processor
 class BeanShellProcessor
          A processor which runs a BeanShell script on the CrawlURI.
 class CrawlMapper
          A simple crawl splitter/mapper, dividing up CandidateURIs/CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers).
 class HashCrawlMapper
          Maps URIs to one of N crawler names by applying a hash to the URI's (possibly-transformed) classKey.
 class LexicalCrawlMapper
          A simple crawl splitter/mapper, dividing up CandidateURIs/CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers).
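
The idea behind HashCrawlMapper, mapping a class key to one of N crawler names by hashing, can be sketched in a few lines. This is an illustration only: `HashMapperSketch` and `crawlerFor` are hypothetical names, and `String.hashCode()` is used as a stand-in for whatever hash function the real class applies.

```java
final class HashMapperSketch {
    /** Picks one of crawlerNames for a class key; String.hashCode() is a stand-in hash. */
    static String crawlerFor(String classKey, String[] crawlerNames) {
        // floorMod keeps the bucket non-negative even when hashCode() is negative.
        int bucket = Math.floorMod(classKey.hashCode(), crawlerNames.length);
        return crawlerNames[bucket];
    }
}
```

Because the bucket depends only on the key and the number of crawlers, every crawler in the group computes the same assignment independently, with no coordination needed beyond agreeing on the list of names.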
 

Uses of ModuleType in org.archive.crawler.processor.recrawl
 

Subclasses of ModuleType in org.archive.crawler.processor.recrawl
 class FetchHistoryProcessor
          Maintain a history of fetch information inside the CrawlURI's attributes.
 class PersistLoadProcessor
          Store CrawlURI attributes from latest fetch to persistent storage for consultation by a later recrawl.
 class PersistLogProcessor
          Log CrawlURI attributes from latest fetch for consultation by a later recrawl.
 class PersistOnlineProcessor
          Common superclass for persisting Processors which directly store/load to persistence (as opposed to logging for batch load later).
 class PersistProcessor
          Superclass for Processors which utilize BDB-JE for URI state (including most notably history) persistence.
 class PersistStoreProcessor
          Store CrawlURI attributes from latest fetch to persistent storage for consultation by a later recrawl.
 

Uses of ModuleType in org.archive.crawler.scope
 

Subclasses of ModuleType in org.archive.crawler.scope
 class BroadScope
          A CrawlScope instance defines which URIs are "in" a particular crawl.
 class ClassicScope
          ClassicScope: superclass with shared Scope behavior for most common scopes.
 class DomainScope
          Deprecated. As of release 1.10.0. Replaced by DecidingScope.
 class HostScope
          Deprecated. As of release 1.10.0. Replaced by DecidingScope.
 class PathScope
          Deprecated. As of release 1.10.0. Replaced by DecidingScope.
 class RefinedScope
          Superclass for Scopes which make use of "additional focus" to add items by pattern, or want to swap in alternative transitive filter.
 class SeedCachingScope
          A CrawlScope that caches its seed list for the convenience of scope-tests that are based on the seeds.
 class SurtPrefixScope
          Deprecated. As of release 1.10.0. Replaced by DecidingScope.
 

Uses of ModuleType in org.archive.crawler.settings
 

Methods in org.archive.crawler.settings that return ModuleType
 ModuleType SettingsHandler.getModule(java.lang.String name)
          Get a module by name.
 ModuleType CrawlerSettings.getModule(java.lang.String name)
           
protected  ModuleType CrawlerSettings.getTopLevelModule(java.lang.String name)
           
static ModuleType SettingsHandler.instantiateModuleTypeFromClassName(java.lang.String name, java.lang.String className)
          Instantiate a new ModuleType given its name and className.
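
Creating an object from a name and a class name, as instantiateModuleTypeFromClassName is described as doing, is typically done with reflection. The sketch below shows that general technique under the assumption that the target class has a public constructor taking a single String; `ModuleFactorySketch` is a hypothetical name, and the code does not reference ModuleType itself so that it stays self-contained.

```java
import java.lang.reflect.Constructor;

final class ModuleFactorySketch {
    /** Creates an instance of className through its public (String) constructor. */
    static Object instantiate(String name, String className)
            throws ReflectiveOperationException {
        Class<?> cls = Class.forName(className);                 // load the class by name
        Constructor<?> ctor = cls.getConstructor(String.class);  // assumed (String name) constructor
        return ctor.newInstance(name);
    }
}
```

For instance, `instantiate("robots", "java.lang.StringBuilder")` returns a StringBuilder whose contents are `robots`. A class name that cannot be found surfaces as a ClassNotFoundException, and a class without a matching constructor as a NoSuchMethodException.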
 

Methods in org.archive.crawler.settings with parameters of type ModuleType
protected  void CrawlerSettings.addTopLevelModule(ModuleType module)
           
 

Uses of ModuleType in org.archive.crawler.url.canonicalize
 

Subclasses of ModuleType in org.archive.crawler.url.canonicalize
 class BaseRule
          Base of all rules applied when canonicalizing a URL that are configurable via the Heritrix settings system.
 class FixupQueryStr
          Strip any trailing question mark.
 class LowercaseRule
          Lowercases the URL.
 class RegexRule
          General conversion rule.
 class StripExtraSlashes
           
 class StripSessionCFIDs
          Strip cold fusion session ids.
 class StripSessionIDs
          Strip known session ids.
 class StripUserinfoRule
          Strip any 'userinfo' found on http/https URLs.
 class StripWWWNRule
          Strip any 'www[0-9]*' found on http/https URLs IF they have some path/query component (content after third slash).
 class StripWWWRule
          Strip any 'www' found on http/https URLs, IF they have some path/query component (content after third slash).
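
Three of the rules listed above (LowercaseRule, FixupQueryStr and StripWWWRule) can be approximated as plain string transformations applied one after another. The sketch below is an illustration based only on the one-line descriptions: the class, the regular expression and the order of application are assumptions, not the Heritrix implementations.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class CanonicalizeSketch {
    // "www." right after the scheme, but only when something follows the third slash.
    private static final Pattern WWW = Pattern.compile(
            "^(https?://)www\\.([^/]*/.+)$", Pattern.CASE_INSENSITIVE);

    /** Lowercases the whole URL. */
    static String lowercase(String url) {
        return url.toLowerCase(Locale.ROOT);
    }

    /** Strips a single trailing question mark, if present. */
    static String stripTrailingQuestionMark(String url) {
        return url.endsWith("?") ? url.substring(0, url.length() - 1) : url;
    }

    /** Strips a leading "www." on http/https URLs that have a path/query component. */
    static String stripWww(String url) {
        return WWW.matcher(url).replaceFirst("$1$2");
    }

    /** Applies the three rules in an assumed order. */
    static String canonicalize(String url) {
        return stripWww(stripTrailingQuestionMark(lowercase(url)));
    }
}
```

With these definitions, `canonicalize("HTTP://WWW.Example.com/Path?")` gives `http://example.com/path`, while `http://www.example.com/` is left unchanged because nothing follows the third slash.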
 

Uses of ModuleType in org.archive.crawler.writer
 

Subclasses of ModuleType in org.archive.crawler.writer
 class ARCWriterProcessor
          Processor module for writing the results of successful fetches (and perhaps someday, certain kinds of network failures) to the Internet Archive ARC file format.
 class Kw3WriterProcessor
          Processor module that writes the results of successful fetches to files on disk.
 class MirrorWriterProcessor
          Processor module that writes the results of successful fetches to files on disk.
 class WARCWriterProcessor
          WARCWriterProcessor.
 



Copyright © 2003-2011 Internet Archive. All Rights Reserved.