Uses of Class org.archive.crawler.settings.ModuleType (Heritrix 1.15.5-201106092337)

Uses of Class
org.archive.crawler.settings.ModuleType

Packages that use ModuleType
org.archive.crawler.admin	Contains classes that the web UI uses to monitor and control crawls.
org.archive.crawler.datamodel
org.archive.crawler.datamodel.credential	Contains html form login and basic and digest credentials used by Heritrix logging into sites.
org.archive.crawler.deciderules	Provides classes for a simple decision rules framework.
org.archive.crawler.deciderules.recrawl
org.archive.crawler.extractor
org.archive.crawler.fetcher
org.archive.crawler.filter
org.archive.crawler.framework
org.archive.crawler.frontier
org.archive.crawler.postprocessor
org.archive.crawler.prefetch
org.archive.crawler.processor
org.archive.crawler.processor.recrawl
org.archive.crawler.scope
org.archive.crawler.settings	Provides classes for the settings framework.
org.archive.crawler.url.canonicalize
org.archive.crawler.writer

Uses of ModuleType in org.archive.crawler.admin

Subclasses of ModuleType in org.archive.crawler.admin
`class`	`StatisticsTracker` This is an implementation of the AbstractTracker.

Uses of ModuleType in org.archive.crawler.datamodel

Subclasses of ModuleType in org.archive.crawler.datamodel
`class`	`CrawlOrder` Represents the 'root' of the settings hierarchy.
`class`	`CredentialStore` Front door to the credential store.
`class`	`RobotsHonoringPolicy` RobotsHonoringPolicy represents the strategy used by the crawler for determining how robots.txt files will be honored.

Uses of ModuleType in org.archive.crawler.datamodel.credential

Subclasses of ModuleType in org.archive.crawler.datamodel.credential
`class`	`Credential` Credential type.
`class`	`HtmlFormCredential` Credential that holds all needed to do a GET/POST to a HTML form.
`class`	`Rfc2617Credential` A Basic/Digest auth RFC2617 credential.

Uses of ModuleType in org.archive.crawler.deciderules

Subclasses of ModuleType in org.archive.crawler.deciderules
`class`	`AcceptDecideRule` Rule which responds ACCEPT to anything passed in.
`class`	`AddRedirectFromRootServerToScope`
`class`	`BeanShellDecideRule` Rule which runs a BeanShell script to make its decision.
`class`	`ClassKeyMatchesRegExpDecideRule` Rule applies configured decision to any CrawlURI class key -- i.e.
`class`	`ConfiguredDecideRule` Rule which can be configured to ACCEPT or REJECT at operator's option.
`class`	`ContentTypeMatchesRegExpDecideRule` DecideRule whose decision is applied if the URI's content-type is present and matches the supplied regular expression.
`class`	`ContentTypeNotMatchesRegExpDecideRule` DecideRule whose decision is applied if the URI's content-type is present and does not match the supplied regular expression.
`class`	`DecideRule` Interface for rules which, given an object to evaluate, respond with a decision: `DecideRule.ACCEPT`, `DecideRule.REJECT`, or `DecideRule.PASS`.
`class`	`DecideRuleSequence` RuleSequence represents a series of Rules, which are applied in turn to give the final result.
`class`	`DecidingFilter` DecidingFilter: a classic Filter which makes its accept/reject decision based on whatever `DecideRule`s have been set up inside it.
`class`	`DecidingScope` DecidingScope: a Scope which makes its accept/reject decision based on whatever DecideRules have been set up inside it.
`class`	`ExceedsDocumentLengthTresholdDecideRule`
`class`	`ExternalGeoLocationDecideRule` A rule that can be configured to take alternate implementations of the ExternalGeoLocationInterface.
`class`	`ExternalImplDecideRule` A rule that can be configured to take alternate implementations of the ExternalImplInterface.
`class`	`FetchStatusDecideRule` Rule applies the configured decision for any URI which has a fetch status equal to the 'target-status' setting.
`class`	`FetchStatusMatchesRegExpDecideRule`
`class`	`FetchStatusNotMatchesRegExpDecideRule`
`class`	`FilterDecideRule` FilterDecideRule wraps a legacy Filter for use in DecideRule contexts.
`class`	`HasViaDecideRule` Rule applies the configured decision for any URI which has a 'via' (essentially, any URI that was a seed or some kinds of mid-crawl adds).
`class`	`HopsPathMatchesRegExpDecideRule` Rule applies configured decision to any CrawlURIs whose 'hops-path' (string like "LLXE" etc.) matches the supplied regexp.
`class`	`IsCrossTopmostAssignedSurtHopDecideRule` Applies its decision if the current URI differs in that portion of its hostname/domain that is assigned/sold by registrars (AKA its 'topmost assigned SURT' or 'public suffix'.)
`class`	`MatchesFilePatternDecideRule` Compares suffix of a passed CrawlURI, UURI, or String against a regular expression pattern, applying its configured decision to all matches.
`class`	`MatchesListRegExpDecideRule` Rule applies configured decision to any CrawlURIs whose String URI matches the supplied regexps.
`class`	`MatchesRegExpDecideRule` Rule applies configured decision to any CrawlURIs whose String URI matches the supplied regexp.
`class`	`NotExceedsDocumentLengthTresholdDecideRule`
`class`	`NotMatchesFilePatternDecideRule` Rule applies configured decision to any URIs which do not match the supplied (file-pattern) regexp.
`class`	`NotMatchesListRegExpDecideRule` Rule applies configured decision to any URIs which do not match the supplied regexp.
`class`	`NotMatchesRegExpDecideRule` Rule applies configured decision to any URIs which do not match the supplied regexp.
`class`	`NotOnDomainsDecideRule` Rule applies configured decision to any URIs that are not in one of the domains in the configured set of domains, filled from the seed set.
`class`	`NotOnHostsDecideRule` Rule applies configured decision to any URIs that are not on one of the hosts in the configured set of hosts, filled from the seed set.
`class`	`NotSurtPrefixedDecideRule` Rule applies configured decision to any URIs that, when expressed in SURT form, do not begin with one of the prefixes in the configured set.
`class`	`OnDomainsDecideRule` Rule applies configured decision to any URIs that are on one of the domains in the configured set of domains, filled from the seed set.
`class`	`OnHostsDecideRule` Rule applies configured decision to any URIs that are on one of the hosts in the configured set of hosts, filled from the seed set.
`class`	`PathologicalPathDecideRule` Rule REJECTs any URI which contains an excessive number of identical, consecutive path-segments (eg http://example.com/a/a/a/boo.html == 3 '/a' segments)
`class`	`PredicatedDecideRule` Rule which applies the configured decision only if a test evaluates to true.
`class`	`PrerequisiteAcceptDecideRule` Rule which ACCEPTs all 'prerequisite' URIs (those with a 'P' in the last hopsPath position).
`class`	`QueueOverbudgetDecideRule` Applies configured decision to every candidate URI that would overbudget its queue.
`class`	`RejectDecideRule` Rule which answers REJECT to everything evaluated.
`class`	`ScopePlusOneDecideRule` Rule allows one level of discovery beyond configured scope (e.g.
`class`	`SeedAcceptDecideRule` Rule which ACCEPTs all 'seed' URIs (those for which isSeed is true).
`class`	`SurtPrefixedDecideRule` Rule applies configured decision to any URIs that, when expressed in SURT form, begin with one of the prefixes in the configured set.
`class`	`TooManyHopsDecideRule` Rule REJECTs any CrawlURIs whose total number of hops (length of the hopsPath string, traversed links of any type) is over a threshold.
`class`	`TooManyPathSegmentsDecideRule` Rule REJECTs any CrawlURIs whose total number of path-segments (as indicated by the count of '/' characters not including the first '//') is over a given threshold.
`class`	`TransclusionDecideRule` Rule ACCEPTs any CrawlURIs whose path-from-seed ('hopsPath' -- see `CandidateURI.getPathFromSeed()`) ends with at least one, but not more than, the given number of non-navlink ('L') hops.
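
The three-valued decision model described for `DecideRule` and `DecideRuleSequence` above can be illustrated with a small self-contained sketch. It does not use the Heritrix classes: `DecideSequenceSketch`, `Rule`, `Decision`, `evaluate`, `pathSegments`, and `sampleRules` are hypothetical stand-ins, and the combining policy (the last non-PASS decision wins) is an assumption, since the table only says the rules "are applied in turn". The segment count follows the wording given for `TooManyPathSegmentsDecideRule`; the threshold of 3 is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the Heritrix classes.
class DecideSequenceSketch {
    enum Decision { ACCEPT, REJECT, PASS }

    interface Rule {
        Decision decisionFor(String uri);
    }

    // Assumed policy: rules run in order, PASS leaves the running answer
    // untouched, and any ACCEPT or REJECT overrides the earlier answer.
    static Decision evaluate(List<Rule> rules, String uri) {
        Decision running = Decision.PASS;
        for (Rule rule : rules) {
            Decision d = rule.decisionFor(uri);
            if (d != Decision.PASS) {
                running = d;
            }
        }
        return running;
    }

    // Counts '/' characters, not including the first "//" (after the scheme).
    static int pathSegments(String uri) {
        int slashes = uri.indexOf("//");
        int start = (slashes < 0) ? 0 : slashes + 2;
        int count = 0;
        for (int i = start; i < uri.length(); i++) {
            if (uri.charAt(i) == '/') {
                count++;
            }
        }
        return count;
    }

    // A sample sequence: reject everything, re-accept one site,
    // then reject URIs with more than 3 path segments.
    static List<Rule> sampleRules() {
        List<Rule> rules = new ArrayList<>();
        rules.add(uri -> Decision.REJECT);
        rules.add(uri -> uri.startsWith("http://example.com/")
                ? Decision.ACCEPT : Decision.PASS);
        rules.add(uri -> pathSegments(uri) > 3
                ? Decision.REJECT : Decision.PASS);
        return rules;
    }

    public static void main(String[] args) {
        List<Rule> rules = sampleRules();
        System.out.println(evaluate(rules, "http://example.com/a/b.html"));
        System.out.println(evaluate(rules, "http://example.com/a/b/c/d/e.html"));
        System.out.println(evaluate(rules, "http://other.org/"));
    }
}
```

Under these assumptions, a leading reject-all rule followed by narrower accept rules behaves like a whitelist, while later reject rules can still veto URIs that an earlier rule accepted.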

Uses of ModuleType in org.archive.crawler.deciderules.recrawl

Subclasses of ModuleType in org.archive.crawler.deciderules.recrawl
`class`	`IdenticalDigestDecideRule` Rule applies configured decision to any CrawlURIs whose prior-history content-digest matches the latest fetch.

Uses of ModuleType in org.archive.crawler.extractor

Subclasses of ModuleType in org.archive.crawler.extractor
`class`	`AggressiveExtractorHTML` Extended version of ExtractorHTML with more aggressive JavaScript link extraction, where JavaScript code is parsed first with the general HTML tags regexp, and then by the JavaScript speculative link regexp.
`class`	`ChangeEvaluator` This processor compares the CrawlURI's current `content digest` with the one from a previous crawl.
`class`	`Extractor` Convenience shared superclass for Extractor Processors.
`class`	`ExtractorCSS` This extractor parses URIs from CSS files.
`class`	`ExtractorDOC` This class allows the caller to extract href-style links from Word 97-format Word documents.
`class`	`ExtractorHTML` Basic link-extraction, from an HTML content-body, using regular expressions.
`class`	`ExtractorHTTP` Extracts URIs from HTTP response headers.
`class`	`ExtractorImpliedURI` An extractor for finding 'implied' URIs inside other URIs.
`class`	`ExtractorJS` Processes Javascript files for strings that are likely to be crawlable URIs.
`class`	`ExtractorPDF` Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs.
`class`	`ExtractorSWF` Process SWF (flash/shockwave) files for strings that are likely to be crawlable URIs.
`class`	`ExtractorUniversal` A last ditch extractor that will look at the raw byte code and try to extract anything that looks like a link.
`class`	`ExtractorURI` An extractor for finding URIs inside other URIs.
`class`	`ExtractorXML` A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents).
`class`	`HTTPContentDigest` A processor for calculating custom HTTP content digests in place of the default (if any) computed by the HTTP fetcher processors.
`class`	`JerichoExtractorHTML` Improved link-extraction from an HTML content-body using jericho-html parser.
`class`	`TrapSuppressExtractor` Pseudo-extractor that suppresses link-extraction of likely trap pages, by noticing when content's digest is identical to that of its 'via'.

Uses of ModuleType in org.archive.crawler.fetcher

Subclasses of ModuleType in org.archive.crawler.fetcher
`class`	`FetchDNS` Processor to resolve 'dns:' URIs.
`class`	`FetchFTP` Fetches documents and directory listings using FTP.
`class`	`FetchHTTP` HTTP fetcher that uses Apache Jakarta Commons HttpClient library.

Uses of ModuleType in org.archive.crawler.filter

Subclasses of ModuleType in org.archive.crawler.filter
`class`	`ContentTypeRegExpFilter` Deprecated. As of release 1.10.0. To be replaced by an equivalent `DecideRule`.
`class`	`FilePatternFilter` Deprecated. As of release 1.10.0. Replaced by `MatchesFilePatternDecideRule`.
`class`	`HopsFilter` Deprecated. As of release 1.10.0. Replaced by `DecidingFilter` and equivalent `DecideRule`.
`class`	`HTTPMidFetchUnchangedFilter` A mid fetch filter for HTTP fetcher processors.
`class`	`OrFilter` Deprecated. As of release 1.10.0. Replaced by `DecidingFilter` and `DecideRule`.
`class`	`PathDepthFilter` Deprecated. As of release 1.10.0. Replaced by `DecidingFilter` and equivalent `DecideRule`.
`class`	`PathologicalPathFilter` Deprecated. As of release 1.10.0. Replaced by `DecidingFilter` and equivalent `DecideRule`.
`class`	`SurtPrefixFilter` Deprecated. As of release 1.10.0. Replaced by `DecidingFilter` and equivalent `DecideRule`.
`class`	`TransclusionFilter` Deprecated. As of release 1.10.0. Replaced by `DecidingFilter` and equivalent `DecideRule`.
`class`	`URIListRegExpFilter` Deprecated. As of release 1.10.0. Replaced by `DecidingFilter` and equivalent `DecideRule`.
`class`	`URIRegExpFilter` Deprecated. As of release 1.10.0. Replaced by `DecidingFilter` and equivalent `DecideRule`.

Uses of ModuleType in org.archive.crawler.framework

Subclasses of ModuleType in org.archive.crawler.framework
`class`	`AbstractTracker` A partial implementation of the StatisticsTracking interface.
`class`	`CrawlScope` A CrawlScope instance defines which URIs are "in" a particular crawl.
`class`	`Filter` Base class for filter classes.
`class`	`Processor` Base class for URI processing classes.
`class`	`Scoper` Base class for Scopers.
`class`	`WriterPoolProcessor` Abstract implementation of a file pool processor.

Uses of ModuleType in org.archive.crawler.frontier

Subclasses of ModuleType in org.archive.crawler.frontier
`class`	`AbstractFrontier` Shared facilities for Frontier implementations.
`class`	`AdaptiveRevisitFrontier` A Frontier that will repeatedly visit all encountered URIs.
`class`	`BdbFrontier` A Frontier using several BerkeleyDB JE Databases to hold its record of known hosts (queues), and pending URIs.
`class`	`DomainSensitiveFrontier` Deprecated. As of release 1.10.0. Replaced by `BdbFrontier` and `QuotaEnforcer`.
`class`	`WorkQueueFrontier` A common Frontier base using several queues to hold pending URIs.

Uses of ModuleType in org.archive.crawler.postprocessor

Subclasses of ModuleType in org.archive.crawler.postprocessor
`class`	`AcceptRevisitProcessor` Set a URI to be revisited by the ARFrontier.
`class`	`ContentBasedWaitEvaluator` A WaitEvaluator that compares the CrawlURI's content type to a configurable regular expression.
`class`	`CrawlStateUpdater` A step, late in the processing of a CrawlURI, for updating the per-host information that may have been affected by the fetch.
`class`	`FrontierScheduler` 'Schedule' with the Frontier CandidateURIs being carried by the passed CrawlURI.
`class`	`ImageWaitEvaluator` A specialized ContentBasedWaitEvaluator.
`class`	`LinksScoper` Determine which extracted links are within scope.
`class`	`LowDiskPauseProcessor` Processor module which uses 'df -k', where available and with the expected output format (on Linux), to monitor available disk space and pause the crawl if free space on monitored filesystems falls below certain thresholds.
`class`	`RejectRevisitProcessor` Set a URI to not be revisited by the ARFrontier.
`class`	`SupplementaryLinksScoper` Run CandidateURI links carried in the passed CrawlURI through a filter and 'handle' rejections.
`class`	`TextWaitEvaluator` A specialized ContentBasedWaitEvaluator.
`class`	`WaitEvaluator` A processor that determines when a URI should be revisited next.

Uses of ModuleType in org.archive.crawler.prefetch

Subclasses of ModuleType in org.archive.crawler.prefetch
`class`	`PreconditionEnforcer` Ensures the preconditions for a fetch -- such as DNS lookup or acquiring and respecting a robots.txt policy -- are satisfied before a URI is passed to subsequent stages.
`class`	`Preselector` If set to recheck the crawl's scope, gives a yes/no on whether a CrawlURI should be processed at all.
`class`	`QuotaEnforcer` A simple quota enforcer.
`class`	`RuntimeLimitEnforcer` A processor to enforce runtime limits on crawls.

Uses of ModuleType in org.archive.crawler.processor

Subclasses of ModuleType in org.archive.crawler.processor
`class`	`BeanShellProcessor` A processor which runs a BeanShell script on the CrawlURI.
`class`	`CrawlMapper` A simple crawl splitter/mapper, dividing up CandidateURIs/CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers).
`class`	`HashCrawlMapper` Maps URIs to one of N crawler names by applying a hash to the URI's (possibly-transformed) classKey.
`class`	`LexicalCrawlMapper` A simple crawl splitter/mapper, dividing up CandidateURIs/CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers).
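
The idea behind `HashCrawlMapper`, mapping a classKey to one of N crawlers by hashing, can be sketched without any Heritrix dependency. The names `HashMapperSketch`, `bucketFor`, and `crawlerNameFor`, as well as the `crawler-<n>` naming scheme, are hypothetical. `String.hashCode()` is used here only as a convenient stand-in; this page does not say which hash function the real class applies.

```java
// Hypothetical illustration; NOT the Heritrix HashCrawlMapper implementation.
class HashMapperSketch {

    // Maps a classKey to a bucket in [0, crawlerCount).
    // Math.floorMod keeps the result non-negative even when hashCode() is negative.
    static int bucketFor(String classKey, int crawlerCount) {
        if (crawlerCount <= 0) {
            throw new IllegalArgumentException("crawlerCount must be positive");
        }
        return Math.floorMod(classKey.hashCode(), crawlerCount);
    }

    // Assumed naming scheme for the N crawlers: crawler-0 .. crawler-(N-1).
    static String crawlerNameFor(String classKey, int crawlerCount) {
        return "crawler-" + bucketFor(classKey, crawlerCount);
    }

    public static void main(String[] args) {
        String[] keys = { "example.com", "archive.org", "example.org" };
        for (String key : keys) {
            System.out.println(key + " -> " + crawlerNameFor(key, 4));
        }
    }
}
```

Because the mapping depends only on the key and N, every crawler in the group would compute the same owner for a given classKey without coordination; changing N reshuffles most assignments.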

Uses of ModuleType in org.archive.crawler.processor.recrawl

Subclasses of ModuleType in org.archive.crawler.processor.recrawl
`class`	`FetchHistoryProcessor` Maintain a history of fetch information inside the CrawlURI's attributes.
`class`	`PersistLoadProcessor` Store CrawlURI attributes from latest fetch to persistent storage for consultation by a later recrawl.
`class`	`PersistLogProcessor` Log CrawlURI attributes from latest fetch for consultation by a later recrawl.
`class`	`PersistOnlineProcessor` Common superclass for persisting Processors which directly store/load to persistence (as opposed to logging for batch load later).
`class`	`PersistProcessor` Superclass for Processors which utilize BDB-JE for URI state (including most notably history) persistence.
`class`	`PersistStoreProcessor` Store CrawlURI attributes from latest fetch to persistent storage for consultation by a later recrawl.

Uses of ModuleType in org.archive.crawler.scope

Subclasses of ModuleType in org.archive.crawler.scope
`class`	`BroadScope` A CrawlScope instance defines which URIs are "in" a particular crawl.
`class`	`ClassicScope` ClassicScope: superclass with shared Scope behavior for most common scopes.
`class`	`DomainScope` Deprecated. As of release 1.10.0. Replaced by `DecidingScope`.
`class`	`HostScope` Deprecated. As of release 1.10.0. Replaced by `DecidingScope`.
`class`	`PathScope` Deprecated. As of release 1.10.0. Replaced by `DecidingScope`.
`class`	`RefinedScope` Superclass for Scopes which make use of "additional focus" to add items by pattern, or want to swap in alternative transitive filter.
`class`	`SeedCachingScope` A CrawlScope that caches its seed list for the convenience of scope-tests that are based on the seeds.
`class`	`SurtPrefixScope` Deprecated. As of release 1.10.0. Replaced by `DecidingScope`.

Uses of ModuleType in org.archive.crawler.settings

Methods in org.archive.crawler.settings that return ModuleType
`ModuleType`	`SettingsHandler.getModule(java.lang.String name)` Get a module by name.
`ModuleType`	`CrawlerSettings.getModule(java.lang.String name)`
`protected ModuleType`	`CrawlerSettings.getTopLevelModule(java.lang.String name)`
`static ModuleType`	`SettingsHandler.instantiateModuleTypeFromClassName(java.lang.String name, java.lang.String className)` Instantiate a new ModuleType given its name and className.

Methods in org.archive.crawler.settings with parameters of type ModuleType
`protected void`	`CrawlerSettings.addTopLevelModule(ModuleType module)`

Uses of ModuleType in org.archive.crawler.url.canonicalize

Subclasses of ModuleType in org.archive.crawler.url.canonicalize
`class`	`BaseRule` Base of all rules applied canonicalizing a URL that are configurable via the Heritrix settings system.
`class`	`FixupQueryStr` Strip any trailing question mark.
`class`	`LowercaseRule` Lowercases the URL.
`class`	`RegexRule` General conversion rule.
`class`	`StripExtraSlashes`
`class`	`StripSessionCFIDs` Strip cold fusion session ids.
`class`	`StripSessionIDs` Strip known session ids.
`class`	`StripUserinfoRule` Strip any 'userinfo' found on http/https URLs.
`class`	`StripWWWNRule` Strip any 'www[0-9]*' found on http/https URLs IF they have some path/query component (content after third slash).
`class`	`StripWWWRule` Strip any 'www' found on http/https URLs, IF they have some path/query component (content after third slash).
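
To make the canonicalization descriptions above more concrete, here is a rough sketch of three of the listed behaviors (lowercasing, stripping 'www', and stripping a trailing question mark) chained together. It is written against plain `java.lang.String` rather than the Heritrix rule classes; `CanonicalizeSketch` and its method names are hypothetical, and both the regular expression and the order in which the steps run are assumptions.

```java
import java.util.Locale;

// Hypothetical illustration; NOT the Heritrix canonicalization rule classes.
class CanonicalizeSketch {

    // Lowercases the whole URL, as the LowercaseRule summary describes.
    static String lowercase(String url) {
        return url.toLowerCase(Locale.ROOT);
    }

    // Strips a leading "www." on http/https URLs, but only if some
    // path/query content follows the third slash.
    static String stripWww(String url) {
        return url.replaceFirst("^(https?://)www\\.([^/]*/.+)$", "$1$2");
    }

    // Strips a single trailing question mark.
    static String stripTrailingQuestionMark(String url) {
        return url.endsWith("?") ? url.substring(0, url.length() - 1) : url;
    }

    // Assumed order: lowercase first, then strip 'www', then fix the query string.
    static String canonicalize(String url) {
        return stripTrailingQuestionMark(stripWww(lowercase(url)));
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("HTTP://WWW.Example.com/Index.html?"));
        System.out.println(canonicalize("http://www.example.com/"));
    }
}
```

In this sketch `http://www.example.com/` is left unchanged because nothing follows the third slash, which mirrors the "IF they have some path/query component" condition in the `StripWWWRule` summary.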

Uses of ModuleType in org.archive.crawler.writer

Subclasses of ModuleType in org.archive.crawler.writer
`class`	`ARCWriterProcessor` Processor module for writing the results of successful fetches (and perhaps someday, certain kinds of network failures) to the Internet Archive ARC file format.
`class`	`Kw3WriterProcessor` Processor module that writes the results of successful fetches to files on disk.
`class`	`MirrorWriterProcessor` Processor module that writes the results of successful fetches to files on disk.
`class`	`WARCWriterProcessor` WARCWriterProcessor.
