Packages that use ModuleType | |
---|---|
org.archive.crawler.admin | Contains classes that the web UI uses to monitor and control crawls. |
org.archive.crawler.datamodel | |
org.archive.crawler.datamodel.credential | Contains html form login and basic and digest credentials used by Heritrix logging into sites. |
org.archive.crawler.deciderules | Provides classes for a simple decision rules framework. |
org.archive.crawler.deciderules.recrawl | |
org.archive.crawler.extractor | |
org.archive.crawler.fetcher | |
org.archive.crawler.filter | |
org.archive.crawler.framework | |
org.archive.crawler.frontier | |
org.archive.crawler.postprocessor | |
org.archive.crawler.prefetch | |
org.archive.crawler.processor | |
org.archive.crawler.processor.recrawl | |
org.archive.crawler.scope | |
org.archive.crawler.settings | Provides classes for the settings framework. |
org.archive.crawler.url.canonicalize | |
org.archive.crawler.writer | |
Uses of ModuleType in org.archive.crawler.admin |
---|
Subclasses of ModuleType in org.archive.crawler.admin | |
---|---|
class |
StatisticsTracker
This is an implementation of the AbstractTracker. |
Uses of ModuleType in org.archive.crawler.datamodel |
---|
Subclasses of ModuleType in org.archive.crawler.datamodel | |
---|---|
class |
CrawlOrder
Represents the 'root' of the settings hierarchy. |
class |
CredentialStore
Front door to the credential store. |
class |
RobotsHonoringPolicy
RobotsHonoringPolicy represents the strategy used by the crawler for determining how robots.txt files will be honored. |
Uses of ModuleType in org.archive.crawler.datamodel.credential |
---|
Subclasses of ModuleType in org.archive.crawler.datamodel.credential | |
---|---|
class |
Credential
Credential type. |
class |
HtmlFormCredential
Credential that holds all needed to do a GET/POST to a HTML form. |
class |
Rfc2617Credential
A Basic/Digest auth RFC2617 credential. |
Uses of ModuleType in org.archive.crawler.deciderules |
---|
Subclasses of ModuleType in org.archive.crawler.deciderules | |
---|---|
class |
AcceptDecideRule
Rule which responds ACCEPT to anything passed in. |
class |
AddRedirectFromRootServerToScope
|
class |
BeanShellDecideRule
Rule which runs a BeanShell script to make its decision. |
class |
ClassKeyMatchesRegExpDecideRule
Rule applies configured decision to any CrawlURI class key -- i.e. |
class |
ConfiguredDecideRule
Rule which can be configured to ACCEPT or REJECT at operator's option. |
class |
ContentTypeMatchesRegExpDecideRule
DecideRule whose decision is applied if the URI's content-type is present and matches the supplied regular expression. |
class |
ContentTypeNotMatchesRegExpDecideRule
DecideRule whose decision is applied if the URI's content-type is present and does not match the supplied regular expression. |
class |
DecideRule
Interface for rules which, given an object to evaluate, respond with a decision: DecideRule.ACCEPT, DecideRule.REJECT, or DecideRule.PASS. |
class |
DecideRuleSequence
RuleSequence represents a series of Rules, which are applied in turn to give the final result. |
class |
DecidingFilter
DecidingFilter: a classic Filter which makes its accept/reject decision based on whatever DecideRules have been set up inside it. |
class |
DecidingScope
DecidingScope: a Scope which makes its accept/reject decision based on whatever DecideRules have been set up inside it. |
class |
ExceedsDocumentLengthTresholdDecideRule
|
class |
ExternalGeoLocationDecideRule
A rule that can be configured to take alternate implementations of the ExternalGeoLocationInterface. |
class |
ExternalImplDecideRule
A rule that can be configured to take alternate implementations of the ExternalImplInterface. |
class |
FetchStatusDecideRule
Rule applies the configured decision for any URI which has a fetch status equal to the 'target-status' setting. |
class |
FetchStatusMatchesRegExpDecideRule
|
class |
FetchStatusNotMatchesRegExpDecideRule
|
class |
FilterDecideRule
FilterDecideRule wraps a legacy Filter for use in DecideRule contexts. |
class |
HasViaDecideRule
Rule applies the configured decision for any URI which has a 'via' (essentially, any URI that was not a seed or one of some kinds of mid-crawl adds). |
class |
HopsPathMatchesRegExpDecideRule
Rule applies configured decision to any CrawlURIs whose 'hops-path' (string like "LLXE" etc.) matches the supplied regexp. |
class |
IsCrossTopmostAssignedSurtHopDecideRule
Applies its decision if the current URI differs in that portion of its hostname/domain that is assigned/sold by registrars (AKA its 'topmost assigned SURT' or 'public suffix'.) |
class |
MatchesFilePatternDecideRule
Compares suffix of a passed CrawlURI, UURI, or String against a regular expression pattern, applying its configured decision to all matches. |
class |
MatchesListRegExpDecideRule
Rule applies configured decision to any CrawlURIs whose String URI matches the supplied regexps. |
class |
MatchesRegExpDecideRule
Rule applies configured decision to any CrawlURIs whose String URI matches the supplied regexp. |
class |
NotExceedsDocumentLengthTresholdDecideRule
|
class |
NotMatchesFilePatternDecideRule
Rule applies configured decision to any URIs which do *not* match the supplied (file-pattern) regexp. |
class |
NotMatchesListRegExpDecideRule
Rule applies configured decision to any URIs which do *not* match the supplied regexp. |
class |
NotMatchesRegExpDecideRule
Rule applies configured decision to any URIs which do *not* match the supplied regexp. |
class |
NotOnDomainsDecideRule
Rule applies configured decision to any URIs that are *not* in one of the domains in the configured set of domains, filled from the seed set. |
class |
NotOnHostsDecideRule
Rule applies configured decision to any URIs that are *not* on one of the hosts in the configured set of hosts, filled from the seed set. |
class |
NotSurtPrefixedDecideRule
Rule applies configured decision to any URIs that, when expressed in SURT form, do *not* begin with one of the prefixes in the configured set. |
class |
OnDomainsDecideRule
Rule applies configured decision to any URIs that are on one of the domains in the configured set of domains, filled from the seed set. |
class |
OnHostsDecideRule
Rule applies configured decision to any URIs that are on one of the hosts in the configured set of hosts, filled from the seed set. |
class |
PathologicalPathDecideRule
Rule REJECTs any URI which contains an excessive number of identical, consecutive path-segments (eg http://example.com/a/a/a/boo.html == 3 '/a' segments) |
class |
PredicatedDecideRule
Rule which applies the configured decision only if a test evaluates to true. |
class |
PrerequisiteAcceptDecideRule
Rule which ACCEPTs all 'prerequisite' URIs (those with a 'P' in the last hopsPath position). |
class |
QueueOverbudgetDecideRule
Applies configured decision to every candidate URI that would overbudget its queue. |
class |
RejectDecideRule
Rule which answers REJECT to everything evaluated. |
class |
ScopePlusOneDecideRule
Rule allows one level of discovery beyond configured scope (e.g. |
class |
SeedAcceptDecideRule
Rule which ACCEPTs all 'seed' URIs (those for which isSeed is true). |
class |
SurtPrefixedDecideRule
Rule applies configured decision to any URIs that, when expressed in SURT form, begin with one of the prefixes in the configured set. |
class |
TooManyHopsDecideRule
Rule REJECTs any CrawlURIs whose total number of hops (length of the hopsPath string, traversed links of any type) is over a threshold. |
class |
TooManyPathSegmentsDecideRule
Rule REJECTs any CrawlURIs whose total number of path-segments (as indicated by the count of '/' characters not including the first '//') is over a given threshold. |
class |
TransclusionDecideRule
Rule ACCEPTs any CrawlURIs whose path-from-seed ('hopsPath' -- see CandidateURI.getPathFromSeed()) ends with at least one, but not more than, the given number of non-navlink ('L') hops. |
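
The three-valued decision model and the sequencing of rules described in the table above can be illustrated with a small standalone sketch. It does not use the Heritrix classes: `Decision`, `decide`, `pathSegments`, and the threshold of 3 are hypothetical names and values chosen for illustration, and the "last non-PASS decision wins" policy is an assumption about how a DecideRuleSequence combines its rules. The segment count follows the TooManyPathSegmentsDecideRule description ('/' characters, not including the first '//').

```java
import java.util.List;
import java.util.function.Function;

public class DecideSequenceSketch {
    // Hypothetical stand-ins for DecideRule.ACCEPT, DecideRule.REJECT and DecideRule.PASS.
    enum Decision { ACCEPT, REJECT, PASS }

    // Counts '/' characters that come after the first "//" in the URI string.
    static int pathSegments(String uri) {
        int start = uri.indexOf("//");
        int from = (start < 0) ? 0 : start + 2;
        int count = 0;
        for (int i = from; i < uri.length(); i++) {
            if (uri.charAt(i) == '/') {
                count++;
            }
        }
        return count;
    }

    // Assumption: rules are applied in turn and the last non-PASS answer wins.
    static Decision decide(List<Function<String, Decision>> rules, String uri) {
        Decision result = Decision.PASS;
        for (Function<String, Decision> rule : rules) {
            Decision d = rule.apply(uri);
            if (d != Decision.PASS) {
                result = d;
            }
        }
        return result;
    }

    // A two-rule sequence: accept everything, then reject URIs with too many segments.
    static Decision example(String uri) {
        List<Function<String, Decision>> rules = List.of(
            u -> Decision.ACCEPT,
            u -> pathSegments(u) > 3 ? Decision.REJECT : Decision.PASS);
        return decide(rules, uri);
    }

    public static void main(String[] args) {
        System.out.println(example("http://example.com/a/b.html"));
        System.out.println(example("http://example.com/a/b/c/d/e.html"));
    }
}
```

A rule that answers PASS leaves the running decision untouched, which is why the second rule in the sketch only overrides the initial ACCEPT for the deeper URI.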
Uses of ModuleType in org.archive.crawler.deciderules.recrawl |
---|
Subclasses of ModuleType in org.archive.crawler.deciderules.recrawl | |
---|---|
class |
IdenticalDigestDecideRule
Rule applies configured decision to any CrawlURIs whose prior-history content-digest matches the latest fetch. |
Uses of ModuleType in org.archive.crawler.extractor |
---|
Subclasses of ModuleType in org.archive.crawler.extractor | |
---|---|
class |
AggressiveExtractorHTML
Extended version of ExtractorHTML with more aggressive javascript link extraction, where javascript code is parsed first with the general HTML tags regexp, and then by the javascript speculative link regexp. |
class |
ChangeEvaluator
This processor compares the CrawlURI's current content digest with the one from a previous crawl. |
class |
Extractor
Convenience shared superclass for Extractor Processors. |
class |
ExtractorCSS
This extractor parses URIs from CSS type files. |
class |
ExtractorDOC
This class allows the caller to extract href style links from word97-format word documents. |
class |
ExtractorHTML
Basic link-extraction, from an HTML content-body, using regular expressions. |
class |
ExtractorHTTP
Extracts URIs from HTTP response headers. |
class |
ExtractorImpliedURI
An extractor for finding 'implied' URIs inside other URIs. |
class |
ExtractorJS
Processes Javascript files for strings that are likely to be crawlable URIs. |
class |
ExtractorPDF
Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs. |
class |
ExtractorSWF
Process SWF (flash/shockwave) files for strings that are likely to be crawlable URIs. |
class |
ExtractorUniversal
A last ditch extractor that will look at the raw byte code and try to extract anything that looks like a link. |
class |
ExtractorURI
An extractor for finding URIs inside other URIs. |
class |
ExtractorXML
A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents) |
class |
HTTPContentDigest
A processor for calculating custom HTTP content digests in place of the default (if any) computed by the HTTP fetcher processors. |
class |
JerichoExtractorHTML
Improved link-extraction from an HTML content-body using jericho-html parser. |
class |
TrapSuppressExtractor
Pseudo-extractor that suppresses link-extraction of likely trap pages, by noticing when content's digest is identical to that of its 'via'. |
Uses of ModuleType in org.archive.crawler.fetcher |
---|
Subclasses of ModuleType in org.archive.crawler.fetcher | |
---|---|
class |
FetchDNS
Processor to resolve 'dns:' URIs. |
class |
FetchFTP
Fetches documents and directory listings using FTP. |
class |
FetchHTTP
HTTP fetcher that uses Apache Jakarta Commons HttpClient library. |
Uses of ModuleType in org.archive.crawler.filter |
---|
Subclasses of ModuleType in org.archive.crawler.filter | |
---|---|
class |
ContentTypeRegExpFilter
Deprecated. As of release 1.10.0. To be replaced by an equivalent DecideRule. |
class |
FilePatternFilter
Deprecated. As of release 1.10.0. Replaced by MatchesFilePatternDecideRule. |
class |
HopsFilter
Deprecated. As of release 1.10.0. Replaced by DecidingFilter and equivalent DecideRule. |
class |
HTTPMidFetchUnchangedFilter
A mid fetch filter for HTTP fetcher processors. |
class |
OrFilter
Deprecated. As of release 1.10.0. Replaced by DecidingFilter and DecideRule. |
class |
PathDepthFilter
Deprecated. As of release 1.10.0. Replaced by DecidingFilter and equivalent DecideRule. |
class |
PathologicalPathFilter
Deprecated. As of release 1.10.0. Replaced by DecidingFilter and equivalent DecideRule. |
class |
SurtPrefixFilter
Deprecated. As of release 1.10.0. Replaced by DecidingFilter and equivalent DecideRule. |
class |
TransclusionFilter
Deprecated. As of release 1.10.0. Replaced by DecidingFilter and equivalent DecideRule. |
class |
URIListRegExpFilter
Deprecated. As of release 1.10.0. Replaced by DecidingFilter and equivalent DecideRule. |
class |
URIRegExpFilter
Deprecated. As of release 1.10.0. Replaced by DecidingFilter and equivalent DecideRule. |
Uses of ModuleType in org.archive.crawler.framework |
---|
Subclasses of ModuleType in org.archive.crawler.framework | |
---|---|
class |
AbstractTracker
A partial implementation of the StatisticsTracking interface. |
class |
CrawlScope
A CrawlScope instance defines which URIs are "in" a particular crawl. |
class |
Filter
Base class for filter classes. |
class |
Processor
Base class for URI processing classes. |
class |
Scoper
Base class for Scopers. |
class |
WriterPoolProcessor
Abstract implementation of a file pool processor. |
Uses of ModuleType in org.archive.crawler.frontier |
---|
Subclasses of ModuleType in org.archive.crawler.frontier | |
---|---|
class |
AbstractFrontier
Shared facilities for Frontier implementations. |
class |
AdaptiveRevisitFrontier
A Frontier that will repeatedly visit all encountered URIs. |
class |
BdbFrontier
A Frontier using several BerkeleyDB JE Databases to hold its record of known hosts (queues), and pending URIs. |
class |
DomainSensitiveFrontier
Deprecated. As of release 1.10.0. Replaced by BdbFrontier and QuotaEnforcer. |
class |
WorkQueueFrontier
A common Frontier base using several queues to hold pending URIs. |
Uses of ModuleType in org.archive.crawler.postprocessor |
---|
Subclasses of ModuleType in org.archive.crawler.postprocessor | |
---|---|
class |
AcceptRevisitProcessor
Set a URI to be revisited by the ARFrontier. |
class |
ContentBasedWaitEvaluator
A WaitEvaluator that compares the CrawlURI's content type to a configurable regular expression. |
class |
CrawlStateUpdater
A step, late in the processing of a CrawlURI, for updating the per-host information that may have been affected by the fetch. |
class |
FrontierScheduler
'Schedule' with the Frontier CandidateURIs being carried by the passed CrawlURI. |
class |
ImageWaitEvaluator
A specialized ContentBasedWaitEvaluator. |
class |
LinksScoper
Determine which extracted links are within scope. |
class |
LowDiskPauseProcessor
Processor module which uses 'df -k', where available and with the expected output format (on Linux), to monitor available disk space and pause the crawl if free space on monitored filesystems falls below certain thresholds. |
class |
RejectRevisitProcessor
Set a URI to not be revisited by the ARFrontier. |
class |
SupplementaryLinksScoper
Run CandidateURI links carried in the passed CrawlURI through a filter and 'handle' rejections. |
class |
TextWaitEvaluator
A specialized ContentBasedWaitEvaluator. |
class |
WaitEvaluator
A processor that determines when a URI should be revisited next. |
Uses of ModuleType in org.archive.crawler.prefetch |
---|
Subclasses of ModuleType in org.archive.crawler.prefetch | |
---|---|
class |
PreconditionEnforcer
Ensures the preconditions for a fetch -- such as DNS lookup or acquiring and respecting a robots.txt policy -- are satisfied before a URI is passed to subsequent stages. |
class |
Preselector
If set to recheck the crawl's scope, gives a yes/no on whether a CrawlURI should be processed at all. |
class |
QuotaEnforcer
A simple quota enforcer. |
class |
RuntimeLimitEnforcer
A processor to enforce runtime limits on crawls. |
Uses of ModuleType in org.archive.crawler.processor |
---|
Subclasses of ModuleType in org.archive.crawler.processor | |
---|---|
class |
BeanShellProcessor
A processor which runs a BeanShell script on the CrawlURI. |
class |
CrawlMapper
A simple crawl splitter/mapper, dividing up CandidateURIs/CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers). |
class |
HashCrawlMapper
Maps URIs to one of N crawler names by applying a hash to the URI's (possibly-transformed) classKey. |
class |
LexicalCrawlMapper
A simple crawl splitter/mapper, dividing up CandidateURIs/CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers). |
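
The hash-based mapping that HashCrawlMapper is described as performing can be sketched as follows. This is a minimal illustration, not the Heritrix implementation: `bucketFor` and `crawlerNameFor` are hypothetical helpers, and `String.hashCode()` is only a stand-in, since the listing does not say which hash function or classKey transformation the real class applies.

```java
public class HashMapperSketch {
    // Hypothetical helper: maps a class key to an index in [0, crawlerCount).
    // Math.floorMod keeps the result non-negative even for negative hash codes.
    static int bucketFor(String classKey, int crawlerCount) {
        return Math.floorMod(classKey.hashCode(), crawlerCount);
    }

    // Hypothetical helper: picks one of N crawler names for the given class key.
    static String crawlerNameFor(String classKey, String[] crawlerNames) {
        return crawlerNames[bucketFor(classKey, crawlerNames.length)];
    }

    public static void main(String[] args) {
        String[] names = {"crawler0", "crawler1", "crawler2", "crawler3"};
        System.out.println(crawlerNameFor("example.com", names));
    }
}
```

Because the mapping depends only on the key and the number of crawlers, every crawler that evaluates the same key reaches the same answer without coordination.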
Uses of ModuleType in org.archive.crawler.processor.recrawl |
---|
Subclasses of ModuleType in org.archive.crawler.processor.recrawl | |
---|---|
class |
FetchHistoryProcessor
Maintain a history of fetch information inside the CrawlURI's attributes. |
class |
PersistLoadProcessor
Store CrawlURI attributes from latest fetch to persistent storage for consultation by a later recrawl. |
class |
PersistLogProcessor
Log CrawlURI attributes from latest fetch for consultation by a later recrawl. |
class |
PersistOnlineProcessor
Common superclass for persisting Processors which directly store/load to persistence (as opposed to logging for batch load later). |
class |
PersistProcessor
Superclass for Processors which utilize BDB-JE for URI state (including most notably history) persistence. |
class |
PersistStoreProcessor
Store CrawlURI attributes from latest fetch to persistent storage for consultation by a later recrawl. |
Uses of ModuleType in org.archive.crawler.scope |
---|
Subclasses of ModuleType in org.archive.crawler.scope | |
---|---|
class |
BroadScope
A CrawlScope instance defines which URIs are "in" a particular crawl. |
class |
ClassicScope
ClassicScope: superclass with shared Scope behavior for most common scopes. |
class |
DomainScope
Deprecated. As of release 1.10.0. Replaced by DecidingScope. |
class |
HostScope
Deprecated. As of release 1.10.0. Replaced by DecidingScope. |
class |
PathScope
Deprecated. As of release 1.10.0. Replaced by DecidingScope. |
class |
RefinedScope
Superclass for Scopes which make use of "additional focus" to add items by pattern, or want to swap in alternative transitive filter. |
class |
SeedCachingScope
A CrawlScope that caches its seed list for the convenience of scope-tests that are based on the seeds. |
class |
SurtPrefixScope
Deprecated. As of release 1.10.0. Replaced by DecidingScope. |
Uses of ModuleType in org.archive.crawler.settings |
---|
Methods in org.archive.crawler.settings that return ModuleType | |
---|---|
ModuleType |
SettingsHandler.getModule(java.lang.String name)
Get a module by name. |
ModuleType |
CrawlerSettings.getModule(java.lang.String name)
|
protected ModuleType |
CrawlerSettings.getTopLevelModule(java.lang.String name)
|
static ModuleType |
SettingsHandler.instantiateModuleTypeFromClassName(java.lang.String name, java.lang.String className)
Instantiate a new ModuleType given its name and className. |
Methods in org.archive.crawler.settings with parameters of type ModuleType | |
---|---|
protected void |
CrawlerSettings.addTopLevelModule(ModuleType module)
|
Uses of ModuleType in org.archive.crawler.url.canonicalize |
---|
Subclasses of ModuleType in org.archive.crawler.url.canonicalize | |
---|---|
class |
BaseRule
Base of all rules applied when canonicalizing a URL that are configurable via the Heritrix settings system. |
class |
FixupQueryStr
Strip any trailing question mark. |
class |
LowercaseRule
Lowercases the URL. |
class |
RegexRule
General conversion rule. |
class |
StripExtraSlashes
|
class |
StripSessionCFIDs
Strip cold fusion session ids. |
class |
StripSessionIDs
Strip known session ids. |
class |
StripUserinfoRule
Strip any 'userinfo' found on http/https URLs. |
class |
StripWWWNRule
Strip any 'www[0-9]*' found on http/https URLs IF they have some path/query component (content after third slash). |
class |
StripWWWRule
Strip any 'www' found on http/https URLs, IF they have some path/query component (content after third slash). |
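
The canonicalization rules listed above each transform a URL string, and several can be chained. The sketch below mimics three of them (LowercaseRule, StripWWWRule, FixupQueryStr) with plain string operations. It is an illustration under assumptions, not the Heritrix code: the class and method names are hypothetical, the regular expression (which removes a leading 'www.' including the dot) is a guess at the described behavior, and the order in which rules run is chosen arbitrarily here.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class CanonicalizeSketch {
    // Lowercases the whole URL, as LowercaseRule is described.
    static String lowercase(String url) {
        return url.toLowerCase(Locale.ROOT);
    }

    // Strips a leading "www." on http/https URLs, but only when something
    // follows the third slash (assumed reading of the StripWWWRule description).
    static String stripWww(String url) {
        return url.replaceFirst("^(https?://)www\\.([^/]+/.+)$", "$1$2");
    }

    // Strips a single trailing question mark, as FixupQueryStr is described.
    static String stripTrailingQuestionMark(String url) {
        return url.endsWith("?") ? url.substring(0, url.length() - 1) : url;
    }

    // Applies the rules one after another; the order is an assumption.
    static String canonicalize(String url) {
        List<UnaryOperator<String>> rules = List.of(
            CanonicalizeSketch::lowercase,
            CanonicalizeSketch::stripWww,
            CanonicalizeSketch::stripTrailingQuestionMark);
        String out = url;
        for (UnaryOperator<String> rule : rules) {
            out = rule.apply(out);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("HTTP://WWW.Example.com/Index.html?"));
    }
}
```

Note that `stripWww` leaves `http://www.example.com/` unchanged, because nothing follows the third slash.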
Uses of ModuleType in org.archive.crawler.writer |
---|
Subclasses of ModuleType in org.archive.crawler.writer | |
---|---|
class |
ARCWriterProcessor
Processor module for writing the results of successful fetches (and perhaps someday, certain kinds of network failures) to the Internet Archive ARC file format. |
class |
Kw3WriterProcessor
Processor module that writes the results of successful fetches to files on disk. |
class |
MirrorWriterProcessor
Processor module that writes the results of successful fetches to files on disk. |
class |
WARCWriterProcessor
WARCWriterProcessor. |