Package org.archive.crawler.deciderules

Provides classes for a simple decision rules framework.

Interface Summary
ExternalGeoLookupInterface Interface used by ExternalGeoLocationDecideRule.
ExternalImplInterface Interface used by ExternalImplDecideRule.
 

Class Summary
AcceptDecideRule Rule which responds ACCEPT to anything passed in.
AddRedirectFromRootServerToScope  
BeanShellDecideRule Rule which runs a BeanShell script to make its decision.
ClassKeyMatchesRegExpDecideRule Rule applies configured decision to any CrawlURI whose class key matches the supplied regexp.
ConfiguredDecideRule Rule which can be configured to ACCEPT or REJECT at operator's option.
ContentTypeMatchesRegExpDecideRule DecideRule whose decision is applied if the URI's content-type is present and matches the supplied regular expression.
ContentTypeNotMatchesRegExpDecideRule DecideRule whose decision is applied if the URI's content-type is present and does not match the supplied regular expression.
DecideRule Interface for rules which, given an object to evaluate, respond with a decision: DecideRule.ACCEPT, DecideRule.REJECT, or DecideRule.PASS.
DecideRuleSequence RuleSequence represents a series of Rules, which are applied in turn to give the final result.
DecidingFilter DecidingFilter: a classic Filter which makes its accept/reject decision based on whatever DecideRules have been set up inside it.
DecidingScope DecidingScope: a Scope which makes its accept/reject decision based on whatever DecideRules have been set up inside it.
ExceedsDocumentLengthTresholdDecideRule  
ExternalGeoLocationDecideRule A rule that can be configured to take alternate implementations of the ExternalGeoLookupInterface.
ExternalImplDecideRule A rule that can be configured to take alternate implementations of the ExternalImplInterface.
FetchStatusDecideRule Rule applies the configured decision for any URI which has a fetch status equal to the 'target-status' setting.
FetchStatusMatchesRegExpDecideRule  
FetchStatusNotMatchesRegExpDecideRule  
FilterDecideRule FilterDecideRule wraps a legacy Filter for use in DecideRule contexts.
HasViaDecideRule Rule applies the configured decision for any URI which has a 'via' (essentially, any URI that was discovered from another URI, as opposed to a seed or some kinds of mid-crawl adds).
HopsPathMatchesRegExpDecideRule Rule applies configured decision to any CrawlURIs whose 'hops-path' (string like "LLXE" etc.) matches the supplied regexp.
IsCrossTopmostAssignedSurtHopDecideRule Applies its decision if the current URI differs in that portion of its hostname/domain that is assigned/sold by registrars (AKA its 'topmost assigned SURT' or 'public suffix'.)
MatchesFilePatternDecideRule Compares suffix of a passed CrawlURI, UURI, or String against a regular expression pattern, applying its configured decision to all matches.
MatchesListRegExpDecideRule Rule applies configured decision to any CrawlURIs whose String URI matches the supplied regexps.
MatchesRegExpDecideRule Rule applies configured decision to any CrawlURIs whose String URI matches the supplied regexp.
NotExceedsDocumentLengthTresholdDecideRule  
NotMatchesFilePatternDecideRule Rule applies configured decision to any URIs which do *not* match the supplied (file-pattern) regexp.
NotMatchesListRegExpDecideRule Rule applies configured decision to any URIs which do *not* match the supplied regexps.
NotMatchesRegExpDecideRule Rule applies configured decision to any URIs which do *not* match the supplied regexp.
NotOnDomainsDecideRule Rule applies configured decision to any URIs that are *not* in one of the domains in the configured set of domains, filled from the seed set.
NotOnHostsDecideRule Rule applies configured decision to any URIs that are *not* on one of the hosts in the configured set of hosts, filled from the seed set.
NotSurtPrefixedDecideRule Rule applies configured decision to any URIs that, when expressed in SURT form, do *not* begin with one of the prefixes in the configured set.
OnDomainsDecideRule Rule applies configured decision to any URIs that are on one of the domains in the configured set of domains, filled from the seed set.
OnHostsDecideRule Rule applies configured decision to any URIs that are on one of the hosts in the configured set of hosts, filled from the seed set.
PathologicalPathDecideRule Rule REJECTs any URI which contains an excessive number of identical, consecutive path-segments (e.g. http://example.com/a/a/a/boo.html has 3 consecutive '/a' segments).
PredicatedDecideRule Rule which applies the configured decision only if a test evaluates to true.
PrerequisiteAcceptDecideRule Rule which ACCEPTs all 'prerequisite' URIs (those with a 'P' in the last hopsPath position).
QueueOverbudgetDecideRule Applies configured decision to every candidate URI that would overbudget its queue.
RejectDecideRule Rule which answers REJECT to everything evaluated.
ScopePlusOneDecideRule Rule allows one level of discovery beyond configured scope.
SeedAcceptDecideRule Rule which ACCEPTs all 'seed' URIs (those for which isSeed is true).
SurtPrefixedDecideRule Rule applies configured decision to any URIs that, when expressed in SURT form, begin with one of the prefixes in the configured set.
TooManyHopsDecideRule Rule REJECTs any CrawlURIs whose total number of hops (length of the hopsPath string, traversed links of any type) is over a threshold.
TooManyPathSegmentsDecideRule Rule REJECTs any CrawlURIs whose total number of path-segments (as indicated by the count of '/' characters not including the first '//') is over a given threshold.
TransclusionDecideRule Rule ACCEPTs any CrawlURIs whose path-from-seed ('hopsPath' -- see CandidateURI.getPathFromSeed()) ends with at least one, but no more than the given number of, non-navlink hops (hops other than 'L').
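
Several of the rules summarized above share one shape: a test on the URI plus a configured decision. A minimal sketch of that shape, using the path-segment count described for TooManyPathSegmentsDecideRule, might look like the following. The class and method names are hypothetical and are not the Heritrix API; returning PASS when the test fails is an assumption.

```java
// Hypothetical sketch; none of these names are part of the Heritrix API.
final class PredicatedRuleSketch {
    enum Decision { ACCEPT, REJECT, PASS }

    /** Counts '/' characters in a URI string, skipping the "//" after the scheme. */
    static int countPathSegments(String uri) {
        int schemeSlashes = uri.indexOf("//");
        int from = (schemeSlashes >= 0) ? schemeSlashes + 2 : 0;
        int count = 0;
        for (int i = from; i < uri.length(); i++) {
            if (uri.charAt(i) == '/') {
                count++;
            }
        }
        return count;
    }

    /** Applies the configured decision when the test holds; otherwise PASS (assumed). */
    static Decision decide(boolean testResult, Decision configured) {
        return testResult ? configured : Decision.PASS;
    }

    /** REJECTs a URI whose path-segment count is over the threshold. */
    static Decision tooManyPathSegments(String uri, int maxSegments) {
        return decide(countPathSegments(uri) > maxSegments, Decision.REJECT);
    }

    public static void main(String[] args) {
        System.out.println(tooManyPathSegments("http://example.com/a/b/c/d.html", 3)); // REJECT
        System.out.println(tooManyPathSegments("http://example.com/a/b.html", 3));     // PASS
    }
}
```

Keeping the test separate from the decision lets one predicate serve both an ACCEPT-configured and a REJECT-configured rule.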
 

Package org.archive.crawler.deciderules Description

Each 'step' in a decision rule set which can affect an object's ultimate fate is called a DecideRule. Each DecideRule renders a decision (possibly neutral) on the passed object's fate.

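Possible decisions are DecideRule.ACCEPT, DecideRule.REJECT, and DecideRule.PASS; a PASS is the neutral outcome, meaning the rule expresses no preference.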

Each DecideRule is applied in turn; the last one to express a non-PASS preference wins.

For example, if the rules are:

Then you have a crawl that will go 3 hops (of any type) from the seeds, with a special affordance to get prerequisites of 3-hop items (which may be 4 "hops" out).
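
The rule list for this example is not shown in this copy of the page. One sequence consistent with the outcome described would be a default accept, then a hop limit of 3, then an accept for prerequisites, loosely modeled on AcceptDecideRule, TooManyHopsDecideRule, and PrerequisiteAcceptDecideRule from the class summary. The sketch below assumes that sequence and uses hypothetical names throughout; it is not the Heritrix API. It only illustrates the "last non-PASS decision wins" evaluation over a hops-path string.

```java
// Hypothetical sketch; the rule list and all names here are assumptions.
final class DecideRuleSequenceSketch {
    enum Decision { ACCEPT, REJECT, PASS }

    /** A rule looks at a hops path such as "LLP" and returns a decision. */
    interface Rule {
        Decision decide(String hopsPath);
    }

    /** Applies every rule in order; the last non-PASS decision wins. */
    static Decision evaluate(String hopsPath, Rule... rules) {
        Decision result = Decision.PASS;
        for (Rule rule : rules) {
            Decision d = rule.decide(hopsPath);
            if (d != Decision.PASS) {
                result = d;
            }
        }
        return result;
    }

    /** Establishes a default by accepting everything. */
    static final Rule ACCEPT_ALL = hopsPath -> Decision.ACCEPT;

    /** REJECTs when the hops path is longer than maxHops; otherwise PASS. */
    static Rule tooManyHops(int maxHops) {
        return hopsPath -> hopsPath.length() > maxHops ? Decision.REJECT : Decision.PASS;
    }

    /** ACCEPTs when the last hop is a prerequisite ('P'); otherwise PASS. */
    static final Rule PREREQUISITE_ACCEPT =
            hopsPath -> hopsPath.endsWith("P") ? Decision.ACCEPT : Decision.PASS;

    static Decision threeHopCrawl(String hopsPath) {
        return evaluate(hopsPath, ACCEPT_ALL, tooManyHops(3), PREREQUISITE_ACCEPT);
    }

    public static void main(String[] args) {
        System.out.println(threeHopCrawl("LLL"));  // ACCEPT: within 3 hops
        System.out.println(threeHopCrawl("LLLL")); // REJECT: 4 hops, not a prerequisite
        System.out.println(threeHopCrawl("LLLP")); // ACCEPT: prerequisite of a 3-hop item
    }
}
```

Order matters in this scheme: placing the prerequisite rule after the hop limit is what lets it override the REJECT for a path such as "LLLP".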

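To allow this style of decision processing to be plugged into the existing Filter and Scope slots, the package provides DecidingFilter and DecidingScope, each of which makes its accept/reject decision based on whatever DecideRules have been set up inside it.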
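
As a rough illustration of such an adapter, the sketch below wraps a rule sequence behind a yes/no filter method. All names are hypothetical and this is not the Heritrix DecidingFilter; in particular, treating anything other than a final ACCEPT (including an overall PASS) as "not accepted" is an assumption.

```java
// Hypothetical sketch; not the actual DecidingFilter implementation.
final class DecidingFilterSketch {
    enum Decision { ACCEPT, REJECT, PASS }

    /** A rule inspects a URI string and returns a decision. */
    interface Rule {
        Decision decide(String uri);
    }

    private final Rule[] rules;

    DecidingFilterSketch(Rule... rules) {
        this.rules = rules;
    }

    /** Filter-style answer: true only if the final decision is ACCEPT (assumed). */
    boolean accepts(String uri) {
        Decision result = Decision.PASS;
        for (Rule rule : rules) {
            Decision d = rule.decide(uri);
            if (d != Decision.PASS) {
                result = d; // the last non-PASS decision wins
            }
        }
        return result == Decision.ACCEPT;
    }

    public static void main(String[] args) {
        DecidingFilterSketch filter = new DecidingFilterSketch(
                uri -> Decision.ACCEPT,
                uri -> uri.endsWith(".exe") ? Decision.REJECT : Decision.PASS);
        System.out.println(filter.accepts("http://example.com/index.html")); // true
        System.out.println(filter.accepts("http://example.com/setup.exe"));  // false
    }
}
```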

See NewScopingModel for background.



Copyright © 2003-2011 Internet Archive. All Rights Reserved.