org.archive.crawler.filter
Class PathologicalPathFilter

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Filter
                      extended by org.archive.crawler.filter.URIRegExpFilter
                          extended by org.archive.crawler.filter.PathologicalPathFilter
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean

Deprecated. As of release 1.10.0, replaced by DecidingFilter and an equivalent DecideRule.

public class PathologicalPathFilter
extends URIRegExpFilter

Checks whether a URI contains a pattern that is repeated a specified number of times. This helps the crawler avoid traps in which the server keeps appending the same pattern to the requested URI, for example: http://host/img/img/img/img..... The filter returns TRUE if the path is pathological and FALSE otherwise.
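
The repetition check described above can be sketched with a regular-expression backreference. This is a minimal illustration, not this class's implementation: the class name PathologicalPathSketch, the helper methods, and the exact meaning given to the repetitions argument (total consecutive occurrences of one path segment) are all assumptions made for the example.

```java
import java.util.regex.Pattern;

public class PathologicalPathSketch {

    // Hypothetical helper: builds a pattern that matches a URI in which one
    // path segment ("name/") occurs at least `repetitions` times in a row.
    // Assumes repetitions >= 2.
    static Pattern buildPattern(int repetitions) {
        // Group 1 captures a single segment; \1{n-1,} then requires at
        // least n-1 further consecutive copies of that same segment.
        return Pattern.compile(
                ".*?/([^/]+/)\\1{" + (repetitions - 1) + ",}.*");
    }

    // Returns true when the URI looks pathological under the assumed rule.
    static boolean isPathological(String uri, int repetitions) {
        return buildPattern(repetitions).matcher(uri).matches();
    }

    public static void main(String[] args) {
        // "img/" appears four times in a row, so a threshold of 3 is met.
        System.out.println(
                isPathological("http://host/img/img/img/img/a.png", 3));
        // "img/" appears twice but never consecutively.
        System.out.println(
                isPathological("http://host/img/css/img/a.png", 3));
    }
}
```

The backreference is what makes the check general: no particular segment name is hard-coded, so any segment that repeats back to back trips the pattern.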

Author:
John Erik Halse
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_REPETITIONS
          Deprecated.  
static java.lang.Integer DEFAULT_REPETITIONS
          Deprecated.  
 
Fields inherited from class org.archive.crawler.filter.URIRegExpFilter
ATTR_MATCH_RETURN_VALUE, ATTR_REGEXP
 
Fields inherited from class org.archive.crawler.framework.Filter
ATTR_ENABLED
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Constructor Summary
PathologicalPathFilter(java.lang.String name)
          Deprecated. Constructs a new PathologicalPathFilter.
 
Method Summary
protected  boolean getFilterOffPosition(CrawlURI curi)
          Deprecated. If the filter is disabled, the value returned by this method is what filters return as their disabled setting.
protected  java.lang.String getRegexp(java.lang.Object o)
          Deprecated. Construct the regexp string to be matched against the URI.
 
Methods inherited from class org.archive.crawler.filter.URIRegExpFilter
innerAccepts, returnTrueIfMatches
 
Methods inherited from class org.archive.crawler.framework.Filter
accepts, kickUpdate, toString
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

ATTR_REPETITIONS

public static final java.lang.String ATTR_REPETITIONS
Deprecated. 
See Also:
Constant Field Values

DEFAULT_REPETITIONS

public static final java.lang.Integer DEFAULT_REPETITIONS
Deprecated. 
Constructor Detail

PathologicalPathFilter

public PathologicalPathFilter(java.lang.String name)
Deprecated. 
Constructs a new PathologicalPathFilter.

Parameters:
name - the name of the filter.
Method Detail

getRegexp

protected java.lang.String getRegexp(java.lang.Object o)
Deprecated. 
Construct the regexp string to be matched against the URI.

Overrides:
getRegexp in class URIRegExpFilter
Parameters:
o - an object to extract a URI from.
Returns:
the regexp pattern.

getFilterOffPosition

protected boolean getFilterOffPosition(CrawlURI curi)
Deprecated. 
Description copied from class: Filter
If the filter is disabled, the value returned by this method is what the filter returns in place of evaluating the URI. The default is to return 'true' (continue processing), but some filters, the exclude filters for example, will want to return false when disabled so that processing can continue.

Overrides:
getFilterOffPosition in class Filter
Parameters:
curi - CrawlURI to use as context. Passed curi can be null.
Returns:
This filter's 'off' position.
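
The 'off' position described above can be sketched as follows. This is a self-contained illustration only: SketchFilter, ExcludeImages, and the enabled flag are hypothetical stand-ins written for this example and are not the Heritrix Filter classes, which take a CrawlURI and read their enabled state from the settings framework.

```java
public class FilterOffPositionSketch {

    // Hypothetical stand-in for a filter base class.
    abstract static class SketchFilter {
        boolean enabled = true;

        // Value reported for every URI while the filter is disabled.
        // Default is true, meaning "continue processing".
        protected boolean getFilterOffPosition() {
            return true;
        }

        // The filter's real test, used only while enabled.
        protected abstract boolean innerAccepts(String uri);

        public final boolean accepts(String uri) {
            if (!enabled) {
                return getFilterOffPosition();
            }
            return innerAccepts(uri);
        }
    }

    // An exclude-style filter: true means "exclude this URI", so its
    // off position is false to let processing continue when disabled.
    static class ExcludeImages extends SketchFilter {
        @Override
        protected boolean getFilterOffPosition() {
            return false;
        }

        @Override
        protected boolean innerAccepts(String uri) {
            return uri.endsWith(".png");
        }
    }

    // Convenience for the example: evaluates a URI with the filter
    // switched on or off.
    static boolean evaluate(String uri, boolean enabled) {
        ExcludeImages filter = new ExcludeImages();
        filter.enabled = enabled;
        return filter.accepts(uri);
    }

    public static void main(String[] args) {
        System.out.println(evaluate("http://host/a.png", true));
        System.out.println(evaluate("http://host/a.png", false));
    }
}
```

With the filter enabled, the image URI is reported as a match; with it disabled, the overridden off position is returned regardless of the URI.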


Copyright © 2003-2011 Internet Archive. All Rights Reserved.