org.archive.crawler.url.canonicalize
Class StripWWWRule

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.url.canonicalize.BaseRule
                      extended by org.archive.crawler.url.canonicalize.StripWWWRule
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, CanonicalizationRule

public class StripWWWRule
extends BaseRule

Strips any 'www' found on http/https URLs, but only if the URL has some path/query component (content after the third slash). Top 'slash page' URIs are left unstripped: crawling redundant top pages is preferable to missing an entire site that is available from only one of the www-full or www-less hostnames.
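
The behavior described above can be illustrated with a single regular expression. This is a minimal sketch, not the StripWWWRule source: the class name StripWwwSketch, the method stripWww, and the pattern itself are assumptions made for illustration, and the settings context that the real rule receives is omitted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone illustration; not part of org.archive.crawler.
class StripWwwSketch {
    // Assumed pattern: group 1 is the http/https scheme, then a literal "www.",
    // then group 2 is the rest of the host, a slash, and at least one more
    // character (the path/query content after the third slash).
    private static final Pattern WWW =
        Pattern.compile("(?i)^(https?://)www\\.([^/]*/.+)$");

    // Returns the URL with "www." removed when the pattern matches,
    // otherwise returns the URL unchanged.
    static String stripWww(String url) {
        Matcher m = WWW.matcher(url);
        return m.matches() ? m.group(1) + m.group(2) : url;
    }

    public static void main(String[] args) {
        // Path present: the www prefix is stripped.
        System.out.println(stripWww("http://www.example.com/index.html"));
        // prints "http://example.com/index.html"

        // Top 'slash page': left unstripped.
        System.out.println(stripWww("http://www.example.com/"));
        // prints "http://www.example.com/"

        // Non-http scheme: left unstripped.
        System.out.println(stripWww("ftp://www.example.com/file.txt"));
        // prints "ftp://www.example.com/file.txt"
    }
}
```

Requiring at least one character after the third slash is what keeps top pages intact in this sketch; a URL with no slash after the hostname is also left alone.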

Version:
$Date: 2006-09-25 20:27:35 +0000 (Mon, 25 Sep 2006) $, $Revision: 4655 $
Author:
stack
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
 
Fields inherited from class org.archive.crawler.url.canonicalize.BaseRule
ATTR_ENABLED
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Constructor Summary
StripWWWRule(java.lang.String name)
           
 
Method Summary
 java.lang.String canonicalize(java.lang.String url, java.lang.Object context)
          Apply this canonicalization rule.
 
Methods inherited from class org.archive.crawler.url.canonicalize.BaseRule
doStripRegexMatch, isEnabled
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface org.archive.crawler.url.CanonicalizationRule
getName
 

Constructor Detail

StripWWWRule

public StripWWWRule(java.lang.String name)
Method Detail

canonicalize

public java.lang.String canonicalize(java.lang.String url,
                                     java.lang.Object context)
Description copied from interface: CanonicalizationRule
Apply this canonicalization rule.

Parameters:
url - URL string to which this rule is applied.
context - An object that will provide context for the settings system. The UURI of the URL we're canonicalizing is an example of an object that provides context.
Returns:
Result of applying this rule to passed url.


Copyright © 2003-2011 Internet Archive. All Rights Reserved.