Package org.archive.crawler.url.canonicalize

Class Summary
BaseRule Base class for all URL canonicalization rules that are configurable via the Heritrix settings system.
FixupQueryStr Strip any trailing question mark.
LowercaseRule Lowercases the URL.
RegexRule General conversion rule.
StripExtraSlashes  
StripSessionCFIDs Strip ColdFusion session IDs.
StripSessionIDs Strip known session IDs.
StripUserinfoRule Strip any 'userinfo' found on http/https URLs.
StripWWWNRule Strip any 'www[0-9]*' found on http/https URLs, IF they have some path/query component (content after third slash).
StripWWWRule Strip any 'www' found on http/https URLs, IF they have some path/query component (content after third slash).
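
To make the effect of a few of these rules concrete, here is a minimal sketch that approximates them with plain string operations. It does not use the Heritrix classes or their actual regular expressions. The class name CanonicalizeSketch, the method names, the patterns, and the order in which the steps are applied are all assumptions made for illustration.

```java
// Illustrative sketch only: plain-string approximations of some of the rules
// listed above. This is NOT the Heritrix implementation; the class name,
// method names, regular expressions, and rule order are assumptions.
final class CanonicalizeSketch {

    // Approximates LowercaseRule: lowercase the whole URL.
    static String lowercase(String url) {
        return url.toLowerCase(java.util.Locale.ROOT);
    }

    // Approximates FixupQueryStr: drop a single trailing question mark.
    static String stripTrailingQuestionMark(String url) {
        return url.endsWith("?") ? url.substring(0, url.length() - 1) : url;
    }

    // Approximates StripUserinfoRule: remove "user@" or "user:pass@"
    // directly after the http/https scheme.
    static String stripUserinfo(String url) {
        return url.replaceFirst("^(?i)(https?://)[^/@]+@", "$1");
    }

    // Approximates StripWWWNRule: remove "www.", "www1.", "www23.", ...
    // but only when some content follows the third slash.
    static String stripWwwN(String url) {
        return url.replaceFirst("^(?i)(https?://)www[0-9]*\\.([^/]*/.+)$", "$1$2");
    }

    // Applies the approximated rules in one assumed order.
    static String canonicalize(String url) {
        String result = lowercase(url);
        result = stripUserinfo(result);
        result = stripWwwN(result);
        return stripTrailingQuestionMark(result);
    }
}
```

Under these assumptions, CanonicalizeSketch.canonicalize("HTTP://User@WWW2.Example.com/Path?") yields "http://example.com/path", while "http://www.example.com/" is left unchanged because nothing follows the third slash.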
 



Copyright © 2003-2011 Internet Archive. All Rights Reserved.