org.archive.util
Class UriUtils

java.lang.Object
  extended by org.archive.util.UriUtils

public class UriUtils
extends java.lang.Object

URI-related utilities. Primarily, a place to centralize, better document, and test certain URI-related heuristics that may be useful in many places.

The choice of when to consider a string likely enough to be a URI that we try crawling it is, so far, based on rather arbitrary rules of thumb. We have not quantitatively tested how often the strings that pass these tests yield meaningful (non-404, non-soft-404, non-garbage) replies. We are willing to accept some level of mistaken requests, knowing that their cost is usually negligible, if that allows us to discover meaningful content that could not be discovered via other heuristics.

Our intuitive understanding so far is that strings that appear to have ./ or ../ relative-path prefixes, dot-extensions, or path slashes are good candidates for trying as URIs, even though with some JavaScript and HTML VALUE attributes this yields a lot of false positives.

We want to get strings like:

  photo.jpg
  /photos
  /photos/
  ./photos
  ../../photos
  photos/index.html

...but we will thus also sometimes try strings that were other kinds of variables or parameters, like:

  rectangle.x
  11.2px
  text/xml
  width:6.33

Until better rules, exception blacklists, or even site-sensitive dynamic adjustment of heuristics (e.g., on this site guesses are yielding 200s, so keep guessing; on this site guesses are all 404s, so stop guessing) are developed, crawl operators should monitor their crawls (and contact email) for cases where speculative crawling is generating many errors, and use settings like ExtractorHTML's 'extract-javascript' and 'extract-value-attributes', or disable ExtractorJS entirely, when they want to curtail those errors.

The 'legacy' tests are those used in H1 at least through 1.14.4. They have some known problems, but are not being dropped until more experience with the 'new' isLikelyUri() test is collected (in H3). Enable the 'xest' methods of the UriUtilsTest class for details.
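
The rules of thumb in the class description (relative-path prefix, dot-extension, path slash) can be illustrated with a minimal, self-contained sketch. This is not the implementation behind isLikelyUri(); the class name LikelyUriSketch, the method looksLikeUri, the regular expressions, and the whitespace rejection are all assumptions made for illustration. It accepts the wanted examples listed above and, as the description warns, also accepts the false positives.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration only; not the Heritrix implementation. */
final class LikelyUriSketch {

    // Assumed rule: a trailing dot-extension of 1 to 5 letters or digits,
    // such as ".jpg" or ".html". Note that it also matches "rectangle.x".
    private static final Pattern DOT_EXTENSION =
            Pattern.compile(".*\\.[A-Za-z0-9]{1,5}");

    // Assumed rule: any whitespace disqualifies the candidate.
    private static final Pattern WHITESPACE = Pattern.compile("\\s");

    private LikelyUriSketch() {
    }

    /** Returns true if the candidate passes any of the three rough tests. */
    static boolean looksLikeUri(CharSequence candidate) {
        if (candidate == null || candidate.length() == 0) {
            return false;
        }
        String s = candidate.toString();
        if (WHITESPACE.matcher(s).find()) {
            return false;
        }
        // "./" or "../" relative-path prefix. The slash test below already
        // covers these; the check is kept to mirror the description.
        boolean relativePrefix = s.startsWith("./") || s.startsWith("../");
        // Any path slash, as in "/photos" or "photos/index.html".
        boolean hasSlash = s.indexOf('/') >= 0;
        // Dot-extension, as in "photo.jpg".
        boolean hasExtension = DOT_EXTENSION.matcher(s).matches();
        return relativePrefix || hasSlash || hasExtension;
    }
}
```

With these assumed rules, looksLikeUri("photo.jpg") and looksLikeUri("../../photos") return true, but so do looksLikeUri("11.2px") and looksLikeUri("text/xml"), which is the kind of mistaken request the description says operators should watch for.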


Field Summary
(package private) static java.lang.String LIKELY_URI_PATH
           
(package private) static java.lang.String NAIVE_LIKELY_URI_PATTERN
           
protected static java.lang.String[] NAIVE_URI_EXCEPTIONS
           
(package private) static java.lang.String STRING_URI_DETECTOR
           
protected static java.lang.String[] STRING_URI_DETECTOR_EXCEPTIONS
           
 
Constructor Summary
UriUtils()
           
 
Method Summary
static boolean isLikelyUri(java.lang.CharSequence candidate)
           
static boolean isLikelyUriHtmlContextLegacy(java.lang.CharSequence candidate)
           
static boolean isLikelyUriJavascriptContextLegacy(java.lang.CharSequence candidate)
           
static java.lang.String speculativeFixup(java.lang.String candidate, UURI base)
          Perform additional fixup of likely-URI Strings
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

NAIVE_LIKELY_URI_PATTERN

static final java.lang.String NAIVE_LIKELY_URI_PATTERN
See Also:
Constant Field Values

NAIVE_URI_EXCEPTIONS

protected static final java.lang.String[] NAIVE_URI_EXCEPTIONS

STRING_URI_DETECTOR

static final java.lang.String STRING_URI_DETECTOR
See Also:
Constant Field Values

STRING_URI_DETECTOR_EXCEPTIONS

protected static final java.lang.String[] STRING_URI_DETECTOR_EXCEPTIONS

LIKELY_URI_PATH

static final java.lang.String LIKELY_URI_PATH
See Also:
Constant Field Values
Constructor Detail

UriUtils

public UriUtils()
Method Detail

isLikelyUri

public static boolean isLikelyUri(java.lang.CharSequence candidate)

speculativeFixup

public static java.lang.String speculativeFixup(java.lang.String candidate,
                                                UURI base)
Perform additional fixup of likely-URI Strings

Parameters:
candidate - detected candidate String
Returns:
String changed/decoded to increase likelihood it is a meaningful non-404 URI
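
This page does not say which changes or decodings speculativeFixup applies. As a rough sketch of what "changed/decoded" could mean, the hypothetical helper below unescapes HTML-escaped ampersands and percent-decodes a candidate that starts with an encoded "http:" or "https:" scheme. Both transformations, the class name SpeculativeFixupSketch, and the method fixup are assumptions for illustration; the sketch also omits the UURI base parameter of the real method.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical illustration only; not the Heritrix implementation. */
final class SpeculativeFixupSketch {

    private SpeculativeFixupSketch() {
    }

    static String fixup(String candidate) {
        // Assumed step 1: strings taken from HTML may carry "&amp;" where
        // a plain "&" was intended.
        String result = candidate.replace("&amp;", "&");

        // Assumed step 2: decode only when the whole string looks like a
        // percent-encoded absolute URI (case-insensitive "http%3A" or
        // "https%3A" prefix).
        boolean encodedScheme =
                result.regionMatches(true, 0, "http%3A", 0, 7)
                || result.regionMatches(true, 0, "https%3A", 0, 8);
        if (encodedScheme) {
            try {
                // Caveat: URLDecoder follows form-encoding rules, so it
                // also turns '+' into a space.
                result = URLDecoder.decode(result, "UTF-8");
            } catch (UnsupportedEncodingException | IllegalArgumentException e) {
                // Malformed escapes: keep the undecoded string.
            }
        }
        return result;
    }
}
```

Under these assumptions, fixup("page.html?a=1&amp;b=2") yields "page.html?a=1&b=2", and a candidate with no escaped ampersand or encoded scheme is returned unchanged.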

isLikelyUriJavascriptContextLegacy

public static boolean isLikelyUriJavascriptContextLegacy(java.lang.CharSequence candidate)

isLikelyUriHtmlContextLegacy

public static boolean isLikelyUriHtmlContextLegacy(java.lang.CharSequence candidate)


Copyright © 2003-2011 Internet Archive. All Rights Reserved.