org.archive.util
Class UriUtils
java.lang.Object
org.archive.util.UriUtils
public class UriUtils
- extends java.lang.Object
URI-related utilities.
Primarily, a place to centralize and better document and test certain URI-related heuristics
that may be useful in many places.
The choice of when to consider a string likely enough to be a URI that we try crawling it
is, so far, based on rather arbitrary rules-of-thumb. We have not quantitatively tested
how often the strings that pass these tests yield meaningful (non-404, non-soft-404,
non-garbage) replies. We are willing to accept some level of mistaken requests, knowing
that their cost is usually negligible, if that allows us to discover meaningful content
that could not be discovered via other heuristics.
Our intuitive understanding so far is that strings that appear to have './' or '../'
relative-path prefixes, dot-extensions, or path-slashes are good candidates for trying as
URIs, even though with some JavaScript/HTML-VALUE-attributes, this yields a lot of false
positives.
We want to get strings like...
photo.jpg
/photos
/photos/
./photos
../../photos
photos/index.html
...but we will thus also sometimes try strings that were other kinds of variables/
parameters, like...
rectangle.x
11.2px
text/xml
width:6.33
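The rules of thumb above can be illustrated with a minimal sketch. This is not the implementation behind isLikelyUri(); the class name NaiveUriGuess, the method looksLikeUri, and both regular expressions are hypothetical and chosen only to mirror the three signals named in the text (relative-path prefix, path-slash, dot-extension).

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not the actual UriUtils logic.
final class NaiveUriGuess {
    // Assumed pattern: string opens with "./" or "../".
    private static final Pattern RELATIVE_PREFIX = Pattern.compile("^\\.\\.?/");
    // Assumed pattern: string ends with a dot followed by 1-5 alphanumerics.
    private static final Pattern DOT_EXTENSION =
            Pattern.compile("[^./\\s]\\.[A-Za-z0-9]{1,5}$");

    /** Returns true if the candidate shows any of the three signals. */
    static boolean looksLikeUri(CharSequence candidate) {
        String s = candidate.toString().trim();
        // Reject empty strings and anything containing whitespace.
        if (s.isEmpty() || s.chars().anyMatch(Character::isWhitespace)) {
            return false;
        }
        return RELATIVE_PREFIX.matcher(s).find()
                || s.indexOf('/') >= 0
                || DOT_EXTENSION.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "photo.jpg", "/photos", "../../photos", "photos/index.html",
            "rectangle.x", "11.2px", "text/xml", "width:6.33", "hello"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + looksLikeUri(sample));
        }
    }
}
```

Under these assumed rules, the wanted strings and the unwanted ones (rectangle.x, 11.2px, text/xml, width:6.33) are accepted alike, which is the false-positive problem the text describes.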
Until better rules, exception-blacklists or even site-sensitive dynamic adjustment of
heuristics (eg: on this site, guesses are yielding 200s, keep guessing; on this site,
guesses are all 404s, stop guessing) are developed, crawl operators should monitor their
crawls (and contact email) for cases where speculative crawling is generating many errors,
and use settings like ExtractorHTML's 'extract-javascript' and 'extract-value-attributes',
or disable ExtractorJS entirely, when they want to curtail those errors.
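The site-sensitive adjustment mentioned above does not exist in this class. The following is only a rough sketch of what such a mechanism might look like. The class SpeculativeGuessTracker, its methods, and both thresholds are hypothetical, and the sketch treats any 2xx status as a success, so it would not detect soft-404s.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; nothing like this is provided by UriUtils.
final class SpeculativeGuessTracker {
    // Assumed thresholds, chosen arbitrarily for illustration.
    private static final int MIN_SAMPLES = 20;
    private static final double MIN_SUCCESS_RATE = 0.1;

    // Per-host counters: index 0 = successes, index 1 = attempts.
    private final Map<String, int[]> statsByHost = new HashMap<>();

    /** Records the HTTP status returned for one speculative fetch. */
    void record(String host, int httpStatus) {
        int[] stats = statsByHost.computeIfAbsent(host, h -> new int[2]);
        if (httpStatus >= 200 && httpStatus < 300) {
            stats[0]++;
        }
        stats[1]++;
    }

    /** Keeps guessing until enough samples show a low success rate. */
    boolean shouldKeepGuessing(String host) {
        int[] stats = statsByHost.get(host);
        if (stats == null || stats[1] < MIN_SAMPLES) {
            return true; // not enough evidence yet
        }
        return (double) stats[0] / stats[1] >= MIN_SUCCESS_RATE;
    }
}
```

A caller would consult shouldKeepGuessing(host) before queueing a speculative URI and call record(host, status) when the response arrives. The sketch is not thread-safe; a real crawler would need concurrent counters.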
The 'legacy' tests are those used in H1 at least through 1.14.4. They have
some known problems, but are being kept until more experience
with the 'new' isLikelyUri() test is collected (in H3). Enable the 'xest'
methods of the UriUtilsTest class for details.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
NAIVE_LIKELY_URI_PATTERN
static final java.lang.String NAIVE_LIKELY_URI_PATTERN
- See Also:
- Constant Field Values
NAIVE_URI_EXCEPTIONS
protected static final java.lang.String[] NAIVE_URI_EXCEPTIONS
STRING_URI_DETECTOR
static final java.lang.String STRING_URI_DETECTOR
- See Also:
- Constant Field Values
STRING_URI_DETECTOR_EXCEPTIONS
protected static final java.lang.String[] STRING_URI_DETECTOR_EXCEPTIONS
LIKELY_URI_PATH
static final java.lang.String LIKELY_URI_PATH
- See Also:
- Constant Field Values
UriUtils
public UriUtils()
isLikelyUri
public static boolean isLikelyUri(java.lang.CharSequence candidate)
speculativeFixup
public static java.lang.String speculativeFixup(java.lang.String candidate,
UURI base)
- Perform additional fixup of likely-URI Strings
- Parameters:
candidate
- detected candidate String
- Returns:
- String changed/decoded to increase likelihood it is a
meaningful non-404 URI
isLikelyUriJavascriptContextLegacy
public static boolean isLikelyUriJavascriptContextLegacy(java.lang.CharSequence candidate)
isLikelyUriHtmlContextLegacy
public static boolean isLikelyUriHtmlContextLegacy(java.lang.CharSequence candidate)
Copyright © 2003-2011 Internet Archive. All Rights Reserved.