org.archive.util
Class SURT
java.lang.Object
org.archive.util.SURT
public class SURT
- extends java.lang.Object
Sort-friendly URI Reordering Transform.
Converts URIs of the form:
scheme://userinfo@domain.tld:port/path?query#fragment
...into...
scheme://(tld,domain,:port@userinfo)/path?query#fragment
The '(' ')' characters serve as an unambiguous notice that the so-called
'authority' portion of the URI ([userinfo@]host[:port] in http URIs) has
been transformed; the commas prevent confusion with regular hostnames.
This remedies the 'problem' with standard URIs that the host portion of a
regular URI, with its dotted-domains, is actually in reverse order from
the natural hierarchy that's usually helpful for grouping and sorting.
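The grouping benefit described above can be illustrated with a minimal sketch. It is not the library's code: the class SurtSortSketch and its helpers hostKey and sortByHierarchy are hypothetical names introduced here, and the sketch only reverses host labels (no scheme, port, or userinfo handling).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; not part of org.archive.util.
final class SurtSortSketch {

    // Reverse the dotted labels of a host and join them with commas,
    // e.g. "www.archive.org" becomes "org,archive,www,".
    static String hostKey(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        StringBuilder sb = new StringBuilder(host.length() + 1);
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]).append(',');
        }
        return sb.toString();
    }

    // Return a copy of the hosts sorted by their reversed-label key,
    // so that hosts under the same domain end up adjacent.
    static List<String> sortByHierarchy(List<String> hosts) {
        List<String> sorted = new ArrayList<>(hosts);
        sorted.sort(Comparator.comparing(SurtSortSketch::hostKey));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList(
            "www.archive.org", "example.com", "web.archive.org",
            "archive.org", "blog.example.com");
        // prints [example.com, blog.example.com, archive.org, web.archive.org, www.archive.org]
        System.out.println(sortByHierarchy(hosts));
    }
}
```

A plain lexical sort of the same hosts would place blog.example.com between archive.org and example.com, separating each domain from its subdomains.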
The value of respecting URI case variance is considered negligible: it
is vanishingly rare for case variance to be meaningful, whereas it often
arises from people's confusion or sloppiness, and they correct it only
insofar as necessary to avoid blatant problems. Thus the usual SURT form
is flattened to all lowercase and is therefore not completely
reversible.
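As a rough sketch of the transform described above (not the actual implementation of this class), the following splits a URI with a simplified regular expression and reassembles it in the documented form. The class SurtSketch, the method toSurt, and the SPLIT pattern are hypothetical; in particular, SPLIT is an assumption and is not the library's URI_SPLITTER. Returning unmatched input unchanged is likewise an assumption of the sketch.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of org.archive.util.
final class SurtSketch {

    // Simplified splitter (an assumption of this sketch):
    // 1 = scheme://, 2 = userinfo, 3 = host, 4 = :port, 5 = everything after the authority
    private static final Pattern SPLIT = Pattern.compile(
        "^(\\w+://)(?:([^@/]*)@)?([^:/?#]+)(:\\d+)?(.*)$");

    static String toSurt(String uri, boolean preserveCase) {
        Matcher m = SPLIT.matcher(uri);
        if (!m.matches()) {
            return uri; // assumption: input that does not parse is left unchanged
        }
        StringBuilder sb = new StringBuilder(uri.length() + 4);
        sb.append(m.group(1)).append('(');
        // Emit host labels in reverse order, each followed by a comma.
        String[] labels = m.group(3).split("\\.");
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]).append(',');
        }
        if (m.group(4) != null) {
            sb.append(m.group(4));             // ":port"
        }
        if (m.group(2) != null) {
            sb.append('@').append(m.group(2)); // "@userinfo"
        }
        sb.append(')').append(m.group(5));
        String out = sb.toString();
        return preserveCase ? out : out.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // prints http://(com,example,www,:8080@user)/path?q=1#frag
        System.out.println(toSurt("http://user@www.Example.com:8080/Path?q=1#frag", false));
    }
}
```

With preserveCase set to true, the same input keeps its original casing, so the result can be mapped back to the source URI; the default lowercase form cannot.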
- Author:
- gojomo
Constructor Summary
SURT()
Method Summary
static java.lang.String
fromURI(java.lang.String s)
Utility method for creating the SURT form of the URI in the
given String.
static java.lang.String
fromURI(java.lang.String s, boolean preserveCase)
Utility method for creating the SURT form of the URI in the
given String.
static void
main(java.lang.String[] args)
Allow class to be used as a command-line tool for converting
URL lists (or naked host or host/path fragments implied
to be HTTP URLs) to SURT form.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
DOT
static char DOT
BEGIN_TRANSFORMED_AUTHORITY
static java.lang.String BEGIN_TRANSFORMED_AUTHORITY
TRANSFORMED_HOST_DELIM
static java.lang.String TRANSFORMED_HOST_DELIM
END_TRANSFORMED_AUTHORITY
static java.lang.String END_TRANSFORMED_AUTHORITY
URI_SPLITTER
static java.lang.String URI_SPLITTER
Constructor Detail
SURT
public SURT()
Method Detail
fromURI
public static java.lang.String fromURI(java.lang.String s)
- Utility method for creating the SURT form of the URI in the
given String.
By default, does not preserve casing.
- Parameters:
s
- String URI to be converted to SURT form
- Returns:
- SURT form
fromURI
public static java.lang.String fromURI(java.lang.String s,
boolean preserveCase)
- Utility method for creating the SURT form of the URI in the
given String.
If it appears a bit convoluted in its approach, note that it was
optimized to minimize object-creation after allocation-sites profiling
indicated this method was a top source of garbage in long-running crawls.
Assumes that the String URI has already been cleaned/fixed (e.g.,
by UURI fixup) in ways that put it in its crawlable form for
evaluation.
- Parameters:
s
- String URI to be converted to SURT form
preserveCase
- whether original case should be preserved
- Returns:
- SURT form
main
public static void main(java.lang.String[] args)
throws java.io.IOException
- Allow class to be used as a command-line tool for converting
URL lists (or naked host or host/path fragments implied
to be HTTP URLs) to SURT form. Lines that cannot be converted
are returned unchanged.
Reads from stdin or from the file named by the first argument. Writes
to stdout or to the file named by the second argument.
- Parameters:
args
- cmd-line arguments
- Throws:
java.io.IOException
Copyright © 2003-2011 Internet Archive. All Rights Reserved.