org.archive.util
Class SURT

java.lang.Object
  extended by org.archive.util.SURT

public class SURT
extends java.lang.Object

Sort-friendly URI Reordering Transform. Converts URIs of the form:

  scheme://userinfo@domain.tld:port/path?query#fragment

...into...

  scheme://(tld,domain,:port@userinfo)/path?query#fragment

The '(' and ')' characters serve as an unambiguous notice that the so-called 'authority' portion of the URI ([userinfo@]host[:port] in http URIs) has been transformed; the commas prevent confusion with regular hostnames.

This remedies the 'problem' with standard URIs that the host portion of a regular URI, with its dotted-domains, is actually in reverse order from the natural hierarchy that's usually helpful for grouping and sorting.

The value of respecting URI case variance is considered negligible: it is vanishingly rare for case-variance to be meaningful, while URI case-variance often arises from people's confusion or sloppiness, and they only correct it insofar as necessary to avoid blatant problems. Thus the usual SURT form is considered to be flattened to all lowercase, and not completely reversible.
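The reordering described above can be illustrated with a minimal sketch. This is not the code of this class: SurtSketch and toSurt are hypothetical names, the parsing is deliberately naive (it assumes a scheme://authority prefix and does not handle cases such as bracketed IPv6 hosts), and it makes no attempt at the allocation optimizations the real method is said to use.

```java
import java.util.Locale;

public class SurtSketch {

    /**
     * Hypothetical illustration of the documented transform; NOT the
     * org.archive.util.SURT implementation.
     */
    public static String toSurt(String uri, boolean preserveCase) {
        int schemeEnd = uri.indexOf("://");
        if (schemeEnd < 0) {
            return uri; // assumption: no authority found, leave input untouched
        }
        int authStart = schemeEnd + 3;
        // The authority ends at the first '/', '?' or '#', or at end of input.
        int authEnd = authStart;
        while (authEnd < uri.length() && "/?#".indexOf(uri.charAt(authEnd)) < 0) {
            authEnd++;
        }
        String host = uri.substring(authStart, authEnd);

        // Split off the optional userinfo ("user@").
        String userinfo = null;
        int at = host.lastIndexOf('@');
        if (at >= 0) {
            userinfo = host.substring(0, at);
            host = host.substring(at + 1);
        }
        // Split off the optional port, keeping its leading ':'.
        String port = null;
        int colon = host.lastIndexOf(':');
        if (colon >= 0) {
            port = host.substring(colon);
            host = host.substring(0, colon);
        }

        StringBuilder sb = new StringBuilder(uri.length() + 4);
        sb.append(uri, 0, authStart).append('(');
        // Emit host labels in reverse order, each followed by a comma.
        String[] labels = host.split("\\.");
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]).append(',');
        }
        if (port != null) {
            sb.append(port);
        }
        if (userinfo != null) {
            sb.append('@').append(userinfo);
        }
        sb.append(')').append(uri, authEnd, uri.length());

        String result = sb.toString();
        return preserveCase ? result : result.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toSurt("http://user@www.Example.com:8080/Path?q=1#frag", false));
    }
}
```

Run as written, the sketch prints http://(com,example,www,:8080@user)/path?q=1#frag. Passing true as the second argument keeps the original case, giving http://(com,Example,www,:8080@user)/Path?q=1#frag.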

Author:
gojomo

Field Summary
(package private) static java.lang.String BEGIN_TRANSFORMED_AUTHORITY
           
(package private) static char DOT
           
(package private) static java.lang.String END_TRANSFORMED_AUTHORITY
           
(package private) static java.lang.String TRANSFORMED_HOST_DELIM
           
(package private) static java.lang.String URI_SPLITTER
           
 
Constructor Summary
SURT()
           
 
Method Summary
static java.lang.String fromURI(java.lang.String s)
          Utility method for creating the SURT form of the URI in the given String.
static java.lang.String fromURI(java.lang.String s, boolean preserveCase)
          Utility method for creating the SURT form of the URI in the given String, optionally preserving the original case.
static void main(java.lang.String[] args)
          Allow class to be used as a command-line tool for converting URL lists (or naked host or host/path fragments implied to be HTTP URLs) to SURT form.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DOT

static char DOT

BEGIN_TRANSFORMED_AUTHORITY

static java.lang.String BEGIN_TRANSFORMED_AUTHORITY

TRANSFORMED_HOST_DELIM

static java.lang.String TRANSFORMED_HOST_DELIM

END_TRANSFORMED_AUTHORITY

static java.lang.String END_TRANSFORMED_AUTHORITY

URI_SPLITTER

static java.lang.String URI_SPLITTER
Constructor Detail

SURT

public SURT()
Method Detail

fromURI

public static java.lang.String fromURI(java.lang.String s)
Utility method for creating the SURT form of the URI in the given String. By default, does not preserve casing.

Parameters:
s - String URI to be converted to SURT form
Returns:
SURT form

fromURI

public static java.lang.String fromURI(java.lang.String s,
                                       boolean preserveCase)
Utility method for creating the SURT form of the URI in the given String. If its approach appears a bit convoluted, note that it was optimized to minimize object creation after allocation-site profiling indicated this method was a top source of garbage in long-running crawls. Assumes that the String URI has already been cleaned/fixed (e.g., by UURI fixup) in ways that put it in its crawlable form for evaluation.

Parameters:
s - String URI to be converted to SURT form
preserveCase - whether original case should be preserved
Returns:
SURT form

main

public static void main(java.lang.String[] args)
                 throws java.io.IOException
Allows the class to be used as a command-line tool for converting URL lists (or naked host or host/path fragments implied to be HTTP URLs) to SURT form. Lines that cannot be converted are returned unchanged. Reads from stdin, or from the file named by the first argument if given. Writes to stdout, or to the file named by the second argument if given.

Parameters:
args - cmd-line arguments
Throws:
java.io.IOException
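
The line-by-line behaviour described for main can be sketched as follows. This is an illustration only: SurtFilterSketch and convertLines are hypothetical names, the converter is passed in as a parameter so the example runs without this library, and treating a thrown RuntimeException as "cannot be converted" is an assumption rather than documented behaviour of this class.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.UnaryOperator;

public class SurtFilterSketch {

    /**
     * Hypothetical helper: applies the converter to every input line and
     * writes the result; a line whose conversion fails is written unchanged.
     */
    public static void convertLines(BufferedReader in, PrintWriter out,
                                    UnaryOperator<String> converter) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            String result;
            try {
                result = converter.apply(line);
            } catch (RuntimeException e) {
                result = line; // assumption: failure means "leave the line as-is"
            }
            out.println(result);
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // First argument (if any) names the input file; otherwise read stdin.
        // Second argument (if any) names the output file; otherwise write stdout.
        try (BufferedReader in = args.length > 0
                     ? Files.newBufferedReader(Paths.get(args[0]))
                     : new BufferedReader(new InputStreamReader(System.in));
             PrintWriter out = args.length > 1
                     ? new PrintWriter(Files.newBufferedWriter(Paths.get(args[1])))
                     : new PrintWriter(System.out)) {
            // Placeholder converter; with this library on the classpath one
            // would presumably pass a reference to SURT.fromURI instead.
            convertLines(in, out, UnaryOperator.identity());
        }
    }
}
```

Keeping the converter as a parameter separates the I/O handling from the transform itself, which makes the fallback-to-original-line behaviour easy to exercise in isolation.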


Copyright © 2003-2011 Internet Archive. All Rights Reserved.