org.archive.net
Class UURIFactory

java.lang.Object
  extended by org.apache.commons.httpclient.URI
      extended by org.archive.net.UURIFactory
All Implemented Interfaces:
java.io.Serializable, java.lang.Cloneable, java.lang.Comparable

public class UURIFactory
extends org.apache.commons.httpclient.URI

Factory that returns UURIs. It escapes and fixes up URIs in accordance with RFC 2396 and to match browser practice. For example, it removes any '..' that comes first in the path (as IE does), converts backslashes to forward slashes, and discards any fragment (anchor) portion of the URI. This class also fails URIs that are longer than IE's allowed maximum length.
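As a rough illustration of two of the fixups described above (backslash conversion and fragment removal), here is a minimal JDK-only sketch. The class and method names are hypothetical; this is not the factory's actual implementation, which also performs escaping and validation.

```java
class FixupSketch {
    /** Hypothetical helper: converts backslashes to slashes and drops any fragment. */
    static String fixup(String uri) {
        // Browsers treat '\' like '/', so normalize it first.
        String s = uri.replace('\\', '/');
        // Everything from the first '#' onward is the fragment; discard it.
        int hash = s.indexOf('#');
        return hash >= 0 ? s.substring(0, hash) : s;
    }

    public static void main(String[] args) {
        System.out.println(fixup("http://example.com\\dir\\page.html#top"));
    }
}
```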

TODO: Test logging.

Author:
stack
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.commons.httpclient.URI
org.apache.commons.httpclient.URI.DefaultCharsetChanged, org.apache.commons.httpclient.URI.LocaleToCharsetMap
 
Field Summary
(package private) static java.lang.String ACCEPTABLE_ASCII_DOMAIN
          Characters we'll accept in the domain label part of a URI authority: ASCII letters-digits-hyphen (LDH) plus underscore, with single intervening '.' characters.
static java.lang.String APOSTROPH
           
static java.lang.String BACKSLASH
           
static java.lang.String BACKSLASH_PATTERN
           
static java.lang.String CIRCUMFLEX
           
static java.lang.String CIRCUMFLEX_PATTERN
           
static char COLON
           
static java.lang.String COMMERCIAL_AT
           
static java.lang.String DOT
           
static java.lang.String EMPTY_STRING
           
static java.lang.String ESCAPED_APOSTROPH
           
static java.lang.String ESCAPED_BACKSLASH
           
static java.lang.String ESCAPED_CIRCUMFLEX
           
static java.lang.String ESCAPED_LCURBRACKET
           
static java.lang.String ESCAPED_LSQRBRACKET
           
static java.lang.String ESCAPED_PIPE
           
static java.lang.String ESCAPED_QUOT
           
static java.lang.String ESCAPED_RCURBRACKET
           
static java.lang.String ESCAPED_RSQRBRACKET
           
static java.lang.String ESCAPED_SPACE
           
static java.lang.String ESCAPED_SQUOT
           
static java.lang.String HTTP
           
static java.lang.String HTTP_PORT
           
(package private) static java.util.regex.Pattern HTTP_SCHEME_SLASHES
          Pattern that looks for case of three or more slashes after the scheme.
static java.lang.String HTTPS
           
static java.lang.String HTTPS_PORT
           
static int IGNORED_SCHEME
           
static java.lang.String IMPROPERESC
           
static java.lang.String IMPROPERESC_REPLACE
           
static java.lang.String LCURBRACKET
           
static java.lang.String LCURBRACKET_PATTERN
           
static java.lang.String LSQRBRACKET
           
static java.lang.String LSQRBRACKET_PATTERN
           
(package private) static java.util.regex.Pattern MULTIPLE_SLASHES
          Pattern that looks for case of two or more slashes in a path.
static java.lang.String NBSP
           
static char PERCENT_SIGN
           
static java.lang.String PIPE
           
static java.lang.String PIPE_PATTERN
           
(package private) static java.util.regex.Pattern PORTREGEX
          Authority port number regex.
static java.lang.String QUOT
           
static java.lang.String RCURBRACKET
           
static java.lang.String RCURBRACKET_PATTERN
           
(package private) static java.util.regex.Pattern RFC2396REGEX
          RFC 2396-inspired regex.
static java.lang.String RSQRBRACKET
           
static java.lang.String RSQRBRACKET_PATTERN
           
static java.lang.String SLASH
           
static java.lang.String SLASHDOTDOTSLASH
           
static java.lang.String SPACE
           
static java.lang.String SQUOT
           
static java.lang.String STRAY_SPACING
           
static java.lang.String TRAILING_ESCAPED_SPACE
           
static java.lang.String URI_HEX_ENCODING
          First percent sign in string followed by two hex chars.
 
Fields inherited from class org.apache.commons.httpclient.URI
_authority, _fragment, _host, _is_abs_path, _is_hier_part, _is_hostname, _is_IPv4address, _is_IPv6reference, _is_net_path, _is_opaque_part, _is_reg_name, _is_rel_path, _is_server, _opaque, _path, _port, _query, _scheme, _uri, _userinfo, abs_path, absoluteURI, allowed_abs_path, allowed_authority, allowed_fragment, allowed_host, allowed_IPv6reference, allowed_opaque_part, allowed_query, allowed_reg_name, allowed_rel_path, allowed_userinfo, allowed_within_authority, allowed_within_path, allowed_within_query, allowed_within_userinfo, alpha, alphanum, authority, control, defaultDocumentCharset, defaultDocumentCharsetByLocale, defaultDocumentCharsetByPlatform, defaultProtocolCharset, delims, digit, disallowed_opaque_part, disallowed_rel_path, domainlabel, escaped, fragment, hash, hex, hier_part, host, hostname, hostport, IPv4address, IPv6address, IPv6reference, mark, net_path, opaque_part, param, path, path_segments, pchar, percent, port, protocolCharset, query, reg_name, rel_path, rel_segment, relativeURI, reserved, rootPath, scheme, segment, server, space, toplabel, unreserved, unwise, URI_reference, uric, uric_no_slash, userinfo, within_userinfo
 
Method Summary
protected  void checkHttpSchemeSpecificPartSlashPrefix(org.apache.commons.httpclient.URI base, java.lang.String scheme, java.lang.String schemeSpecificPart)
          If http(s) scheme, check scheme specific part begins '//'.
protected  java.lang.String escapeWhitespace(java.lang.String uri)
          Escape any whitespace found.
static UURI getInstance(java.lang.String uri)
           
static UURI getInstance(java.lang.String uri, java.lang.String charset)
           
static UURI getInstance(UURI base, java.lang.String relative)
           
static boolean hasSupportedScheme(java.lang.String possibleUrl)
          Test of whether passed String has an allowed URI scheme.
protected  UURI validityCheck(UURI uuri)
          Check the generated UURI.
 
Methods inherited from class org.apache.commons.httpclient.URI
clone, compareTo, decode, decode, encode, equals, equals, getAboveHierPath, getAuthority, getCurrentHierPath, getDefaultDocumentCharset, getDefaultDocumentCharsetByLocale, getDefaultDocumentCharsetByPlatform, getDefaultProtocolCharset, getEscapedAboveHierPath, getEscapedAuthority, getEscapedCurrentHierPath, getEscapedFragment, getEscapedName, getEscapedPath, getEscapedPathQuery, getEscapedQuery, getEscapedURI, getEscapedURIReference, getEscapedUserinfo, getFragment, getHost, getName, getPath, getPathQuery, getPort, getProtocolCharset, getQuery, getRawAboveHierPath, getRawAuthority, getRawCurrentHierPath, getRawCurrentHierPath, getRawFragment, getRawHost, getRawName, getRawPath, getRawPathQuery, getRawQuery, getRawScheme, getRawURI, getRawURIReference, getRawUserinfo, getScheme, getURI, getURIReference, getUserinfo, hasAuthority, hasFragment, hashCode, hasQuery, hasUserinfo, indexFirstOf, indexFirstOf, indexFirstOf, indexFirstOf, isAbsoluteURI, isAbsPath, isHierPart, isHostname, isIPv4address, isIPv6reference, isNetPath, isOpaquePart, isRegName, isRelativeURI, isRelPath, isServer, normalize, normalize, parseAuthority, parseUriReference, prevalidate, removeFragmentIdentifier, resolvePath, setDefaultDocumentCharset, setDefaultProtocolCharset, setEscapedAuthority, setEscapedFragment, setEscapedPath, setEscapedQuery, setFragment, setPath, setQuery, setRawAuthority, setRawFragment, setRawPath, setRawQuery, setURI, toString, validate, validate
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

RFC2396REGEX

static final java.util.regex.Pattern RFC2396REGEX
RFC 2396-inspired regex. From the RFC Appendix B:
 URI Generic Syntax                August 1998

 B. Parsing a URI Reference with a Regular Expression

 As described in Section 4.3, the generic URI syntax is not sufficient
 to disambiguate the components of some forms of URI.  Since the
 "greedy algorithm" described in that section is identical to the
 disambiguation method used by POSIX regular expressions, it is
 natural and commonplace to use a regular expression for parsing the
 potential four components and fragment identifier of a URI reference.

 The following line is the regular expression for breaking-down a URI
 reference into its components.

 ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9

 The numbers in the second line above are only to assist readability;
 they indicate the reference points for each subexpression (i.e., each
 paired parenthesis).  We refer to the value matched for subexpression
 <n> as $<n>.  For example, matching the above expression to

 http://www.ics.uci.edu/pub/ietf/uri/#Related

 results in the following subexpression matches:

 $1 = http:
 $2 = http
 $3 = //www.ics.uci.edu
 $4 = www.ics.uci.edu
 $5 = /pub/ietf/uri/
 $6 = <undefined>
 $7 = <undefined>
 $8 = #Related
 $9 = Related

 where <undefined> indicates that the component is not present, as is
 the case for the query component in the above example.  Therefore, we
 can determine the value of the four components and fragment as

 scheme    = $2
 authority = $4
 path      = $5
 query     = $7
 fragment  = $9
 
--

The pattern used here differs from the RFC regex in that: (1) it has Java escaping of regex characters; (2) it allows a URI made of a fragment only (an extra group was added, so group indexing is off by one after the scheme); (3) the scheme is limited to legal scheme characters.
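To make the group numbering concrete, the following sketch applies the unmodified Appendix B expression (not the adjusted RFC2396REGEX field) to the RFC's own example, using only java.util.regex. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Rfc2396AppendixBSketch {
    // The expression from RFC 2396 Appendix B, with Java string escaping of '\?'.
    static final Pattern APPENDIX_B = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    /** Returns {scheme, authority, path, query, fragment}; absent components are null. */
    static String[] split(String uriReference) {
        Matcher m = APPENDIX_B.matcher(uriReference);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a URI reference: " + uriReference);
        }
        // Groups 2, 4, 5, 7 and 9 carry the five components, as in the RFC text.
        return new String[] { m.group(2), m.group(4), m.group(5), m.group(7), m.group(9) };
    }

    public static void main(String[] args) {
        String[] p = split("http://www.ics.uci.edu/pub/ietf/uri/#Related");
        System.out.println("scheme    = " + p[0]);
        System.out.println("authority = " + p[1]);
        System.out.println("path      = " + p[2]);
        System.out.println("query     = " + p[3]);
        System.out.println("fragment  = " + p[4]);
    }
}
```

A component that is not present comes back as null from Matcher.group, which corresponds to <undefined> in the RFC's notation.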


SLASHDOTDOTSLASH

public static final java.lang.String SLASHDOTDOTSLASH
See Also:
Constant Field Values

SLASH

public static final java.lang.String SLASH
See Also:
Constant Field Values

HTTP

public static final java.lang.String HTTP
See Also:
Constant Field Values

HTTP_PORT

public static final java.lang.String HTTP_PORT
See Also:
Constant Field Values

HTTPS

public static final java.lang.String HTTPS
See Also:
Constant Field Values

HTTPS_PORT

public static final java.lang.String HTTPS_PORT
See Also:
Constant Field Values

DOT

public static final java.lang.String DOT
See Also:
Constant Field Values

EMPTY_STRING

public static final java.lang.String EMPTY_STRING
See Also:
Constant Field Values

NBSP

public static final java.lang.String NBSP
See Also:
Constant Field Values

SPACE

public static final java.lang.String SPACE
See Also:
Constant Field Values

ESCAPED_SPACE

public static final java.lang.String ESCAPED_SPACE
See Also:
Constant Field Values

TRAILING_ESCAPED_SPACE

public static final java.lang.String TRAILING_ESCAPED_SPACE
See Also:
Constant Field Values

PIPE

public static final java.lang.String PIPE
See Also:
Constant Field Values

PIPE_PATTERN

public static final java.lang.String PIPE_PATTERN
See Also:
Constant Field Values

ESCAPED_PIPE

public static final java.lang.String ESCAPED_PIPE
See Also:
Constant Field Values

CIRCUMFLEX

public static final java.lang.String CIRCUMFLEX
See Also:
Constant Field Values

CIRCUMFLEX_PATTERN

public static final java.lang.String CIRCUMFLEX_PATTERN
See Also:
Constant Field Values

ESCAPED_CIRCUMFLEX

public static final java.lang.String ESCAPED_CIRCUMFLEX
See Also:
Constant Field Values

QUOT

public static final java.lang.String QUOT
See Also:
Constant Field Values

ESCAPED_QUOT

public static final java.lang.String ESCAPED_QUOT
See Also:
Constant Field Values

SQUOT

public static final java.lang.String SQUOT
See Also:
Constant Field Values

ESCAPED_SQUOT

public static final java.lang.String ESCAPED_SQUOT
See Also:
Constant Field Values

APOSTROPH

public static final java.lang.String APOSTROPH
See Also:
Constant Field Values

ESCAPED_APOSTROPH

public static final java.lang.String ESCAPED_APOSTROPH
See Also:
Constant Field Values

LSQRBRACKET

public static final java.lang.String LSQRBRACKET
See Also:
Constant Field Values

LSQRBRACKET_PATTERN

public static final java.lang.String LSQRBRACKET_PATTERN
See Also:
Constant Field Values

ESCAPED_LSQRBRACKET

public static final java.lang.String ESCAPED_LSQRBRACKET
See Also:
Constant Field Values

RSQRBRACKET

public static final java.lang.String RSQRBRACKET
See Also:
Constant Field Values

RSQRBRACKET_PATTERN

public static final java.lang.String RSQRBRACKET_PATTERN
See Also:
Constant Field Values

ESCAPED_RSQRBRACKET

public static final java.lang.String ESCAPED_RSQRBRACKET
See Also:
Constant Field Values

LCURBRACKET

public static final java.lang.String LCURBRACKET
See Also:
Constant Field Values

LCURBRACKET_PATTERN

public static final java.lang.String LCURBRACKET_PATTERN
See Also:
Constant Field Values

ESCAPED_LCURBRACKET

public static final java.lang.String ESCAPED_LCURBRACKET
See Also:
Constant Field Values

RCURBRACKET

public static final java.lang.String RCURBRACKET
See Also:
Constant Field Values

RCURBRACKET_PATTERN

public static final java.lang.String RCURBRACKET_PATTERN
See Also:
Constant Field Values

ESCAPED_RCURBRACKET

public static final java.lang.String ESCAPED_RCURBRACKET
See Also:
Constant Field Values

BACKSLASH

public static final java.lang.String BACKSLASH
See Also:
Constant Field Values

BACKSLASH_PATTERN

public static final java.lang.String BACKSLASH_PATTERN
See Also:
Constant Field Values

ESCAPED_BACKSLASH

public static final java.lang.String ESCAPED_BACKSLASH
See Also:
Constant Field Values

STRAY_SPACING

public static final java.lang.String STRAY_SPACING
See Also:
Constant Field Values

IMPROPERESC_REPLACE

public static final java.lang.String IMPROPERESC_REPLACE
See Also:
Constant Field Values

IMPROPERESC

public static final java.lang.String IMPROPERESC
See Also:
Constant Field Values

COMMERCIAL_AT

public static final java.lang.String COMMERCIAL_AT
See Also:
Constant Field Values

PERCENT_SIGN

public static final char PERCENT_SIGN
See Also:
Constant Field Values

COLON

public static final char COLON
See Also:
Constant Field Values

URI_HEX_ENCODING

public static final java.lang.String URI_HEX_ENCODING
First percent sign in string followed by two hex chars.
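One way to express "the first percent sign is followed by two hex characters" as a regular expression is sketched below. The pattern string is an assumption written for illustration and is not necessarily the value of this constant; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class HexEncodingSketch {
    // [^%]* cannot skip over a '%', so the '%' matched here is the first one in the string.
    static final Pattern FIRST_PERCENT_IS_ESCAPE =
        Pattern.compile("^[^%]*%\\p{XDigit}{2}.*", Pattern.DOTALL);

    /** True if the first '%' in the string starts a %XX escape. */
    static boolean looksEscaped(String uri) {
        return FIRST_PERCENT_IS_ESCAPE.matcher(uri).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksEscaped("http://example.com/a%20b"));
        System.out.println(looksEscaped("http://example.com/100%"));
    }
}
```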

See Also:
Constant Field Values

PORTREGEX

static final java.util.regex.Pattern PORTREGEX
Authority port number regex.


ACCEPTABLE_ASCII_DOMAIN

static final java.lang.String ACCEPTABLE_ASCII_DOMAIN
Characters we'll accept in the domain label part of a URI authority: ASCII letters-digits-hyphen (LDH) plus underscore, with single intervening '.' characters. (We accept '_' because DNS servers have tolerated it for many years, counter to spec; we also accept dash patterns and ACE prefixes that will be rejected by an IDN-punycoding attempt.)
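The rule described above can be approximated with a regular expression like the one in this sketch. The pattern is an assumption written from the description, not the actual value of ACCEPTABLE_ASCII_DOMAIN, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class AsciiDomainSketch {
    // One or more labels of letters, digits, hyphen or underscore,
    // separated by single '.' characters (assumed reading of the description).
    static final Pattern LDH_PLUS_UNDERSCORE =
        Pattern.compile("^[A-Za-z0-9_-]+(?:\\.[A-Za-z0-9_-]+)*$");

    static boolean isAcceptable(String domain) {
        return LDH_PLUS_UNDERSCORE.matcher(domain).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("my_host.example.com"));
        System.out.println(isAcceptable("a..b"));
    }
}
```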

See Also:
Constant Field Values

HTTP_SCHEME_SLASHES

static final java.util.regex.Pattern HTTP_SCHEME_SLASHES
Pattern that looks for the case of three or more slashes after the scheme. If found, we replace them with two only, as Mozilla does.
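A minimal sketch of this kind of rewrite follows. The pattern and the helper are assumptions made for illustration; they are not guaranteed to match the HTTP_SCHEME_SLASHES field or the way the factory applies it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SchemeSlashesSketch {
    // Group 1: scheme plus the two legitimate slashes; '/+' swallows the extras;
    // group 2: the rest of the URI. (Assumed pattern, case-insensitive.)
    static final Pattern EXTRA_SLASHES =
        Pattern.compile("^(https?://)/+(.*)", Pattern.CASE_INSENSITIVE);

    /** Collapses three or more slashes after an http(s) scheme down to two. */
    static String collapse(String uri) {
        Matcher m = EXTRA_SLASHES.matcher(uri);
        return m.matches() ? m.group(1) + m.group(2) : uri;
    }

    public static void main(String[] args) {
        System.out.println(collapse("http:////example.com/a"));
    }
}
```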


MULTIPLE_SLASHES

static final java.util.regex.Pattern MULTIPLE_SLASHES
Pattern that looks for case of two or more slashes in a path.


IGNORED_SCHEME

public static final int IGNORED_SCHEME
See Also:
Constant Field Values
Method Detail

getInstance

public static UURI getInstance(java.lang.String uri)
                        throws org.apache.commons.httpclient.URIException
Parameters:
uri - URI as string.
Returns:
An instance of UURI
Throws:
org.apache.commons.httpclient.URIException

getInstance

public static UURI getInstance(java.lang.String uri,
                               java.lang.String charset)
                        throws org.apache.commons.httpclient.URIException
Parameters:
uri - URI as string.
charset - Character encoding of the passed uri string.
Returns:
An instance of UURI
Throws:
org.apache.commons.httpclient.URIException

getInstance

public static UURI getInstance(UURI base,
                               java.lang.String relative)
                        throws org.apache.commons.httpclient.URIException
Parameters:
base - Base uri to use when resolving the passed relative uri.
relative - URI as string.
Returns:
An instance of UURI
Throws:
org.apache.commons.httpclient.URIException

hasSupportedScheme

public static boolean hasSupportedScheme(java.lang.String possibleUrl)
Tests whether the passed String has an allowed URI scheme. First tests whether a likely scheme is present. If so, we then test whether it is one of the supported schemes.

Parameters:
possibleUrl - URL string to examine.
Returns:
True if the passed string looks like it could be a URL.
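The two-step test described for hasSupportedScheme might look roughly like the sketch below. The set of supported schemes shown is purely illustrative (the real list is not given on this page), and the class is a hypothetical stand-in rather than the factory's code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class SupportedSchemeSketch {
    // Illustrative only: the actual supported schemes are not listed in this document.
    static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList("http", "https"));

    static boolean hasSupportedScheme(String possibleUrl) {
        if (possibleUrl == null) {
            return false;
        }
        // Step 1: is there something before a ':' that could be a scheme?
        int colon = possibleUrl.indexOf(':');
        if (colon <= 0) {
            return false;
        }
        // Step 2: is that candidate one of the supported schemes?
        String scheme = possibleUrl.substring(0, colon).toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(scheme);
    }

    public static void main(String[] args) {
        System.out.println(hasSupportedScheme("http://example.com/"));
        System.out.println(hasSupportedScheme("mailto:someone@example.com"));
    }
}
```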

validityCheck

protected UURI validityCheck(UURI uuri)
                      throws org.apache.commons.httpclient.URIException
Check the generated UURI. At the least, look at the length of the uuri string. We were seeing cases where the string was < MAX_URL_LENGTH before escaping but > MAX_URL_LENGTH after. Letting out a too-big message was causing us trouble later in the processing chain.

Parameters:
uuri - Created uuri to check.
Returns:
The passed uuri, so that this check can easily be inlined.
Throws:
org.apache.commons.httpclient.URIException
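The growth-on-escaping problem that validityCheck guards against can be reproduced with a small sketch. It works on plain strings, throws IllegalArgumentException in place of URIException, and assumes a limit of 2083 characters (a figure commonly cited for IE); the actual MAX_URL_LENGTH value is not stated on this page. All names are hypothetical.

```java
class LengthCheckSketch {
    // Assumed value for illustration; see the lead-in.
    static final int MAX_URL_LENGTH = 2083;

    /** Returns the string unchanged so the check can be inlined; throws if too long. */
    static String validityCheck(String escapedUri) {
        if (escapedUri.length() > MAX_URL_LENGTH) {
            throw new IllegalArgumentException(
                "Created (escaped) URI > " + MAX_URL_LENGTH + " characters");
        }
        return escapedUri;
    }

    /** Small helper so the sketch does not depend on String.repeat (Java 11+). */
    static String repeat(String s, int n) {
        StringBuilder sb = new StringBuilder(s.length() * n);
        for (int i = 0; i < n; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "http://example.com/" + repeat("a ", 900);
        String escaped = raw.replace(" ", "%20"); // each space grows to three characters
        System.out.println("raw length:     " + raw.length());
        System.out.println("escaped length: " + escaped.length());
        validityCheck(raw); // under the limit before escaping
        try {
            validityCheck(escaped);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected after escaping: " + e.getMessage());
        }
    }
}
```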

checkHttpSchemeSpecificPartSlashPrefix

protected void checkHttpSchemeSpecificPartSlashPrefix(org.apache.commons.httpclient.URI base,
                                                      java.lang.String scheme,
                                                      java.lang.String schemeSpecificPart)
                                               throws org.apache.commons.httpclient.URIException
If http(s) scheme, check scheme specific part begins '//'.

Throws:
org.apache.commons.httpclient.URIException
See Also:
Section 3.1. Common Internet Scheme Syntax

escapeWhitespace

protected java.lang.String escapeWhitespace(java.lang.String uri)
Escape any whitespace found. The parent class takes care of the bulk of escaping, but if any instance of escaping is found in the URI, we ask the parent to do NO escaping. Here we escape any whitespace found, irrespective of whether the uri has already been escaped. We do this for the case where the uri has been judged already-escaped but the escaping was done incompletely and whitespace remains. Spaces, etc., in the URI are a real pain: their presence will break log file and ARC parsing.

Parameters:
uri - URI string to check.
Returns:
uri with spaces escaped if any found.
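A minimal sketch of whitespace-only escaping follows. It handles ASCII whitespace only and leaves existing %XX escapes untouched; it is an illustration of the idea under those assumptions, not the method's actual body, and the class name is hypothetical.

```java
class WhitespaceEscapeSketch {
    /** Percent-encodes ASCII whitespace; every other character is copied as-is. */
    static String escapeWhitespace(String uri) {
        StringBuilder sb = new StringBuilder(uri.length());
        for (int i = 0; i < uri.length(); i++) {
            char c = uri.charAt(i);
            if (c <= ' ' && Character.isWhitespace(c)) {
                // Two-digit uppercase hex, e.g. ' ' becomes %20 and '\t' becomes %09.
                sb.append(String.format("%%%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The existing %20 is left alone; only the raw space and tab are escaped.
        System.out.println(escapeWhitespace("http://example.com/a b%20c\td"));
    }
}
```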


Copyright © 2003-2011 Internet Archive. All Rights Reserved.