org.archive.net
Class LaxURI

java.lang.Object
  extended by org.apache.commons.httpclient.URI
      extended by org.archive.net.LaxURI
All Implemented Interfaces:
java.io.Serializable, java.lang.Cloneable, java.lang.Comparable
Direct Known Subclasses:
UURI

public class LaxURI
extends org.apache.commons.httpclient.URI

URI subclass which allows partial/inconsistent encoding, matching the URIs which will be relayed in requests from popular web browsers (esp. Mozilla Firefox and MS IE).

Author:
gojomo
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.commons.httpclient.URI
org.apache.commons.httpclient.URI.DefaultCharsetChanged, org.apache.commons.httpclient.URI.LocaleToCharsetMap
 
Field Summary
protected static char[] HTTP_SCHEME
           
protected static char[] HTTPS_SCHEME
           
protected static java.util.BitSet lax_abs_path
           
protected static java.util.BitSet lax_query
           
protected static java.util.BitSet lax_rel_segment
           
 
Fields inherited from class org.apache.commons.httpclient.URI
_authority, _fragment, _host, _is_abs_path, _is_hier_part, _is_hostname, _is_IPv4address, _is_IPv6reference, _is_net_path, _is_opaque_part, _is_reg_name, _is_rel_path, _is_server, _opaque, _path, _port, _query, _scheme, _uri, _userinfo, abs_path, absoluteURI, allowed_abs_path, allowed_authority, allowed_fragment, allowed_host, allowed_IPv6reference, allowed_opaque_part, allowed_query, allowed_reg_name, allowed_rel_path, allowed_userinfo, allowed_within_authority, allowed_within_path, allowed_within_query, allowed_within_userinfo, alpha, alphanum, authority, control, defaultDocumentCharset, defaultDocumentCharsetByLocale, defaultDocumentCharsetByPlatform, defaultProtocolCharset, delims, digit, disallowed_opaque_part, disallowed_rel_path, domainlabel, escaped, fragment, hash, hex, hier_part, host, hostname, hostport, IPv4address, IPv6address, IPv6reference, mark, net_path, opaque_part, param, path, path_segments, pchar, percent, port, protocolCharset, query, reg_name, rel_path, rel_segment, relativeURI, reserved, rootPath, scheme, segment, server, space, toplabel, unreserved, unwise, URI_reference, uric, uric_no_slash, userinfo, within_userinfo
 
Constructor Summary
LaxURI()
           
LaxURI(java.lang.String uri, boolean escaped)
           
LaxURI(java.lang.String uri, boolean escaped, java.lang.String charset)
           
LaxURI(org.apache.commons.httpclient.URI base, org.apache.commons.httpclient.URI relative)
           
 
Method Summary
protected static java.lang.String decode(char[] component, java.lang.String charset)
           
protected static java.lang.String decode(java.lang.String component, java.lang.String charset)
           
 java.lang.String getPath()
           
 java.lang.String getPathQuery()
           
 java.lang.String getURI()
           
protected  java.util.BitSet lax(java.util.BitSet generous)
          Given a BitSet -- typically one of the URI superclass's predefined static variables -- possibly replace it with a more-lax version to better match the character sets actually left unencoded in web browser requests.
protected  void parseAuthority(java.lang.String original, boolean escaped)
          Coalesce the _host and _authority fields where possible.
protected  void parseUriReference(java.lang.String original, boolean escaped)
          Overridden in LaxURI by IA to include fixes for http://issues.apache.org/jira/browse/HTTPCLIENT-588 and http://webteam.archive.org/jira/browse/HER-1268. To avoid any possibility of conflict with non-ASCII characters, parse a URI reference as a String with the character encoding of the local system or the document.
protected  void setURI()
          Coalesce _scheme to existing instances, where appropriate.
protected  boolean validate(char[] component, java.util.BitSet generous)
           
protected  boolean validate(char[] component, int soffset, int eoffset, java.util.BitSet generous)
           
 
Methods inherited from class org.apache.commons.httpclient.URI
clone, compareTo, encode, equals, equals, getAboveHierPath, getAuthority, getCurrentHierPath, getDefaultDocumentCharset, getDefaultDocumentCharsetByLocale, getDefaultDocumentCharsetByPlatform, getDefaultProtocolCharset, getEscapedAboveHierPath, getEscapedAuthority, getEscapedCurrentHierPath, getEscapedFragment, getEscapedName, getEscapedPath, getEscapedPathQuery, getEscapedQuery, getEscapedURI, getEscapedURIReference, getEscapedUserinfo, getFragment, getHost, getName, getPort, getProtocolCharset, getQuery, getRawAboveHierPath, getRawAuthority, getRawCurrentHierPath, getRawCurrentHierPath, getRawFragment, getRawHost, getRawName, getRawPath, getRawPathQuery, getRawQuery, getRawScheme, getRawURI, getRawURIReference, getRawUserinfo, getScheme, getURIReference, getUserinfo, hasAuthority, hasFragment, hashCode, hasQuery, hasUserinfo, indexFirstOf, indexFirstOf, indexFirstOf, indexFirstOf, isAbsoluteURI, isAbsPath, isHierPart, isHostname, isIPv4address, isIPv6reference, isNetPath, isOpaquePart, isRegName, isRelativeURI, isRelPath, isServer, normalize, normalize, prevalidate, removeFragmentIdentifier, resolvePath, setDefaultDocumentCharset, setDefaultProtocolCharset, setEscapedAuthority, setEscapedFragment, setEscapedPath, setEscapedQuery, setFragment, setPath, setQuery, setRawAuthority, setRawFragment, setRawPath, setRawQuery, toString
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

HTTP_SCHEME

protected static final char[] HTTP_SCHEME

HTTPS_SCHEME

protected static final char[] HTTPS_SCHEME

lax_rel_segment

protected static final java.util.BitSet lax_rel_segment

lax_abs_path

protected static final java.util.BitSet lax_abs_path

lax_query

protected static final java.util.BitSet lax_query
Constructor Detail

LaxURI

public LaxURI(java.lang.String uri,
              boolean escaped,
              java.lang.String charset)
       throws org.apache.commons.httpclient.URIException
Throws:
org.apache.commons.httpclient.URIException

LaxURI

public LaxURI(org.apache.commons.httpclient.URI base,
              org.apache.commons.httpclient.URI relative)
       throws org.apache.commons.httpclient.URIException
Throws:
org.apache.commons.httpclient.URIException

LaxURI

public LaxURI(java.lang.String uri,
              boolean escaped)
       throws org.apache.commons.httpclient.URIException
Throws:
org.apache.commons.httpclient.URIException

LaxURI

public LaxURI()
Method Detail

getURI

public java.lang.String getURI()
                        throws org.apache.commons.httpclient.URIException
Overrides:
getURI in class org.apache.commons.httpclient.URI
Throws:
org.apache.commons.httpclient.URIException

getPath

public java.lang.String getPath()
                         throws org.apache.commons.httpclient.URIException
Overrides:
getPath in class org.apache.commons.httpclient.URI
Throws:
org.apache.commons.httpclient.URIException

getPathQuery

public java.lang.String getPathQuery()
                              throws org.apache.commons.httpclient.URIException
Overrides:
getPathQuery in class org.apache.commons.httpclient.URI
Throws:
org.apache.commons.httpclient.URIException

decode

protected static java.lang.String decode(char[] component,
                                         java.lang.String charset)
                                  throws org.apache.commons.httpclient.URIException
Throws:
org.apache.commons.httpclient.URIException

decode

protected static java.lang.String decode(java.lang.String component,
                                         java.lang.String charset)
                                  throws org.apache.commons.httpclient.URIException
Throws:
org.apache.commons.httpclient.URIException

validate

protected boolean validate(char[] component,
                           java.util.BitSet generous)
Overrides:
validate in class org.apache.commons.httpclient.URI

validate

protected boolean validate(char[] component,
                           int soffset,
                           int eoffset,
                           java.util.BitSet generous)
Overrides:
validate in class org.apache.commons.httpclient.URI

lax

protected java.util.BitSet lax(java.util.BitSet generous)
Given a BitSet -- typically one of the URI superclass's predefined static variables -- possibly replace it with a more-lax version to better match the character sets actually left unencoded in web browser requests.

Parameters:
generous - original BitSet
Returns:
(possibly more lax) BitSet to use
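
The substitution described for lax(BitSet) can be sketched in plain Java without the httpclient dependency. This is a minimal illustration, not the LaxURI source: the class LaxBitSetSketch, the fields STRICT_QUERY and LAX_QUERY, and the particular extra characters allowed are all assumptions chosen for the example.

```java
import java.util.BitSet;

final class LaxBitSetSketch {
    // Hypothetical stand-in for one of the superclass's predefined static sets.
    static final BitSet STRICT_QUERY = new BitSet(256);
    // Hypothetical laxer counterpart: the strict set plus a few extra characters.
    static final BitSet LAX_QUERY = new BitSet(256);

    static {
        for (char c = 'a'; c <= 'z'; c++) STRICT_QUERY.set(c);
        for (char c = '0'; c <= '9'; c++) STRICT_QUERY.set(c);
        STRICT_QUERY.set('=');
        STRICT_QUERY.set('&');

        LAX_QUERY.or(STRICT_QUERY);
        // Illustrative additions only; the real lax sets may differ.
        LAX_QUERY.set('{');
        LAX_QUERY.set('}');
        LAX_QUERY.set('|');
    }

    /** Swap a known strict set for its lax version; pass anything else through. */
    static BitSet lax(BitSet generous) {
        // Identity comparison: only the predefined instance itself is replaced.
        return generous == STRICT_QUERY ? LAX_QUERY : generous;
    }

    /** Check every character of the component against the (possibly laxer) set. */
    static boolean validate(char[] component, BitSet generous) {
        BitSet allowed = lax(generous);
        for (char c : component) {
            if (!allowed.get(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(validate("a=1|b".toCharArray(), STRICT_QUERY));
    }
}
```

Comparing by identity rather than with equals keeps the swap limited to the predefined instance, so a caller passing its own BitSet gets exactly the set it asked for.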

parseAuthority

protected void parseAuthority(java.lang.String original,
                              boolean escaped)
                       throws org.apache.commons.httpclient.URIException
Coalesce the _host and _authority fields where possible. In the web crawl/http domain, most URIs have an identical _host and _authority. (There is no port or user info.) However, the superclass always creates two separate char[] instances. Notably, the lengths of these char[] fields are equal if and only if their values are identical. This method makes use of this fact to reduce the two instances to one where possible, slimming instances.

Overrides:
parseAuthority in class org.apache.commons.httpclient.URI
Throws:
org.apache.commons.httpclient.URIException
See Also:
URI.parseAuthority(java.lang.String, boolean)
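
The coalescing step described for parseAuthority can be sketched as follows. This is a simplified illustration under stated assumptions: the class AuthorityCoalesceSketch and its parse method are hypothetical stand-ins for the superclass behavior, IPv6 references are ignored, and which of the two fields ends up pointing at the shared array is an arbitrary choice here.

```java
final class AuthorityCoalesceSketch {
    char[] host;
    char[] authority;

    /** Stand-in for the superclass parse: always allocates two separate arrays. */
    void parse(String authorityText) {
        authority = authorityText.toCharArray();
        // Drop any "userinfo@" prefix (indexOf returns -1 when absent, so +1 gives 0).
        String h = authorityText.substring(authorityText.indexOf('@') + 1);
        // Drop any ":port" suffix. IPv6 references are not handled in this sketch.
        int colon = h.lastIndexOf(':');
        if (colon >= 0) {
            h = h.substring(0, colon);
        }
        host = h.toCharArray();
    }

    /** Share one array when host and authority must hold the same characters. */
    void coalesce() {
        // The host is a slice of the authority, so equal lengths imply equal content.
        if (host != null && authority != null && host.length == authority.length) {
            host = authority;
        }
    }

    public static void main(String[] args) {
        AuthorityCoalesceSketch s = new AuthorityCoalesceSketch();
        s.parse("example.org");
        s.coalesce();
        System.out.println(s.host == s.authority);
    }
}
```

The length check avoids a character-by-character comparison, which is the point of the observation in the description above.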

setURI

protected void setURI()
Coalesce _scheme to existing instances, where appropriate. In the web-crawl domain, most _schemes are 'http' or 'https', but the superclass always creates a new char[] instance. For these two cases, we replace the created instance with a long-lived instance from a static field, saving 12-14 bytes per instance.

Overrides:
setURI in class org.apache.commons.httpclient.URI
See Also:
URI.setURI()
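
The scheme coalescing described for setURI amounts to swapping a freshly allocated array for a shared constant. A minimal sketch, assuming a hypothetical helper class SchemeCoalesceSketch; only the constant names HTTP_SCHEME and HTTPS_SCHEME come from the field summary, and the actual LaxURI code may be structured differently.

```java
import java.util.Arrays;

final class SchemeCoalesceSketch {
    // Long-lived shared instances, mirroring the documented static fields.
    static final char[] HTTP_SCHEME = {'h', 't', 't', 'p'};
    static final char[] HTTPS_SCHEME = {'h', 't', 't', 'p', 's'};

    /** Return a shared array for "http" or "https"; otherwise return the input. */
    static char[] coalesce(char[] scheme) {
        if (Arrays.equals(scheme, HTTP_SCHEME)) {
            return HTTP_SCHEME;
        }
        if (Arrays.equals(scheme, HTTPS_SCHEME)) {
            return HTTPS_SCHEME;
        }
        return scheme; // other schemes (and null) are left untouched
    }

    public static void main(String[] args) {
        char[] fresh = "http".toCharArray();
        System.out.println(coalesce(fresh) == HTTP_SCHEME);
    }
}
```

Because every coalesced URI then points at the same array, that array must never be modified in place.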

parseUriReference

protected void parseUriReference(java.lang.String original,
                                 boolean escaped)
                          throws org.apache.commons.httpclient.URIException
Overridden in LaxURI by IA to include fixes for http://issues.apache.org/jira/browse/HTTPCLIENT-588 and http://webteam.archive.org/jira/browse/HER-1268. To avoid any possibility of conflict with non-ASCII characters, parse a URI reference as a String with the character encoding of the local system or the document.

The following line is the regular expression for breaking down a URI reference into its components.

   ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
    12            3  4          5       6  7        8 9
 

For example, matching the above expression to http://jakarta.apache.org/ietf/uri/#Related results in the following subexpression matches:

               $1 = http:
  scheme    =  $2 = http
               $3 = //jakarta.apache.org
  authority =  $4 = jakarta.apache.org
  path      =  $5 = /ietf/uri/
               $6 = 
  query     =  $7 = 
               $8 = #Related
  fragment  =  $9 = Related
 

Overrides:
parseUriReference in class org.apache.commons.httpclient.URI
Parameters:
original - the original character sequence
escaped - true if original is escaped
Throws:
org.apache.commons.httpclient.URIException - If an error occurs.
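
The regular expression given in the parseUriReference description can be tried directly with java.util.regex. The sketch below only demonstrates the expression itself (the same one that appears in RFC 2396, Appendix B); it is not the parsing code of LaxURI or its superclass, and the class name UriReferenceRegexDemo and method split are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UriReferenceRegexDemo {
    // The expression from the description, escaped for a Java string literal.
    static final Pattern URI_REFERENCE = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    /**
     * Returns {scheme, authority, path, query, fragment}, taken from
     * groups 2, 4, 5, 7 and 9. Components that are absent come back as null.
     */
    static String[] split(String uriReference) {
        Matcher m = URI_REFERENCE.matcher(uriReference);
        if (!m.matches()) {
            // Can happen for input containing line terminators, which '.' skips.
            throw new IllegalArgumentException("not a URI reference: " + uriReference);
        }
        return new String[] {
            m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)
        };
    }

    public static void main(String[] args) {
        String[] parts = split("http://jakarta.apache.org/ietf/uri/#Related");
        String[] names = {"scheme", "authority", "path", "query", "fragment"};
        for (int i = 0; i < parts.length; i++) {
            System.out.println(names[i] + " = " + parts[i]);
        }
    }
}
```

One difference from the table in the description: where a component is missing entirely, as the query is in this example, Java reports the group as null rather than as an empty string.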


Copyright © 2003-2011 Internet Archive. All Rights Reserved.