org.archive.util
Class SurtPrefixSet

java.lang.Object
  extended by java.util.AbstractCollection<E>
      extended by java.util.AbstractSet<E>
          extended by java.util.TreeSet<java.lang.String>
              extended by org.archive.util.PrefixSet
                  extended by org.archive.util.SurtPrefixSet
All Implemented Interfaces:
java.io.Serializable, java.lang.Cloneable, java.lang.Iterable<java.lang.String>, java.util.Collection<java.lang.String>, java.util.NavigableSet<java.lang.String>, java.util.Set<java.lang.String>, java.util.SortedSet<java.lang.String>

public class SurtPrefixSet
extends PrefixSet

Specialized TreeSet for keeping a set of String prefixes. Redundant prefixes (those that are themselves prefixed by other set entries) are eliminated.

Author:
gojomo
See Also:
Serialized Form
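
The redundancy elimination described above can be illustrated with a small standalone sketch. It is not the PrefixSet or SurtPrefixSet source: the class name PrefixSetSketch is hypothetical, and it wraps a plain java.util.TreeSet rather than extending one. It relies on the fact that, once no entry is a prefix of another, only the greatest entry not after a string can be a prefix of that string.

```java
import java.util.Iterator;
import java.util.TreeSet;

// Hypothetical illustration only; not the org.archive.util implementation.
class PrefixSetSketch {
    private final TreeSet<String> entries = new TreeSet<>();

    /** Adds s unless an existing entry already prefixes it; drops entries that s prefixes. */
    boolean add(String s) {
        if (containsPrefixOf(s)) {
            return false; // s is redundant: a shorter (or equal) prefix is present
        }
        // Entries that start with s sort immediately after s, as one contiguous run.
        Iterator<String> it = entries.tailSet(s, false).iterator();
        while (it.hasNext() && it.next().startsWith(s)) {
            it.remove(); // made redundant by the new, broader prefix
        }
        entries.add(s);
        return true;
    }

    /** True if some entry of the set is a prefix of s (or equal to s). */
    boolean containsPrefixOf(String s) {
        // With redundant entries removed, only the floor entry can be a prefix of s.
        String floor = entries.floor(s);
        return floor != null && s.startsWith(floor);
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        PrefixSetSketch set = new PrefixSetSketch();
        set.add("http://(org,archive,www,)/about/");
        set.add("http://(org,archive,"); // broader prefix replaces the entry above
        System.out.println(set.size());
        System.out.println(set.containsPrefixOf("http://(org,archive,web,)/index.html"));
    }
}
```

In this sketch, adding a broader prefix silently evicts narrower ones, so the set size can shrink on add.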

Constructor Summary
SurtPrefixSet()
           
 
Method Summary
 void convertAllPrefixesToDomains()
          Changes all prefixes so that they only enforce a general domain (allowing subdomains). For prefixes that don't include a ')', no change is necessary.
 void convertAllPrefixesToHosts()
          Changes all prefixes so that they enforce an exact host.
static java.lang.String convertPrefixToDomain(java.lang.String prefix)
           
static java.lang.String convertPrefixToHost(java.lang.String prefix)
           
 void exportTo(java.io.Writer fw)
           
static java.lang.String getCandidateSurt(java.lang.Object object)
          Calculate the SURT form of the URI from the given Object (CandidateURI or UURI), for use as a candidate when matching against prefixes.
 void importFrom(java.io.Reader r)
          Read a set of SURT prefixes from a reader source; keep sorted and with redundant entries removed.
 void importFromMixed(java.io.Reader r, boolean deduceFromSeeds)
          Import SURT prefixes from a reader with mixed URI and SURT prefix format.
 void importFromUris(java.io.Reader r)
           
static void main(java.lang.String[] args)
          Allow class to be used as a command-line tool for converting URL lists (or naked host or host/path fragments implied to be HTTP URLs) to implied SURT prefix form.
static java.lang.String prefixFromPlain(java.lang.String u)
          Given a plain URI or hostname/hostname+path, deduce an implied SURT prefix from it.
 
Methods inherited from class org.archive.util.PrefixSet
add, containsPrefixOf
 
Methods inherited from class java.util.TreeSet
addAll, ceiling, clear, clone, comparator, contains, descendingIterator, descendingSet, first, floor, headSet, headSet, higher, isEmpty, iterator, last, lower, pollFirst, pollLast, remove, size, subSet, subSet, tailSet, tailSet
 
Methods inherited from class java.util.AbstractSet
equals, hashCode, removeAll
 
Methods inherited from class java.util.AbstractCollection
containsAll, retainAll, toArray, toArray, toString
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface java.util.Set
containsAll, equals, hashCode, removeAll, retainAll, toArray, toArray
 

Constructor Detail

SurtPrefixSet

public SurtPrefixSet()
Method Detail

importFrom

public void importFrom(java.io.Reader r)
Read a set of SURT prefixes from a reader source; keep sorted and with redundant entries removed.

Parameters:
r - reader over a file of SURT-format strings
Throws:
java.io.IOException

importFromUris

public void importFromUris(java.io.Reader r)
Parameters:
r - Where to read from.

importFromMixed

public void importFromMixed(java.io.Reader r,
                            boolean deduceFromSeeds)
Import SURT prefixes from a reader with mixed URI and SURT prefix format.

Parameters:
r - the reader to import the prefixes from
deduceFromSeeds - true to also import SURT prefixes implied from normal URIs/hostname seeds

prefixFromPlain

public static java.lang.String prefixFromPlain(java.lang.String u)
Given a plain URI or hostname/hostname+path, deduce an implied SURT prefix from it. Results may be unpredictable on strings that cannot be interpreted as URIs. UURI 'fixup' is applied to the URI that is built.

Parameters:
u - URI or almost-URI to consider
Returns:
implied SURT prefix form
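
As a rough illustration of what deducing an implied SURT prefix can look like, the sketch below reverses the host labels into SURT order and keeps the path up to its last slash. This is not the method's source. The name impliedPrefix is hypothetical, and the handling of scheme-less input (treated as HTTP) and of path-less input (left open to subdomains) are assumptions. The UURI 'fixup' step is not reproduced, and ports and user info are ignored.

```java
import java.util.Locale;

// Hypothetical illustration only; not the SurtPrefixSet.prefixFromPlain source.
class PlainToSurtSketch {
    static String impliedPrefix(String u) {
        // Assumption: input without a scheme is treated as an HTTP URL.
        if (!u.contains("://")) {
            u = "http://" + u;
        }
        int hostStart = u.indexOf("://") + 3;
        int pathStart = u.indexOf('/', hostStart);
        String scheme = u.substring(0, hostStart);
        String host = pathStart < 0 ? u.substring(hostStart) : u.substring(hostStart, pathStart);
        String path = pathStart < 0 ? "" : u.substring(pathStart);

        // SURT form lists host labels from most general to most specific.
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        StringBuilder surt = new StringBuilder(scheme).append('(');
        for (int i = labels.length - 1; i >= 0; i--) {
            surt.append(labels[i]).append(',');
        }
        if (path.isEmpty()) {
            // Assumption: no path means the prefix stays open (no closing ')').
            return surt.toString();
        }
        // Keep the path up to and including its last slash.
        return surt.append(')').append(path, 0, path.lastIndexOf('/') + 1).toString();
    }

    public static void main(String[] args) {
        System.out.println(impliedPrefix("www.archive.org"));
        System.out.println(impliedPrefix("http://archive.org/movies/index.html"));
    }
}
```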

getCandidateSurt

public static java.lang.String getCandidateSurt(java.lang.Object object)
Calculate the SURT form of the URI from the given Object (CandidateURI or UURI), for use as a candidate when matching against prefixes.

Parameters:
object - CandidateURI or UURI
Returns:
SURT form of URI for evaluation, or null if unavailable

exportTo

public void exportTo(java.io.Writer fw)
              throws java.io.IOException
Parameters:
fw - writer to export the prefixes to
Throws:
java.io.IOException

convertAllPrefixesToHosts

public void convertAllPrefixesToHosts()
Changes all prefixes so that they enforce an exact host. For prefixes that already include a ')', this means discarding anything after ')' (path info). For prefixes that don't include a ')' -- domain prefixes open to subdomains -- add the closing ')' (or ",)").
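
The per-prefix rule described above can be written as a short string transformation. The sketch below follows that description only; it is not the convertPrefixToHost source, and the name toHostPrefix is hypothetical.

```java
// Hypothetical illustration only; not the SurtPrefixSet.convertPrefixToHost source.
class HostPrefixSketch {
    static String toHostPrefix(String prefix) {
        int close = prefix.indexOf(')');
        if (close >= 0) {
            // Host already closed: discard any path info after the ')'.
            return prefix.substring(0, close + 1);
        }
        // Open domain prefix: close it, adding the label separator if it is missing.
        return prefix.endsWith(",") ? prefix + ")" : prefix + ",)";
    }

    public static void main(String[] args) {
        System.out.println(toHostPrefix("http://(org,archive,www,)/about/")); // http://(org,archive,www,)
        System.out.println(toHostPrefix("http://(org,archive,"));             // http://(org,archive,)
    }
}
```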


convertPrefixToHost

public static java.lang.String convertPrefixToHost(java.lang.String prefix)

convertAllPrefixesToDomains

public void convertAllPrefixesToDomains()
Changes all prefixes so that they only enforce a general domain (allowing subdomains). For prefixes that don't include a ')', no change is necessary. For others, truncate everything from the ')' onward. Additionally, strip "www," if it appears.
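
A minimal sketch of the per-prefix rule described above follows. It is not the convertPrefixToDomain source. The name toDomainPrefix is hypothetical, and it assumes "www," is stripped only when it is the last complete host label of the prefix.

```java
// Hypothetical illustration only; not the SurtPrefixSet.convertPrefixToDomain source.
class DomainPrefixSketch {
    static String toDomainPrefix(String prefix) {
        int close = prefix.indexOf(')');
        if (close >= 0) {
            // Drop the ')' and everything after it, reopening the prefix to subdomains.
            prefix = prefix.substring(0, close);
        }
        // Assumption: strip "www," only when it is a whole trailing label.
        if (prefix.endsWith(",www,") || prefix.endsWith("(www,")) {
            prefix = prefix.substring(0, prefix.length() - "www,".length());
        }
        return prefix;
    }

    public static void main(String[] args) {
        System.out.println(toDomainPrefix("http://(org,archive,www,)/about/")); // http://(org,archive,
        System.out.println(toDomainPrefix("http://(org,archive,web,)"));        // http://(org,archive,web,
    }
}
```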


convertPrefixToDomain

public static java.lang.String convertPrefixToDomain(java.lang.String prefix)

main

public static void main(java.lang.String[] args)
                 throws java.io.IOException
Allow class to be used as a command-line tool for converting URL lists (or naked host or host/path fragments implied to be HTTP URLs) to implied SURT prefix form. Read from stdin or first file argument. Writes to stdout.

Parameters:
args - cmd-line arguments: may include input file
Throws:
java.io.IOException


Copyright © 2003-2011 Internet Archive. All Rights Reserved.