org.archive.net
Class PublicSuffixes

java.lang.Object
  extended by org.archive.net.PublicSuffixes

public class PublicSuffixes
extends java.lang.Object

Utility class for making use of the information about 'public suffixes' at http://publicsuffix.org.

The public suffix list (once known as 'effective TLDs') was motivated by the need to decide which broader domains a subdomain is allowed to set cookies for. For example, a server at 'www.example.com' can set cookies for 'www.example.com' or 'example.com' but not 'com'. Likewise, 'www.example.co.uk' can set cookies for 'www.example.co.uk' or 'example.co.uk' but not 'co.uk' or 'uk'. The list of rules covering all top-level domains and second- or third-level domains has become quite long; essentially, the broadest domain a subdomain may set cookies for is the one that was sold/registered to a specific name registrant.

This concept should be useful in other contexts, too. Grouping URIs (or queues of URIs to crawl) with others that share the same registered suffix may be useful for applying the same rules to all of them, such as assigning them to the same queue or crawler in a multi-machine setup.
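
The cookie rule described above can be illustrated with a small, self-contained sketch. It does not use this class; the class name CookieScopeSketch, the method allowedCookieDomains, and the three-entry suffix set are all hypothetical stand-ins for the real, much longer list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class CookieScopeSketch {
    // Hypothetical, hand-picked subset of public suffixes; the real list is far longer.
    static final Set<String> PUBLIC_SUFFIXES = Set.of("com", "uk", "co.uk");

    /** Returns the domains a host may set cookies for, narrowest first. */
    static List<String> allowedCookieDomains(String host) {
        List<String> allowed = new ArrayList<>();
        String candidate = host;
        // Walk from the full hostname toward the TLD, stopping at the first public suffix.
        while (!PUBLIC_SUFFIXES.contains(candidate)) {
            allowed.add(candidate);
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                break; // ran out of segments without reaching a known suffix
            }
            candidate = candidate.substring(dot + 1);
        }
        return allowed;
    }

    public static void main(String[] args) {
        System.out.println(allowedCookieDomains("www.example.com"));
        System.out.println(allowedCookieDomains("www.example.co.uk"));
    }
}
```

With this toy suffix set, 'www.example.co.uk' yields 'www.example.co.uk' and 'example.co.uk', and the walk stops before 'co.uk'.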

Author:
Gojomo

Field Summary
protected static java.util.regex.Pattern topmostAssignedSurtPrefixPattern
           
protected static java.lang.String topmostAssignedSurtPrefixRegex
           
 
Constructor Summary
PublicSuffixes()
           
 
Method Summary
protected static void buildRegex(java.lang.String stem, java.lang.StringBuilder regex, java.util.SortedSet<java.lang.String> prefixes)
           
static java.util.regex.Pattern getTopmostAssignedSurtPrefixPattern()
           
static java.lang.String getTopmostAssignedSurtPrefixRegex()
           
static java.lang.String getTopmostAssignedSurtPrefixRegex(java.io.BufferedReader reader)
           
static void main(java.lang.String[] args)
          Utility method for dumping a regex String, based on a published public suffix list, which matches any SURT-form hostname up through the broadest 'private' (assigned/sold) domain-segment.
static java.util.List<java.lang.String> readPublishedFileToSurtList(java.io.BufferedReader reader)
          Reads a file of the format promulgated by publicsuffix.org, ignoring comments and '!' exceptions/notations, converting domain segments to SURT-ordering.
static java.lang.String reduceSurtToTopmostAssigned(java.lang.String surt)
          Truncate SURT to its topmost assigned domain segment; that is, the public suffix plus one segment, but as a SURT-ordered prefix.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

topmostAssignedSurtPrefixPattern

protected static java.util.regex.Pattern topmostAssignedSurtPrefixPattern

topmostAssignedSurtPrefixRegex

protected static java.lang.String topmostAssignedSurtPrefixRegex
Constructor Detail

PublicSuffixes

public PublicSuffixes()
Method Detail

main

public static void main(java.lang.String[] args)
                 throws java.io.IOException
Utility method for dumping a regex String, based on a published public suffix list, which matches any SURT-form hostname up through the broadest 'private' (assigned/sold) domain-segment. That is, for any of the SURT-form hostnames...

   com,example,
   com,example,www,
   com,example,california,www

...the regex will match 'com,example,'.

Parameters:
args -
Throws:
java.io.IOException
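
To make the matching behavior of main's output concrete, here is a minimal sketch using a hand-written pattern. The pattern below is an assumption covering only two hypothetical rules ('com' and 'co.uk'); it is not the regex this class generates, and the names SurtPrefixRegexSketch and matchPrefix are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SurtPrefixRegexSketch {
    // Hand-written for a hypothetical two-rule suffix list ("com" and "co.uk").
    // In SURT order those suffixes read "com," and "uk,co,"; one further
    // segment plus its trailing comma is the topmost assigned domain.
    static final Pattern TOPMOST_ASSIGNED =
            Pattern.compile("^(?:com,|uk,co,)[^,]+,");

    /** Returns the matched prefix, or null when no suffix rule applies. */
    static String matchPrefix(String surtHost) {
        Matcher m = TOPMOST_ASSIGNED.matcher(surtHost);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "com,example,", "com,example,www,", "com,example,california,www"
        };
        for (String surt : samples) {
            System.out.println(surt + " -> " + matchPrefix(surt));
        }
    }
}
```

All three sample hostnames reduce to 'com,example,' under this stand-in pattern.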

readPublishedFileToSurtList

public static java.util.List<java.lang.String> readPublishedFileToSurtList(java.io.BufferedReader reader)
                                                                    throws java.io.IOException
Reads a file of the format promulgated by publicsuffix.org, ignoring comments and '!' exceptions/notations, converting domain segments to SURT-ordering. Leaves glob-style '*' wildcarding in place. Returns sorted list of unique SURT-ordered prefixes.

Parameters:
reader - reader over a file in the publicsuffix.org format
Returns:
sorted list of unique SURT-ordered prefixes
Throws:
java.io.IOException
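
The conversion described for readPublishedFileToSurtList might look roughly like the sketch below. It is not the method's actual implementation. It assumes that comments begin with '//' and that lines starting with '!' are skipped entirely (the description above could also mean something narrower); the names SuffixListToSurtSketch, toSurt, and toSurtList are hypothetical.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

class SuffixListToSurtSketch {
    /** Reverses dot-separated segments: "co.uk" becomes "uk,co,". A "*" segment is kept. */
    static String toSurt(String rule) {
        String[] segments = rule.split("\\.");
        StringBuilder surt = new StringBuilder();
        for (int i = segments.length - 1; i >= 0; i--) {
            surt.append(segments[i]).append(',');
        }
        return surt.toString();
    }

    /** Returns a sorted list of unique SURT-ordered prefixes. */
    static List<String> toSurtList(BufferedReader reader) {
        return reader.lines()
                .map(String::trim)
                // Assumption: skip blank lines, "//" comments, and '!' exception rules.
                .filter(l -> !l.isEmpty() && !l.startsWith("//") && !l.startsWith("!"))
                .map(SuffixListToSurtSketch::toSurt)
                .distinct()
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String sample = "// sample rules\ncom\nco.uk\n*.ck\n!www.ck\ncom\n";
        System.out.println(toSurtList(new BufferedReader(new StringReader(sample))));
    }
}
```

The sketch uses BufferedReader.lines(), so read failures surface as unchecked exceptions rather than the IOException the real method declares.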

buildRegex

protected static void buildRegex(java.lang.String stem,
                                 java.lang.StringBuilder regex,
                                 java.util.SortedSet<java.lang.String> prefixes)

getTopmostAssignedSurtPrefixPattern

public static java.util.regex.Pattern getTopmostAssignedSurtPrefixPattern()

getTopmostAssignedSurtPrefixRegex

public static java.lang.String getTopmostAssignedSurtPrefixRegex()

getTopmostAssignedSurtPrefixRegex

public static java.lang.String getTopmostAssignedSurtPrefixRegex(java.io.BufferedReader reader)

reduceSurtToTopmostAssigned

public static java.lang.String reduceSurtToTopmostAssigned(java.lang.String surt)
Truncate SURT to its topmost assigned domain segment; that is, the public suffix plus one segment, but as a SURT-ordered prefix. If the pattern doesn't match, the passed-in SURT is returned.

Parameters:
surt - SURT to truncate
Returns:
truncated-to-topmost-assigned SURT prefix
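
The truncate-or-return-unchanged behavior, and its use for grouping hosts into queues, can be sketched as follows. The pattern is a hypothetical stand-in covering only 'com' and 'co.uk', and the names ReduceSurtSketch, reduce, and groupByAssignedDomain are illustrative; in real use one would presumably call reduceSurtToTopmostAssigned itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReduceSurtSketch {
    // Hypothetical stand-in pattern covering only the "com" and "co.uk" suffixes.
    static final Pattern PREFIX = Pattern.compile("(?:com,|uk,co,)[^,]+,");

    /** Truncates to the matched prefix; returns the input unchanged on no match. */
    static String reduce(String surt) {
        Matcher m = PREFIX.matcher(surt);
        // lookingAt() only matches at the start of the input.
        return m.lookingAt() ? m.group() : surt;
    }

    /** Groups SURT hostnames under their reduced prefix, e.g. one crawl queue per key. */
    static Map<String, List<String>> groupByAssignedDomain(List<String> surts) {
        Map<String, List<String>> queues = new TreeMap<>();
        for (String surt : surts) {
            queues.computeIfAbsent(reduce(surt), k -> new ArrayList<>()).add(surt);
        }
        return queues;
    }

    public static void main(String[] args) {
        List<String> hosts = List.of(
                "com,example,www,", "com,example,mail,", "uk,co,example,www,", "org,example,www,");
        System.out.println(groupByAssignedDomain(hosts));
    }
}
```

Here 'org,example,www,' has no matching rule in the toy pattern, so it is returned as-is and forms its own group.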


Copyright © 2003-2011 Internet Archive. All Rights Reserved.