org.archive.net
Class PublicSuffixes
java.lang.Object
org.archive.net.PublicSuffixes
public class PublicSuffixes
- extends java.lang.Object
Utility class for making use of the information about 'public suffixes' at
http://publicsuffix.org.
The public suffix list (once known as 'effective TLDs') was motivated by the
need to decide which broader domains a subdomain is allowed to set cookies
for. For example, a server at 'www.example.com' can set cookies for
'www.example.com' or 'example.com' but not 'com'. A server at
'www.example.co.uk' can set cookies for 'www.example.co.uk' or
'example.co.uk' but not 'co.uk' or 'uk'.
The list of rules covering all top-level domains and 2nd- or 3rd-level domains
has become quite long; essentially, the broadest domain for which a subdomain
may set cookies is the one that was sold/registered to a specific name registrant.
This concept should be useful in other contexts, too. Grouping URIs (or
queues of URIs to crawl) together with others sharing the same registered
suffix may be useful for applying the same rules to all, such as assigning
them to the same queue or crawler in a multi-machine setup.
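The cookie rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `RegisteredDomainSketch`, its methods, and the three-entry suffix set are hypothetical stand-ins, not part of `PublicSuffixes`, and the real list at publicsuffix.org is far longer and also has wildcard and exception rules that this sketch ignores. Hostnames are assumed to be lowercase.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class RegisteredDomainSketch {
    // Hypothetical, tiny stand-in for the published public suffix list.
    static final Set<String> SUFFIXES = new HashSet<>(Arrays.asList("com", "uk", "co.uk"));

    /** Longest listed suffix plus one more label; the host itself if no rule applies. */
    static String registeredDomain(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        // Candidates run from the whole host down to the last label, so the
        // longest matching suffix is found first.
        for (int i = 0; i < labels.length; i++) {
            String candidate = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(candidate)) {
                // i == 0 means the host is itself a public suffix.
                int start = Math.max(i - 1, 0);
                return String.join(".", Arrays.copyOfRange(labels, start, labels.length));
            }
        }
        return host;
    }

    /** True if 'domain' covers 'host' and is no broader than the registered domain. */
    static boolean maySetCookieFor(String host, String domain) {
        String registered = registeredDomain(host);
        boolean coversHost = host.equals(domain) || host.endsWith("." + domain);
        boolean withinRegistered = domain.equals(registered) || domain.endsWith("." + registered);
        return coversHost && withinRegistered;
    }

    public static void main(String[] args) {
        for (String host : new String[] {"www.example.com", "www.example.co.uk"}) {
            System.out.println(host + " -> " + registeredDomain(host));
        }
    }
}
```

With this stand-in list, `maySetCookieFor("www.example.co.uk", "example.co.uk")` is allowed while `"co.uk"` and `"uk"` are rejected, mirroring the examples in the class description.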
- Author:
- Gojomo
Method Summary
protected static void buildRegex(java.lang.String stem,
java.lang.StringBuilder regex,
java.util.SortedSet<java.lang.String> prefixes)
static java.util.regex.Pattern getTopmostAssignedSurtPrefixPattern()
static java.lang.String getTopmostAssignedSurtPrefixRegex()
static java.lang.String getTopmostAssignedSurtPrefixRegex(java.io.BufferedReader reader)
static void main(java.lang.String[] args)
- Utility method for dumping a regex String, based on a published public
suffix list, which matches any SURT-form hostname up through the broadest
'private' (assigned/sold) domain-segment.
static java.util.List<java.lang.String> readPublishedFileToSurtList(java.io.BufferedReader reader)
- Reads a file of the format promulgated by publicsuffix.org, ignoring
comments and '!' exceptions/notations, converting domain segments to
SURT-ordering.
static java.lang.String reduceSurtToTopmostAssigned(java.lang.String surt)
- Truncate SURT to its topmost assigned domain segment; that is,
the public suffix plus one segment, but as a SURT-ordered prefix.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
topmostAssignedSurtPrefixPattern
protected static java.util.regex.Pattern topmostAssignedSurtPrefixPattern
topmostAssignedSurtPrefixRegex
protected static java.lang.String topmostAssignedSurtPrefixRegex
PublicSuffixes
public PublicSuffixes()
main
public static void main(java.lang.String[] args)
throws java.io.IOException
- Utility method for dumping a regex String, based on a published public
suffix list, which matches any SURT-form hostname up through the broadest
'private' (assigned/sold) domain-segment. That is, for any of the
SURT-form hostnames...
com,example,
com,example,www,
com,example,california,www
...the regex will match 'com,example,'.
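To make the shape of such a regex concrete, here is a naive sketch that assembles one from a handful of SURT-ordered suffix prefixes. It is not the class's own `buildRegex` (whose algorithm is not shown on this page); the class `SurtPrefixRegexSketch` and its methods are hypothetical, and the translation of '*' into "any single segment" is an assumption about how wildcard rules are meant to behave.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SurtPrefixRegexSketch {
    /** Builds "(suffix1|suffix2|...)" followed by exactly one more segment. */
    static Pattern build(List<String> surtSuffixes) {
        List<String> sorted = new ArrayList<>(surtSuffixes);
        // Try suffixes with more segments first, so "uk,co," wins over "uk,".
        sorted.sort(Comparator.comparingInt((String s) -> s.split(",").length).reversed());
        StringBuilder regex = new StringBuilder("(?:");
        for (int i = 0; i < sorted.size(); i++) {
            if (i > 0) {
                regex.append('|');
            }
            for (String segment : sorted.get(i).split(",")) {
                // Assumption: '*' stands for any single segment.
                regex.append("*".equals(segment) ? "[^,]+" : Pattern.quote(segment));
                regex.append(',');
            }
        }
        // The public suffix plus one assigned segment.
        regex.append(")[^,]+,");
        return Pattern.compile(regex.toString());
    }

    /** Returns the matched prefix, or the input unchanged when nothing matches. */
    static String reduce(Pattern pattern, String surtHost) {
        Matcher m = pattern.matcher(surtHost);
        return m.lookingAt() ? m.group() : surtHost;
    }

    public static void main(String[] args) {
        Pattern p = build(Arrays.asList("com,", "uk,", "uk,co,", "ck,*,"));
        System.out.println(reduce(p, "com,example,www,"));
    }
}
```

A flat alternation like this grows with every rule in the list; a production regex would more plausibly share common stems between alternatives, which is presumably what the `stem` parameter of `buildRegex` is for.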
- Parameters:
args
-
- Throws:
java.io.IOException
readPublishedFileToSurtList
public static java.util.List<java.lang.String> readPublishedFileToSurtList(java.io.BufferedReader reader)
throws java.io.IOException
- Reads a file of the format promulgated by publicsuffix.org, ignoring
comments and '!' exceptions/notations, converting domain segments to
SURT-ordering. Leaves glob-style '*' wildcarding in place. Returns sorted
list of unique SURT-ordered prefixes.
- Parameters:
reader
- reader over a file in the publicsuffix.org format
- Returns:
- sorted list of unique SURT-ordered prefixes
- Throws:
java.io.IOException
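The reading step described for this method might look roughly like the sketch below. Several details are assumptions rather than documented behavior: the class and method names are hypothetical, lines beginning with '!' are skipped outright (one reading of "ignoring ... '!' exceptions"), and each SURT-ordered prefix ends with a trailing comma, as in the 'com,example,' example elsewhere on this page. That comments start with `//` and that a rule ends at the first whitespace follows the published list format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class SuffixListReaderSketch {
    static List<String> readToSurtList(BufferedReader reader) throws IOException {
        // A TreeSet keeps entries unique and sorted.
        TreeSet<String> surts = new TreeSet<>();
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            // Skip blank lines, '//' comments and (by assumption) '!' exception rules.
            if (line.isEmpty() || line.startsWith("//") || line.startsWith("!")) {
                continue;
            }
            // Only the text up to the first whitespace is the rule.
            String rule = line.split("\\s+")[0];
            String[] labels = rule.split("\\.");
            // Reverse the labels into SURT order; '*' labels pass through untouched.
            StringBuilder surt = new StringBuilder();
            for (int i = labels.length - 1; i >= 0; i--) {
                surt.append(labels[i]).append(',');
            }
            surts.add(surt.toString());
        }
        return new ArrayList<>(surts);
    }

    public static void main(String[] args) throws IOException {
        String sample = "// sample rules\ncom\nco.uk\n*.ck\n!www.ck\n";
        System.out.println(readToSurtList(new BufferedReader(new StringReader(sample))));
    }
}
```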
buildRegex
protected static void buildRegex(java.lang.String stem,
java.lang.StringBuilder regex,
java.util.SortedSet<java.lang.String> prefixes)
getTopmostAssignedSurtPrefixPattern
public static java.util.regex.Pattern getTopmostAssignedSurtPrefixPattern()
getTopmostAssignedSurtPrefixRegex
public static java.lang.String getTopmostAssignedSurtPrefixRegex()
getTopmostAssignedSurtPrefixRegex
public static java.lang.String getTopmostAssignedSurtPrefixRegex(java.io.BufferedReader reader)
reduceSurtToTopmostAssigned
public static java.lang.String reduceSurtToTopmostAssigned(java.lang.String surt)
- Truncate SURT to its topmost assigned domain segment; that is,
the public suffix plus one segment, but as a SURT-ordered prefix.
If the pattern doesn't match, the passed-in SURT is returned unchanged.
- Parameters:
surt
- SURT to truncate
- Returns:
- truncated-to-topmost-assigned SURT prefix
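One use suggested in the class description is grouping URIs under their topmost assigned prefix, for example as a queue key. The sketch below illustrates that idea without depending on this class: `QueueKeySketch`, its hard-coded pattern (covering only 'com' and 'co.uk'), and its `reduce` method are hypothetical stand-ins for `reduceSurtToTopmostAssigned`, and the inputs are assumed to be bare SURT-form hostnames.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QueueKeySketch {
    // Hypothetical stand-in for the generated pattern; two suffixes only.
    static final Pattern PREFIX = Pattern.compile("(?:uk,co,|com,)[^,]+,");

    /** Truncates to the matched prefix; falls back to the input when nothing matches. */
    static String reduce(String surtHost) {
        Matcher m = PREFIX.matcher(surtHost);
        return m.lookingAt() ? m.group() : surtHost;
    }

    /** Groups SURT-form hostnames by their reduced prefix. */
    static Map<String, List<String>> groupByAssignedPrefix(List<String> surtHosts) {
        Map<String, List<String>> queues = new TreeMap<>();
        for (String surtHost : surtHosts) {
            queues.computeIfAbsent(reduce(surtHost), key -> new ArrayList<>()).add(surtHost);
        }
        return queues;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList(
                "com,example,www,", "com,example,mail,", "uk,co,example,", "org,example,www,");
        System.out.println(groupByAssignedPrefix(hosts));
    }
}
```

Because unmatched input is returned as-is, a host under a suffix the pattern does not know simply becomes its own group rather than being dropped.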
Copyright © 2003-2011 Internet Archive. All Rights Reserved.