org.archive.crawler.datamodel
Class CandidateURI

java.lang.Object
  extended by org.archive.crawler.datamodel.CandidateURI
All Implemented Interfaces:
java.io.Serializable, CoreAttributeConstants, Reporter
Direct Known Subclasses:
CrawlURI

public class CandidateURI
extends java.lang.Object
implements java.io.Serializable, Reporter, CoreAttributeConstants

A URI, discovered or passed-in, that may be scheduled. When scheduled, a CandidateURI becomes a CrawlURI made with the data contained herein. A CandidateURI contains just the fields necessary to perform quick in-scope analysis.

It also has a flexible attribute list that is promoted into any CrawlURI created from this CandidateURI. Use it to add custom data or state needed later during custom processing. See the accessors and setters putString(String, String), getString(String), etc.
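
As a rough illustration of the typed put/get accessor pattern described above, the sketch below uses a plain HashMap as a stand-in for the attribute list. The class name AttributeBagSketch and the keys "my-tag" and "depth" are hypothetical. The real class stores its attributes in an st.ata.util.AList, whose behavior (for example on missing keys) may differ from this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a CandidateURI's attribute list; not Heritrix code.
class AttributeBagSketch {
    private final Map<String, Object> attrs = new HashMap<>();

    void putString(String key, String value) { attrs.put(key, value); }
    void putInt(String key, int value) { attrs.put(key, value); }
    void putLong(String key, long value) { attrs.put(key, value); }

    // In this sketch a missing key yields null for getString and a
    // NullPointerException (on unboxing) for getInt/getLong.
    String getString(String key) { return (String) attrs.get(key); }
    int getInt(String key) { return (Integer) attrs.get(key); }
    long getLong(String key) { return (Long) attrs.get(key); }

    boolean containsKey(String key) { return attrs.containsKey(key); }
    void remove(String key) { attrs.remove(key); }

    public static void main(String[] args) {
        AttributeBagSketch bag = new AttributeBagSketch();
        bag.putString("my-tag", "news");   // custom state attached at discovery time
        bag.putInt("depth", 3);
        System.out.println(bag.getString("my-tag") + " " + bag.getInt("depth"));
    }
}
```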

Author:
Gordon Mohr
See Also:
Serialized Form

Field Summary
static int HIGH
          High scheduling priority.
static int HIGHEST
          Highest scheduling priority.
static int MEDIUM
          Medium priority.
static int NORMAL
          Normal/low priority.
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Constructor Summary
protected CandidateURI()
          Constructor.
  CandidateURI(UURI u)
           
  CandidateURI(UURI u, java.lang.String pathFromSeed, UURI via, java.lang.CharSequence viaContext)
           
 
Method Summary
protected  void clearAList()
           
 boolean containsKey(java.lang.String key)
           
 CandidateURI createCandidateURI(UURI baseUURI, Link link)
          Utility method for creation of CandidateURIs found when extracting links from this CrawlURI.
 CandidateURI createCandidateURI(UURI baseUURI, Link link, int scheduling, boolean seed)
          Utility method for creation of CandidateURIs found when extracting links from this CrawlURI.
static CandidateURI createSeedCandidateURI(UURI uuri)
           
 java.lang.String flattenVia()
          Method returns string version of this URI's referral URI.
 boolean forceFetch()
          If this method returns true, this URI should be fetched even though it already has been crawled.
static CandidateURI fromString(java.lang.String uriHopsViaString)
          Given a string containing a URI, then optional whitespace-delimited hops-path and via info, create a CandidateURI instance.
 st.ata.util.AList getAList()
          Assumption is that only one thread at a time will ever be accessing a particular CandidateURI.
 java.lang.String getCandidateURIString()
           
 java.lang.String getClassKey()
          Get the token (usually the hostname + port) which indicates what "class" this CrawlURI should be grouped with, for the purposes of ensuring only one item of the class is processed at once, all items of the class are held for a politeness period, etc.
 int getInt(java.lang.String key)
           
 long getLong(java.lang.String key)
           
 java.lang.Object getObject(java.lang.String key)
           
 java.lang.String getPathFromSeed()
           
 java.lang.String[] getReports()
          Get an array of report names offered by this Reporter.
 int getSchedulingDirective()
           
 java.lang.String getString(java.lang.String key)
           
 int getTransHops()
          Tally up the number of transitive (non-simple-link) hops at the end of this CandidateURI's pathFromSeed.
 java.lang.String getURIString()
          Deprecated. Use toString().
 UURI getUURI()
           
 UURI getVia()
           
 java.lang.CharSequence getViaContext()
           
protected  void inheritFrom(CandidateURI ancestor)
          Inherit (copy) the relevant keys-values from the ancestor.
 boolean isLocation()
           
 boolean isSeed()
           
 java.util.Iterator keys()
           
 void makeHeritable(java.lang.String key)
          Make the given key 'heritable', meaning its value will be added to descendant CandidateURIs.
 void makeNonHeritable(java.lang.String key)
          Make the given key non-'heritable', meaning its value will not be added to descendant CandidateURIs.
 boolean needsImmediateScheduling()
           
 boolean needsSoonScheduling()
           
 void putInt(java.lang.String key, int value)
           
 void putLong(java.lang.String key, long value)
           
 void putObject(java.lang.String key, java.lang.Object value)
           
 void putString(java.lang.String key, java.lang.String value)
           
protected  UURI readUuri(java.lang.String u)
          Read a UURI from a String, handling a null or URIException
 void remove(java.lang.String key)
           
 void reportTo(java.io.PrintWriter writer)
          Make a default report to the passed-in Writer.
 void reportTo(java.lang.String name, java.io.PrintWriter writer)
          Make a report of the given name to the passed-in Writer. If the name is null, give the default report.
 boolean sameDomainAs(CandidateURI other)
          Compares the domain of this CandidateURI with that of another CandidateURI
protected  void setAList(st.ata.util.AList alist)
          Called when making a copy of another CandidateURI.
 void setClassKey(java.lang.String key)
           
 void setForceFetch(boolean b)
          Method to signal that this URI should be fetched even though it already has been crawled.
 void setIsSeed(boolean b)
          Set the isSeed attribute of this URI.
protected  void setPathFromSeed(java.lang.String string)
           
 void setSchedulingDirective(int schedulingDirective)
           
 void setVia(UURI via)
           
 java.lang.String singleLineLegend()
          Return a legend for the single-line summary report as a String.
 java.lang.String singleLineReport()
          Return a short single-line summary report as a String.
 void singleLineReportTo(java.io.PrintWriter w)
          Make a single-line summary report to the passed-in writer
 java.lang.String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

HIGHEST

public static final int HIGHEST
Highest scheduling priority. Before any others of its class.

See Also:
Constant Field Values

HIGH

public static final int HIGH
High scheduling priority. After any HIGHEST.

See Also:
Constant Field Values

MEDIUM

public static final int MEDIUM
Medium priority. After any HIGH.

See Also:
Constant Field Values

NORMAL

public static final int NORMAL
Normal/low priority. Whenever/end of queue.

See Also:
Constant Field Values
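
The four priority fields describe an ordering: HIGHEST before HIGH, HIGH before MEDIUM, and NORMAL at the end of the queue. The sketch below illustrates that ordering with a stable sort. The numeric values, the Item class, and the order method are assumptions made for illustration (see Constant Field Values for the real constants); an actual frontier applies the directive within each queue rather than over one flat list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of the documented priority ordering; not Heritrix code.
class SchedulingOrderSketch {
    // Assumed values for illustration only: a lower number means an earlier slot.
    static final int HIGHEST = 0;
    static final int HIGH = 1;
    static final int MEDIUM = 2;
    static final int NORMAL = 3;

    static final class Item {
        final String uri;
        final int directive;
        Item(String uri, int directive) {
            this.uri = uri;
            this.directive = directive;
        }
    }

    // Stable sort: lower directive first, equal directives keep arrival order.
    static List<String> order(List<Item> arrivals) {
        List<Item> sorted = new ArrayList<>(arrivals);
        sorted.sort(Comparator.comparingInt((Item i) -> i.directive));
        List<String> out = new ArrayList<>();
        for (Item i : sorted) {
            out.add(i.uri);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Item> arrivals = new ArrayList<>();
        arrivals.add(new Item("http://example.com/page", NORMAL));
        arrivals.add(new Item("http://example.com/robots.txt", HIGHEST));
        System.out.println(order(arrivals));
    }
}
```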

Constructor Detail

CandidateURI

protected CandidateURI()
Constructor. Declared protected to block public access to the default constructor.


CandidateURI

public CandidateURI(UURI u)
Parameters:
u - uuri instance this CandidateURI wraps.

CandidateURI

public CandidateURI(UURI u,
                    java.lang.String pathFromSeed,
                    UURI via,
                    java.lang.CharSequence viaContext)
Parameters:
u - uuri instance this CandidateURI wraps.
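pathFromSeed - path (hop-types) from seed.
via - URI via which this one was discovered.
viaContext - context in which this one was discovered.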

Method Detail

setIsSeed

public void setIsSeed(boolean b)
Set the isSeed attribute of this URI.

Parameters:
b - Is this URI a seed, true or false.

getUURI

public UURI getUURI()
Returns:
UURI

isSeed

public boolean isSeed()
Returns:
Whether seeded.

getPathFromSeed

public java.lang.String getPathFromSeed()
Returns:
path (hop-types) from seed

getVia

public UURI getVia()
Returns:
URI via which this one was discovered

getViaContext

public java.lang.CharSequence getViaContext()
Returns:
CharSequence context in which this one was discovered

setPathFromSeed

protected void setPathFromSeed(java.lang.String string)
Parameters:
string -

setAList

protected void setAList(st.ata.util.AList alist)
Called when making a copy of another CandidateURI.

Parameters:
alist - AList to use.

setVia

public void setVia(UURI via)

getCandidateURIString

public java.lang.String getCandidateURIString()
Returns:
This candidate URI as a string wrapped with 'CandidateURI(' + ')'.

flattenVia

public java.lang.String flattenVia()
Method returns string version of this URI's referral URI.

Returns:
String version of referral URI

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object
Returns:
The UURI this CandidateURI wraps as a string. (We used to return what getCandidateURIString() returns on a toString -- use that method if you still need this functionality.)
See Also:
getCandidateURIString()

getURIString

public java.lang.String getURIString()
Deprecated. Use toString().

Returns:
URI String

sameDomainAs

public boolean sameDomainAs(CandidateURI other)
                     throws org.apache.commons.httpclient.URIException
Compares the domain of this CandidateURI with that of another CandidateURI

Parameters:
other - The other CandidateURI
Returns:
True if both are in the same domain, false otherwise.
Throws:
org.apache.commons.httpclient.URIException

forceFetch

public boolean forceFetch()
If this method returns true, this URI should be fetched even though it already has been crawled. This also implies that this URI will be scheduled for crawl before any other waiting URIs for the same host. This value is used to refetch any expired robots.txt or dns-lookups.

Returns:
true if crawling of this URI should be forced

setForceFetch

public void setForceFetch(boolean b)
Method to signal that this URI should be fetched even though it already has been crawled. Setting this to true also implies that this URI will be scheduled for crawl before any other waiting URIs for the same host. This value is used to refetch any expired robots.txt or dns-lookups.

Parameters:
b - set to true to enforce the crawling of this URI
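
The effect of the force-fetch flag on an already-crawled URI can be pictured with a small already-seen filter. This is only a sketch of the idea: ForceFetchSketch and shouldSchedule are hypothetical names, and the sketch does not model the "before any other waiting URIs for the same host" ordering that the flag also implies.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical already-seen filter; not Heritrix code.
class ForceFetchSketch {
    private final Set<String> alreadySeen = new HashSet<>();

    // Returns true if the URI should be scheduled for fetching.
    boolean shouldSchedule(String uri, boolean forceFetch) {
        // Set.add returns false when the URI was already recorded.
        boolean firstSighting = alreadySeen.add(uri);
        return firstSighting || forceFetch;
    }

    public static void main(String[] args) {
        ForceFetchSketch filter = new ForceFetchSketch();
        String robots = "http://example.com/robots.txt";
        System.out.println(filter.shouldSchedule(robots, false)); // true: first sighting
        System.out.println(filter.shouldSchedule(robots, false)); // false: already seen
        System.out.println(filter.shouldSchedule(robots, true));  // true: forced refetch
    }
}
```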

getSchedulingDirective

public int getSchedulingDirective()
Returns:
Returns the schedulingDirective.

setSchedulingDirective

public void setSchedulingDirective(int schedulingDirective)
Parameters:
schedulingDirective - The schedulingDirective to set.

needsImmediateScheduling

public boolean needsImmediateScheduling()
Returns:
True if needs immediate scheduling.

needsSoonScheduling

public boolean needsSoonScheduling()
Returns:
True if this URI needs to be scheduled soon, but not at top priority.

getTransHops

public int getTransHops()
Tally up the number of transitive (non-simple-link) hops at the end of this CandidateURI's pathFromSeed. In some cases, URIs with more than zero but fewer than some threshold number of such hops are treated specially.

TODO: consider moving link-count in here as well, caching calculation, and refactoring CrawlScope.exceedsMaxHops() to use this.

Returns:
Transhop count.
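
The tally described above can be sketched as a backward scan over the hop-type string. The sketch assumes that pathFromSeed holds one character per hop and that 'L' marks a simple link hop; both the character and the names TransHopsSketch and transHops are assumptions for illustration.

```java
// Hypothetical illustration of counting trailing transitive hops; not Heritrix code.
class TransHopsSketch {
    // Assumption: 'L' marks a simple (navigational) link hop in pathFromSeed.
    static final char SIMPLE_LINK_HOP = 'L';

    // Count hops from the end of the path until the first simple link hop.
    static int transHops(String pathFromSeed) {
        int count = 0;
        for (int i = pathFromSeed.length() - 1; i >= 0; i--) {
            if (pathFromSeed.charAt(i) == SIMPLE_LINK_HOP) {
                break;
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(transHops("LLEX")); // 2: two non-'L' hops at the end
        System.out.println(transHops("LL"));   // 0: path ends in a simple link
    }
}
```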

fromString

public static CandidateURI fromString(java.lang.String uriHopsViaString)
                               throws org.apache.commons.httpclient.URIException
Given a string containing a URI, then optional whitespace-delimited hops-path and via info, create a CandidateURI instance.

Parameters:
uriHopsViaString - String with a URI.
Returns:
A CandidateURI made from passed uriHopsViaString.
Throws:
org.apache.commons.httpclient.URIException
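
A minimal sketch of splitting such a string into its parts follows. The field order (URI, then hops-path, then via) is taken from the description above; the class UriHopsViaSketch, the empty-string default for a missing hops-path, and the null default for a missing via are assumptions, and the real method builds UURI instances rather than keeping plain strings.

```java
// Hypothetical parser for a "URI [hops-path [via]]" line; not Heritrix code.
class UriHopsViaSketch {
    final String uri;
    final String pathFromSeed;
    final String via;

    private UriHopsViaSketch(String uri, String pathFromSeed, String via) {
        this.uri = uri;
        this.pathFromSeed = pathFromSeed;
        this.via = via;
    }

    static UriHopsViaSketch parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts[0].isEmpty()) {
            throw new IllegalArgumentException("no URI in input");
        }
        // Assumed defaults: empty hops-path and null via when absent.
        String hops = parts.length > 1 ? parts[1] : "";
        String via = parts.length > 2 ? parts[2] : null;
        return new UriHopsViaSketch(parts[0], hops, via);
    }

    public static void main(String[] args) {
        UriHopsViaSketch c = parse("http://example.com/page LLE http://example.com/");
        System.out.println(c.uri + " | " + c.pathFromSeed + " | " + c.via);
    }
}
```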

createSeedCandidateURI

public static CandidateURI createSeedCandidateURI(UURI uuri)

createCandidateURI

public CandidateURI createCandidateURI(UURI baseUURI,
                                       Link link)
                                throws org.apache.commons.httpclient.URIException
Utility method for creation of CandidateURIs found when extracting links from this CrawlURI.

Parameters:
baseUURI - BaseUURI for link.
link - Link to wrap CandidateURI in.
Returns:
New candidateURI wrapper around link.
Throws:
org.apache.commons.httpclient.URIException

createCandidateURI

public CandidateURI createCandidateURI(UURI baseUURI,
                                       Link link,
                                       int scheduling,
                                       boolean seed)
                                throws org.apache.commons.httpclient.URIException
Utility method for creation of CandidateURIs found when extracting links from this CrawlURI.

Parameters:
baseUURI - BaseUURI for link.
link - Link to wrap CandidateURI in.
scheduling - How new CandidateURI should be scheduled.
seed - True if this CandidateURI is a seed.
Returns:
New candidateURI wrapper around link.
Throws:
org.apache.commons.httpclient.URIException

inheritFrom

protected void inheritFrom(CandidateURI ancestor)
Inherit (copy) the relevant keys-values from the ancestor.

Parameters:
ancestor -

getClassKey

public java.lang.String getClassKey()
Get the token (usually the hostname + port) which indicates what "class" this CrawlURI should be grouped with, for the purposes of ensuring only one item of the class is processed at once, all items of the class are held for a politeness period, etc.

Returns:
Token (usually the hostname + port) which indicates what "class" this CrawlURI should be grouped with.
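
To make the "hostname + port" grouping concrete, the sketch below derives such a token with java.net.URI. It is an illustration only: in practice the key is assigned elsewhere and set through setClassKey, and the exact token format, the "unknown" fallback, and the name ClassKeySketch are assumptions.

```java
import java.net.URI;

// Hypothetical derivation of a host-and-port grouping token; not Heritrix code.
class ClassKeySketch {
    static String classKey(String uriString) {
        URI u = URI.create(uriString);
        String host = u.getHost();
        if (host == null) {
            return "unknown"; // assumed fallback for URIs without a host
        }
        int port = u.getPort(); // -1 when the URI does not name a port
        return port == -1 ? host : host + ":" + port;
    }

    public static void main(String[] args) {
        // Both URIs share a key, so they would be held in the same politeness group.
        System.out.println(classKey("http://example.com:8080/a"));
        System.out.println(classKey("http://example.com:8080/b?x=1"));
    }
}
```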

setClassKey

public void setClassKey(java.lang.String key)

getAList

public st.ata.util.AList getAList()
Assumption is that only one thread at a time will ever be accessing a particular CandidateURI.

Returns:
the attribute list.

clearAList

protected void clearAList()

putObject

public void putObject(java.lang.String key,
                      java.lang.Object value)

getObject

public java.lang.Object getObject(java.lang.String key)

getString

public java.lang.String getString(java.lang.String key)

putString

public void putString(java.lang.String key,
                      java.lang.String value)

getLong

public long getLong(java.lang.String key)

putLong

public void putLong(java.lang.String key,
                    long value)

getInt

public int getInt(java.lang.String key)

putInt

public void putInt(java.lang.String key,
                   int value)

containsKey

public boolean containsKey(java.lang.String key)

remove

public void remove(java.lang.String key)

keys

public java.util.Iterator keys()

isLocation

public boolean isLocation()
Returns:
True if this CandidateURI was the result of a redirect, i.e. its parent URI redirected to here and this URI was what was in the 'Location:' or 'Content-Location:' HTTP header.

readUuri

protected UURI readUuri(java.lang.String u)
Read a UURI from a String, handling a null or URIException

Parameters:
u - String or null from which to create UURI
Returns:
the best UURI instance creatable
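
The null-and-exception handling described above follows a common pattern, sketched here with java.net.URI in place of UURI. Returning null when the string is null or unparseable is an assumption of this sketch; the documented method only promises "the best UURI instance creatable". The names SafeUriReadSketch and readUri are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical null-tolerant URI reader; not Heritrix code.
class SafeUriReadSketch {
    static URI readUri(String u) {
        if (u == null) {
            return null; // nothing to parse
        }
        try {
            return new URI(u);
        } catch (URISyntaxException e) {
            // Assumption: swallow the parse failure and signal it with null.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(readUri("http://example.com/"));
        System.out.println(readUri(null));
    }
}
```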

singleLineReport

public java.lang.String singleLineReport()
Description copied from interface: Reporter
Return a short single-line summary report as a String.

Specified by:
singleLineReport in interface Reporter
Returns:
String single-line summary report

singleLineReportTo

public void singleLineReportTo(java.io.PrintWriter w)
Description copied from interface: Reporter
Make a single-line summary report to the passed-in writer

Specified by:
singleLineReportTo in interface Reporter
Parameters:
w - to receive report

singleLineLegend

public java.lang.String singleLineLegend()
Description copied from interface: Reporter
Return a legend for the single-line summary report as a String.

Specified by:
singleLineLegend in interface Reporter
Returns:
String single-line summary legend

getReports

public java.lang.String[] getReports()
Description copied from interface: Reporter
Get an array of report names offered by this Reporter. A name in brackets indicates that a free-form String, in accordance with the informal description inside the brackets, may yield a useful report.

Specified by:
getReports in interface Reporter
Returns:
String array of report names, empty if there is only one report type

reportTo

public void reportTo(java.lang.String name,
                     java.io.PrintWriter writer)
Description copied from interface: Reporter
Make a report of the given name to the passed-in Writer. If the name is null, give the default report.

Specified by:
reportTo in interface Reporter
Parameters:
name - name of the report to make, or null for the default report
writer - to receive report

reportTo

public void reportTo(java.io.PrintWriter writer)
              throws java.io.IOException
Description copied from interface: Reporter
Make a default report to the passed-in Writer. Should be equivalent to reportTo(null, writer).

Specified by:
reportTo in interface Reporter
Parameters:
writer - to receive report
Throws:
java.io.IOException
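
The documented equivalence between the two reportTo overloads can be sketched as a simple delegation. ReporterSketch and its report text are hypothetical, and this stand-in offers only a default report, so the name argument is ignored.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical reporter with a single default report; not Heritrix code.
class ReporterSketch {
    String singleLineReport() {
        return "http://example.com/ LL";
    }

    // Only the default report exists in this sketch, so name is ignored.
    void reportTo(String name, PrintWriter writer) {
        writer.print(singleLineReport());
        writer.print('\n');
    }

    // Default report: delegate with a null name.
    void reportTo(PrintWriter writer) {
        reportTo(null, writer);
    }

    public static void main(String[] args) {
        StringWriter buffer = new StringWriter();
        PrintWriter out = new PrintWriter(buffer);
        new ReporterSketch().reportTo(out);
        out.flush();
        System.out.print(buffer);
    }
}
```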

makeHeritable

public void makeHeritable(java.lang.String key)
Make the given key 'heritable', meaning its value will be added to descendant CandidateURIs. Only keys with immutable values should be made heritable -- the value instance may be shared until the AList is serialized/deserialized.

Parameters:
key - to make heritable

makeNonHeritable

public void makeNonHeritable(java.lang.String key)
Make the given key non-'heritable', meaning its value will not be added to descendant CandidateURIs. Only meaningful if key was previously made heritable.

Parameters:
key - to make non-heritable
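
The heritable-key mechanism can be pictured as follows. This is a sketch under assumptions: HeritableSketch, createChild, and the keys used are hypothetical, and the choice to pass heritability itself down to the child is an assumption about how values reach further descendants. Note that the child receives the same value instance as the parent, which is why only immutable values should be made heritable.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of heritable attributes; not Heritrix code.
class HeritableSketch {
    final Map<String, Object> attrs = new HashMap<>();
    final Set<String> heritable = new HashSet<>();

    void makeHeritable(String key) { heritable.add(key); }
    void makeNonHeritable(String key) { heritable.remove(key); }

    // Copy each heritable key and its value into a new descendant.
    HeritableSketch createChild() {
        HeritableSketch child = new HeritableSketch();
        for (String key : heritable) {
            if (attrs.containsKey(key)) {
                child.attrs.put(key, attrs.get(key)); // shared instance, not a copy
                child.heritable.add(key);             // assumed: heritability is passed on
            }
        }
        return child;
    }

    public static void main(String[] args) {
        HeritableSketch parent = new HeritableSketch();
        parent.attrs.put("source-tag", "seed-42");
        parent.attrs.put("scratch", "not inherited");
        parent.makeHeritable("source-tag");
        System.out.println(parent.createChild().attrs);
    }
}
```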


Copyright © 2003-2011 Internet Archive. All Rights Reserved.