org.archive.crawler.datamodel
Interface UriUniqFilter

All Known Implementing Classes:
BdbUriUniqFilter, BloomUriUniqFilter, DiskFPMergeUriUniqFilter, FPMergeUriUniqFilter, FPUriUniqFilter, MemFPMergeUriUniqFilter, MemUriUniqFilter, NoopUriUniqFilter, SetBasedUriUniqFilter

public interface UriUniqFilter

A UriUniqFilter passes URI objects to a destination (receiver) if the passed URI object has not been previously seen. If already seen, the passed URI object is dropped.

For efficiency when comparing against a large history of seen URIs, URI objects may not be passed on immediately unless addNow() is used or a flush is forced (see requestFlush()).
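The pass-if-unseen contract described above can be illustrated with a minimal in-memory sketch. It is not one of the implementing classes listed on this page: the class name SeenSetSketch is hypothetical, values are plain Strings rather than CandidateURI, and a java.util.function.Consumer stands in for the receiver.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Illustrative stand-in only: Strings replace CandidateURI and a
// Consumer<String> replaces the receiver interface.
class SeenSetSketch {
    private final Set<String> seen = new HashSet<>();
    private final Consumer<String> receiver;

    SeenSetSketch(Consumer<String> receiver) {
        this.receiver = receiver;
    }

    // Forward the value only the first time its key is offered.
    void add(String key, String value) {
        if (seen.add(key)) {
            receiver.accept(value);
        }
    }

    // Number of distinct keys recorded so far.
    long count() {
        return seen.size();
    }

    public static void main(String[] args) {
        List<String> passed = new ArrayList<>();
        SeenSetSketch filter = new SeenSetSketch(passed::add);
        filter.add("http://example.com/", "http://example.com/");
        filter.add("http://example.com/", "http://EXAMPLE.com/"); // same key: dropped
        filter.add("http://example.com/a", "http://example.com/a");
        System.out.println(passed);         // [http://example.com/, http://example.com/a]
        System.out.println(filter.count()); // 2
    }
}
```

This sketch forwards every unseen item at once; an implementation that defers the check would hold items back until a flush.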

Version:
$Date: 2005-12-16 03:10:54 +0000 (Fri, 16 Dec 2005) $, $Revision: 4036 $
Author:
gojomo

Nested Class Summary
static interface UriUniqFilter.HasUriReceiver
          URIs that have not been seen before 'visit' this 'Visitor'.
 
Method Summary
 void add(java.lang.String key, CandidateURI value)
          Add given uri, if not already present.
 void addForce(java.lang.String key, CandidateURI value)
          Add given uri, all the way through to underlying destination, even if already present.
 void addNow(java.lang.String key, CandidateURI value)
          Immediately add uri.
 void close()
          Close down any allocated resources.
 long count()
          Count of already seen URIs.
 void forget(java.lang.String key, CandidateURI value)
          Forget that the item was seen.
 void note(java.lang.String key)
          Note item as seen, without passing through to receiver.
 long pending()
          Count of items added, but not yet filtered in or out.
 long requestFlush()
          Request that any pending items be added/dropped.
 void setDestination(UriUniqFilter.HasUriReceiver receiver)
          Receiver of unique URIs.
 void setProfileLog(java.io.File logfile)
          Set a File to receive a log for replay profiling.
 

Method Detail

count

long count()
Returns:
Count of already seen URIs.

pending

long pending()
Count of items added, but not yet filtered in or out. Some implementations may buffer up large numbers of pending items to be evaluated in a later large batch/scan/merge with disk files.

Returns:
Count of items added not yet evaluated

setDestination

void setDestination(UriUniqFilter.HasUriReceiver receiver)
Receiver of unique URIs. Items that have not been seen before are passed through to this object.

Parameters:
receiver - Object that will be passed items. Must implement HasUriReceiver interface.

add

void add(java.lang.String key,
         CandidateURI value)
Add given uri, if not already present.

Parameters:
key - Usually a canonicalized version of value. This key is used for lookups, forgets, and insertions on the already-included list.
value - item to add.
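To illustrate why the key is kept separate from the value, the following sketch derives one key from two spellings of the same address. The canonicalization shown (lowercase scheme and host, resolve dot segments, drop the fragment) is a hypothetical example built on java.net.URI, not the canonicalization rules the crawler itself applies; the class and method names are assumptions.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical key derivation for illustration only.
class CanonicalKeySketch {
    static String canonicalKey(String uri) {
        // normalize() resolves "." and ".." path segments.
        URI u = URI.create(uri).normalize();
        String scheme = u.getScheme() == null
                ? null : u.getScheme().toLowerCase(Locale.ROOT);
        String host = u.getHost() == null
                ? null : u.getHost().toLowerCase(Locale.ROOT);
        try {
            // Rebuild without the fragment (last argument is null).
            return new URI(scheme, u.getUserInfo(), host, u.getPort(),
                    u.getPath(), u.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String a = canonicalKey("HTTP://Example.COM/a/../b#top");
        String b = canonicalKey("http://example.com/b");
        // Both spellings map to the same key, so a filter keyed this way
        // would treat the second one as already seen.
        System.out.println(a);
        System.out.println(a.equals(b));
    }
}
```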

addNow

void addNow(java.lang.String key,
            CandidateURI value)
Immediately add uri.

Parameters:
key - Usually a canonicalized version of value. This key is used for lookups, forgets, and insertions on the already-included list.
value - item to add.

addForce

void addForce(java.lang.String key,
              CandidateURI value)
Add given uri, all the way through to underlying destination, even if already present. (Sometimes a URI must be fetched, or refetched, for example when DNS or robots info expires or the operator forces a refetch. A normal add() or addNow() would drop the URI without forwarding it on once it is determined to already be in the filter.)

Parameters:
key - Usually a canonicalized version of value. This key is used for lookups, forgets, and insertions on the already-included list.
value - item to add.
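The difference between add() and addForce() can be sketched as follows. This is an illustrative stand-in, not an implementing class: ForceAddSketch is a hypothetical name, Strings replace CandidateURI, a Consumer replaces the receiver, and recording the key inside addForce() is an assumption about how an implementation might behave.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Illustrative stand-in only.
class ForceAddSketch {
    private final Set<String> seen = new HashSet<>();
    private final Consumer<String> receiver;

    ForceAddSketch(Consumer<String> receiver) {
        this.receiver = receiver;
    }

    // Normal path: drop the value if the key is already recorded.
    void add(String key, String value) {
        if (seen.add(key)) {
            receiver.accept(value);
        }
    }

    // Forced path: forward the value whether or not the key is recorded.
    void addForce(String key, String value) {
        seen.add(key); // assumption: the key is recorded either way
        receiver.accept(value);
    }

    public static void main(String[] args) {
        List<String> passed = new ArrayList<>();
        ForceAddSketch filter = new ForceAddSketch(passed::add);
        filter.add("dns:example.com", "dns:example.com");      // forwarded
        filter.add("dns:example.com", "dns:example.com");      // dropped
        filter.addForce("dns:example.com", "dns:example.com"); // forwarded again
        System.out.println(passed.size()); // 2
    }
}
```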

note

void note(java.lang.String key)
Note item as seen, without passing through to receiver.

Parameters:
key - Usually a canonicalized version of a URI. This key is used for lookups, forgets, and insertions on the already-included list.

forget

void forget(java.lang.String key,
            CandidateURI value)
Forget that the item was seen.

Parameters:
key - Usually a canonicalized version of a URI. This key is used for lookups, forgets, and insertions on the already-included list.
value - item to forget.
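How note() and forget() interact with add() can be sketched with the stand-in below. NoteForgetSketch is a hypothetical name, Strings replace CandidateURI, and a Consumer replaces the receiver; the value argument of forget() is kept to mirror the signature but is unused in this sketch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Illustrative stand-in only.
class NoteForgetSketch {
    private final Set<String> seen = new HashSet<>();
    private final Consumer<String> receiver;

    NoteForgetSketch(Consumer<String> receiver) {
        this.receiver = receiver;
    }

    void add(String key, String value) {
        if (seen.add(key)) {
            receiver.accept(value);
        }
    }

    // Record the key without forwarding anything to the receiver.
    void note(String key) {
        seen.add(key);
    }

    // Remove the key so that a later add() is forwarded again.
    void forget(String key, String value) {
        seen.remove(key);
    }

    public static void main(String[] args) {
        List<String> passed = new ArrayList<>();
        NoteForgetSketch filter = new NoteForgetSketch(passed::add);
        filter.note("http://example.com/");
        filter.add("http://example.com/", "http://example.com/"); // dropped: already noted
        filter.forget("http://example.com/", "http://example.com/");
        filter.add("http://example.com/", "http://example.com/"); // forwarded
        System.out.println(passed.size()); // 1
    }
}
```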

requestFlush

long requestFlush()
Request that any pending items be added/dropped. Implementors may ignore the request if a flush would be too expensive/too soon.

Returns:
Number added.
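The relationship between add(), pending() and requestFlush() in a deferring implementation can be sketched as below. This is a simplified in-memory stand-in, not the disk-merge logic of any implementing class: BufferedFilterSketch is a hypothetical name, Strings replace CandidateURI, a Consumer replaces the receiver, and this sketch always honors the flush request.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Illustrative stand-in only.
class BufferedFilterSketch {
    // One buffered (key, value) pair awaiting evaluation.
    private static final class Pending {
        final String key;
        final String value;

        Pending(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Set<String> seen = new HashSet<>();
    private final List<Pending> buffer = new ArrayList<>();
    private final Consumer<String> receiver;

    BufferedFilterSketch(Consumer<String> receiver) {
        this.receiver = receiver;
    }

    // Defer the seen check: just queue the item.
    void add(String key, String value) {
        buffer.add(new Pending(key, value));
    }

    // Items queued but not yet filtered in or out.
    long pending() {
        return buffer.size();
    }

    // Evaluate the whole queue in one batch; return how many were forwarded.
    long requestFlush() {
        long added = 0;
        for (Pending p : buffer) {
            if (seen.add(p.key)) {
                receiver.accept(p.value);
                added++;
            }
        }
        buffer.clear();
        return added;
    }

    public static void main(String[] args) {
        List<String> passed = new ArrayList<>();
        BufferedFilterSketch filter = new BufferedFilterSketch(passed::add);
        filter.add("a", "a");
        filter.add("a", "a");
        filter.add("b", "b");
        System.out.println(filter.pending());      // 3
        System.out.println(filter.requestFlush()); // 2
        System.out.println(filter.pending());      // 0
    }
}
```

A caller that needs everything evaluated would have to check pending() after requesting a flush, since an implementation is allowed to ignore the request.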

close

void close()
Close down any allocated resources. Calling this makes sense when checkpointing.


setProfileLog

void setProfileLog(java.io.File logfile)
Set a File to receive a log for replay profiling.



Copyright © 2003-2011 Internet Archive. All Rights Reserved.