org.archive.crawler.util
Class BloomUriUniqFilter

java.lang.Object
  extended by org.archive.crawler.util.SetBasedUriUniqFilter
      extended by org.archive.crawler.util.BloomUriUniqFilter
All Implemented Interfaces:
java.io.Serializable, UriUniqFilter

public class BloomUriUniqFilter
extends SetBasedUriUniqFilter
implements java.io.Serializable

A MG4J BloomFilter-based implementation of an AlreadySeen list. This implementation performs adequately without blowing out the heap through to very large numbers of URIs. See AlreadySeen. It is inherent to Bloom filters that as they get 'saturated', their false-positive rate rises. The default parameters used by this class attempt to maintain a 1-in-4 million (1 in 2^22) false-positive chance through 125 million unique inserts, which creates a filter structure about 495MB in size. You may use the following system properties to tune the size and false-positive rate of the bloom filter structure used by this class: org.archive.crawler.util.BloomUriUniqFilter.expected-size (default 125000000) org.archive.crawler.util.BloomUriUniqFilter.hash-count (default 22) The resulting filter will take up approximately... 1.44 * expected-size * hash-count / 8 ...bytes. The default size is very close to the maximum practical size of the Bloom filter implementation, BloomFilter32bitSplit, created in the initialize() method, due to integer arithmetic limits. If you need a larger filter, you should edit the initialize method to intantiate a BloomFilter64bit instead.

Version:
$Date: 2010-06-15 22:22:25 +0000 (Tue, 15 Jun 2010) $, $Revision: 6891 $
Author:
gojomo
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from interface org.archive.crawler.datamodel.UriUniqFilter
UriUniqFilter.HasUriReceiver
 
Field Summary
(package private)  BloomFilter bloom
           
protected  int expected_n
           
protected static java.lang.String EXPECTED_SIZE_KEY
           
protected static java.lang.String HASH_COUNT_KEY
           
 
Fields inherited from class org.archive.crawler.util.SetBasedUriUniqFilter
duplicateCount, duplicatesAtLastSample, profileLog, receiver
 
Constructor Summary
BloomUriUniqFilter()
          Default constructor
BloomUriUniqFilter(int n, int d)
          Constructor.
 
Method Summary
 void forget(java.lang.String canonical, CandidateURI item)
          Forget item was seen
protected  void initialize(int n, int d)
          Initializer shared by constructors.
protected  boolean setAdd(java.lang.CharSequence uri)
           
protected  long setCount()
           
protected  boolean setRemove(java.lang.CharSequence uri)
           
 
Methods inherited from class org.archive.crawler.util.SetBasedUriUniqFilter
add, addForce, addNow, close, count, note, pending, profileLog, requestFlush, setDestination, setProfileLog
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

bloom

BloomFilter bloom

expected_n

protected int expected_n

EXPECTED_SIZE_KEY

protected static final java.lang.String EXPECTED_SIZE_KEY
See Also:
Constant Field Values

HASH_COUNT_KEY

protected static final java.lang.String HASH_COUNT_KEY
See Also:
Constant Field Values
Constructor Detail

BloomUriUniqFilter

public BloomUriUniqFilter()
Default constructor


BloomUriUniqFilter

public BloomUriUniqFilter(int n,
                          int d)
Constructor.

Parameters:
n - the expected number of elements.
d - the number of hash functions; if the filter adds not more than n elements, false positives will happen with probability 2-d.
Method Detail

initialize

protected void initialize(int n,
                          int d)
Initializer shared by constructors.

Parameters:
n - the expected number of elements.
d - the number of hash functions; if the filter adds not more than n elements, false positives will happen with probability 2-d.

forget

public void forget(java.lang.String canonical,
                   CandidateURI item)
Description copied from interface: UriUniqFilter
Forget item was seen

Specified by:
forget in interface UriUniqFilter
Overrides:
forget in class SetBasedUriUniqFilter
Parameters:
canonical - Usually a canonicalized version of an URI. This is the key used doing lookups, forgets and insertions on the already included list.
item - item to add.

setAdd

protected boolean setAdd(java.lang.CharSequence uri)
Specified by:
setAdd in class SetBasedUriUniqFilter

setCount

protected long setCount()
Specified by:
setCount in class SetBasedUriUniqFilter

setRemove

protected boolean setRemove(java.lang.CharSequence uri)
Specified by:
setRemove in class SetBasedUriUniqFilter


Copyright © 2003-2011 Internet Archive. All Rights Reserved.