org.archive.crawler.util
Class BloomUriUniqFilter
java.lang.Object
org.archive.crawler.util.SetBasedUriUniqFilter
org.archive.crawler.util.BloomUriUniqFilter
- All Implemented Interfaces:
- java.io.Serializable, UriUniqFilter
public class BloomUriUniqFilter
- extends SetBasedUriUniqFilter
- implements java.io.Serializable
A MG4J BloomFilter-based implementation of an AlreadySeen list.
This implementation performs adequately without blowing out
the heap through to very large numbers of URIs. See
AlreadySeen.
It is inherent to Bloom filters that as they get 'saturated', their
false-positive rate rises. The default parameters used by this class
attempt to maintain a 1-in-4 million (1 in 2^22) false-positive chance
through 125 million unique inserts, which creates a filter structure
about 495MB in size.
You may use the following system properties to tune the size and
false-positive rate of the bloom filter structure used by this class:
org.archive.crawler.util.BloomUriUniqFilter.expected-size (default 125000000)
org.archive.crawler.util.BloomUriUniqFilter.hash-count (default 22)
The resulting filter will take up approximately...
1.44 * expected-size * hash-count / 8
...bytes.
The default size is very close to the maximum practical size of the
Bloom filter implementation, BloomFilter32bitSplit, created in the
initialize() method, due to integer arithmetic limits.
If you need a larger filter, you should edit the initialize
method to intantiate a BloomFilter64bit instead.
- Version:
- $Date: 2010-06-15 22:22:25 +0000 (Tue, 15 Jun 2010) $, $Revision: 6891 $
- Author:
- gojomo
- See Also:
- Serialized Form
Method Summary |
void |
forget(java.lang.String canonical,
CandidateURI item)
Forget item was seen |
protected void |
initialize(int n,
int d)
Initializer shared by constructors. |
protected boolean |
setAdd(java.lang.CharSequence uri)
|
protected long |
setCount()
|
protected boolean |
setRemove(java.lang.CharSequence uri)
|
Methods inherited from class org.archive.crawler.util.SetBasedUriUniqFilter |
add, addForce, addNow, close, count, note, pending, profileLog, requestFlush, setDestination, setProfileLog |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
bloom
BloomFilter bloom
expected_n
protected int expected_n
EXPECTED_SIZE_KEY
protected static final java.lang.String EXPECTED_SIZE_KEY
- See Also:
- Constant Field Values
HASH_COUNT_KEY
protected static final java.lang.String HASH_COUNT_KEY
- See Also:
- Constant Field Values
BloomUriUniqFilter
public BloomUriUniqFilter()
- Default constructor
BloomUriUniqFilter
public BloomUriUniqFilter(int n,
int d)
- Constructor.
- Parameters:
n
- the expected number of elements.d
- the number of hash functions; if the filter adds not more
than n
elements, false positives will happen with
probability 2-d.
initialize
protected void initialize(int n,
int d)
- Initializer shared by constructors.
- Parameters:
n
- the expected number of elements.d
- the number of hash functions; if the filter adds not more
than n
elements, false positives will happen with
probability 2-d.
forget
public void forget(java.lang.String canonical,
CandidateURI item)
- Description copied from interface:
UriUniqFilter
- Forget item was seen
- Specified by:
forget
in interface UriUniqFilter
- Overrides:
forget
in class SetBasedUriUniqFilter
- Parameters:
canonical
- Usually a canonicalized version of an URI
.
This is the key used doing lookups, forgets and insertions on the
already included list.item
- item to add.
setAdd
protected boolean setAdd(java.lang.CharSequence uri)
- Specified by:
setAdd
in class SetBasedUriUniqFilter
setCount
protected long setCount()
- Specified by:
setCount
in class SetBasedUriUniqFilter
setRemove
protected boolean setRemove(java.lang.CharSequence uri)
- Specified by:
setRemove
in class SetBasedUriUniqFilter
Copyright © 2003-2011 Internet Archive. All Rights Reserved.