|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Object org.archive.util.BloomFilter64bit
public class BloomFilter64bit
A Bloom filter. ADAPTED/IMPROVED VERSION OF MG4J it.unimi.dsi.mg4j.util.BloomFilter
KEY CHANGES:
Instances of this class represent a set of character sequences (with false positives) using a Bloom filter. Because of the way Bloom filters work, you cannot remove elements.
Bloom filters have an expected error rate, depending on the number of hash functions used, on the filter size and on the number of elements in the filter. This implementation uses a variable optimal number of hash functions, depending on the expected number of elements. More precisely, a Bloom filter for n character sequences with d hash functions will use ln 2 dn ≈ 1.44 dn bits; false positives will happen with probability 2-d.
Hash functions are generated at creation time using universal hashing.
Each hash function uses NUMBER_OF_WEIGHTS
random integers, which
are cyclically multiplied by the character codes in a character sequence.
The resulting integers are XOR-ed together.
This class exports access methods that are very similar to those of
Set
, but it does not implement that interface, as too
many non-optional methods would be unimplementable (e.g., iterators).
Field Summary | |
---|---|
protected static long |
ADDRESS_BITS_PER_UNIT
|
protected static long |
BIT_INDEX_MASK
|
protected long[][] |
bits
The underlying bit vector |
protected int |
d
The number of hash functions used by this filter. |
(package private) static boolean |
DEBUG
|
protected long |
expectedInserts
The expected number of inserts; determines calculated size |
protected long |
m
The number of bits in this filter. |
(package private) static double |
NATURAL_LOG_OF_2
The natural logarithm of 2, used in the computation of the number of bits. |
(package private) static int |
NUMBER_OF_WEIGHTS
The number of weights used to create hash functions. |
protected int |
power
if bitfield is an exact power of 2 in length, it is this power |
(package private) int |
size
The number of elements currently in the filter. |
protected static int |
SUBARRAY_LENGTH_IN_LONGS
number of longs in one subarray |
protected static int |
SUBARRAY_MASK
mask for lowest SUBARRAY_POWER_OF_TWO bits |
protected static int |
SUBARRAY_POWER_OF_TWO
power-of-two to use as maximum size of bitfield subarrays |
protected long[][] |
weight
The random integers used to generate the hash functions. |
Constructor Summary | |
---|---|
BloomFilter64bit(long n,
int d)
Creates a new Bloom filter with given number of hash functions and expected number of elements. |
|
BloomFilter64bit(long n,
int d,
boolean roundUp)
|
|
BloomFilter64bit(long n,
int d,
java.util.Random weightsGenerator,
boolean roundUp)
Creates a new Bloom filter with given number of hash functions and expected number of elements. |
Method Summary | |
---|---|
boolean |
add(java.lang.CharSequence s)
Adds a character sequence to the filter. |
long[] |
bitIndexesFor(java.lang.CharSequence s)
|
boolean |
contains(java.lang.CharSequence s)
Checks whether the given character sequence is in this filter. |
boolean |
getBit(long bitIndex)
Returns from the local bitvector the value of the bit with the specified index. |
long |
getExpectedInserts()
Report the number of expected inserts used at instantiation time to calculate the bitfield size. |
long |
getHashCount()
Report the number of internal independent hash function (and thus the number of bits set/checked for each item presented). |
long |
getSizeBytes()
The amount of memory in bytes consumed by the bloom bitfield. |
protected long |
hash(java.lang.CharSequence s,
int l,
int k)
Hashes the given sequence with the given hash function. |
protected void |
setBit(long bitIndex)
Changes the bit with index bitIndex in local bitvector. |
protected boolean |
setGetBit(long bitIndex)
Sets the bit with index bitIndex in local bitvector -- returning the old value. |
int |
size()
The number of character sequences in the filter. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
static final int NUMBER_OF_WEIGHTS
protected final long m
protected int power
protected final long expectedInserts
protected final int d
protected final long[][] bits
protected final long[][] weight
int size
static final double NATURAL_LOG_OF_2
protected static final int SUBARRAY_POWER_OF_TWO
protected static final int SUBARRAY_LENGTH_IN_LONGS
protected static final int SUBARRAY_MASK
static final boolean DEBUG
protected static final long ADDRESS_BITS_PER_UNIT
protected static final long BIT_INDEX_MASK
Constructor Detail |
---|
public BloomFilter64bit(long n, int d)
n
- the expected number of elements.d
- the number of hash functions; if the filter add not more
than n
elements, false positives will happen with
probability 2-d.public BloomFilter64bit(long n, int d, boolean roundUp)
public BloomFilter64bit(long n, int d, java.util.Random weightsGenerator, boolean roundUp)
n
- the expected number of elements.d
- the number of hash functions; if the filter add not more
than n
elements, false positives will happen with
probability 2-d.Random
- weightsGenerator may provide a seeded Random for reproducible
internal universal hash function weightingroundUp
- if true, round bit size up to next-nearest-power-of-2Method Detail |
---|
public int size()
size
in interface BloomFilter
contains(CharSequence)
).protected long hash(java.lang.CharSequence s, int l, int k)
s
- a character sequence.l
- the length of s
.k
- a hash function index (smaller than d
).
s
for the hash function k
.public long[] bitIndexesFor(java.lang.CharSequence s)
public boolean contains(java.lang.CharSequence s)
Note that this method may return true on a character sequence that is has not been added to the filter. This will happen with probability 2-d, where d is the number of hash functions specified at creation time, if the number of the elements in the filter is less than n, the number of expected elements specified at creation time.
contains
in interface BloomFilter
s
- a character sequence.
public boolean add(java.lang.CharSequence s)
add
in interface BloomFilter
s
- a character sequence.
contains(CharSequence)
).public boolean getBit(long bitIndex)
getBit
in interface BloomFilter
bitIndex
- the bit index.
protected void setBit(long bitIndex)
bitIndex
- the index of the bit to be set.protected boolean setGetBit(long bitIndex)
bitIndex
- the index of the bit to be set.public long getSizeBytes()
BloomFilter
getSizeBytes
in interface BloomFilter
public long getExpectedInserts()
BloomFilter
getExpectedInserts
in interface BloomFilter
public long getHashCount()
BloomFilter
getHashCount
in interface BloomFilter
|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |