org.archive.util
Class BloomFilter64bit

java.lang.Object
  extended by org.archive.util.BloomFilter64bit
All Implemented Interfaces:
java.io.Serializable, BloomFilter

public class BloomFilter64bit
extends java.lang.Object
implements java.io.Serializable, BloomFilter

A Bloom filter. ADAPTED/IMPROVED VERSION OF MG4J it.unimi.dsi.mg4j.util.BloomFilter

KEY CHANGES:


Instances of this class represent a set of character sequences (with false positives) using a Bloom filter. Because of the way Bloom filters work, you cannot remove elements.

Bloom filters have an expected error rate, depending on the number of hash functions used, on the filter size and on the number of elements in the filter. This implementation uses a variable optimal number of hash functions, depending on the expected number of elements. More precisely, a Bloom filter for n character sequences with d hash functions will use ln 2 dn ≈ 1.44 dn bits; false positives will happen with probability 2-d.

Hash functions are generated at creation time using universal hashing. Each hash function uses NUMBER_OF_WEIGHTS random integers, which are cyclically multiplied by the character codes in a character sequence. The resulting integers are XOR-ed together.

This class exports access methods that are very similar to those of Set, but it does not implement that interface, as too many non-optional methods would be unimplementable (e.g., iterators).

Author:
Sebastiano Vigna
See Also:
Serialized Form

Field Summary
protected static long ADDRESS_BITS_PER_UNIT
           
protected static long BIT_INDEX_MASK
           
protected  long[][] bits
          The underlying bit vector
protected  int d
          The number of hash functions used by this filter.
(package private) static boolean DEBUG
           
protected  long expectedInserts
          The expected number of inserts; determines calculated size
protected  long m
          The number of bits in this filter.
(package private) static double NATURAL_LOG_OF_2
          The natural logarithm of 2, used in the computation of the number of bits.
(package private) static int NUMBER_OF_WEIGHTS
          The number of weights used to create hash functions.
protected  int power
          if bitfield is an exact power of 2 in length, it is this power
(package private)  int size
          The number of elements currently in the filter.
protected static int SUBARRAY_LENGTH_IN_LONGS
          number of longs in one subarray
protected static int SUBARRAY_MASK
          mask for lowest SUBARRAY_POWER_OF_TWO bits
protected static int SUBARRAY_POWER_OF_TWO
          power-of-two to use as maximum size of bitfield subarrays
protected  long[][] weight
          The random integers used to generate the hash functions.
 
Constructor Summary
BloomFilter64bit(long n, int d)
          Creates a new Bloom filter with given number of hash functions and expected number of elements.
BloomFilter64bit(long n, int d, boolean roundUp)
           
BloomFilter64bit(long n, int d, java.util.Random weightsGenerator, boolean roundUp)
          Creates a new Bloom filter with given number of hash functions and expected number of elements.
 
Method Summary
 boolean add(java.lang.CharSequence s)
          Adds a character sequence to the filter.
 long[] bitIndexesFor(java.lang.CharSequence s)
           
 boolean contains(java.lang.CharSequence s)
          Checks whether the given character sequence is in this filter.
 boolean getBit(long bitIndex)
          Returns from the local bitvector the value of the bit with the specified index.
 long getExpectedInserts()
          Report the number of expected inserts used at instantiation time to calculate the bitfield size.
 long getHashCount()
          Report the number of internal independent hash function (and thus the number of bits set/checked for each item presented).
 long getSizeBytes()
          The amount of memory in bytes consumed by the bloom bitfield.
protected  long hash(java.lang.CharSequence s, int l, int k)
          Hashes the given sequence with the given hash function.
protected  void setBit(long bitIndex)
          Changes the bit with index bitIndex in local bitvector.
protected  boolean setGetBit(long bitIndex)
          Sets the bit with index bitIndex in local bitvector -- returning the old value.
 int size()
          The number of character sequences in the filter.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

NUMBER_OF_WEIGHTS

static final int NUMBER_OF_WEIGHTS
The number of weights used to create hash functions.

See Also:
Constant Field Values

m

protected final long m
The number of bits in this filter.


power

protected int power
if bitfield is an exact power of 2 in length, it is this power


expectedInserts

protected final long expectedInserts
The expected number of inserts; determines calculated size


d

protected final int d
The number of hash functions used by this filter.


bits

protected final long[][] bits
The underlying bit vector


weight

protected final long[][] weight
The random integers used to generate the hash functions.


size

int size
The number of elements currently in the filter. It may be smaller than the actual number of additions of distinct character sequences because of false positives.


NATURAL_LOG_OF_2

static final double NATURAL_LOG_OF_2
The natural logarithm of 2, used in the computation of the number of bits.


SUBARRAY_POWER_OF_TWO

protected static final int SUBARRAY_POWER_OF_TWO
power-of-two to use as maximum size of bitfield subarrays

See Also:
Constant Field Values

SUBARRAY_LENGTH_IN_LONGS

protected static final int SUBARRAY_LENGTH_IN_LONGS
number of longs in one subarray

See Also:
Constant Field Values

SUBARRAY_MASK

protected static final int SUBARRAY_MASK
mask for lowest SUBARRAY_POWER_OF_TWO bits

See Also:
Constant Field Values

DEBUG

static final boolean DEBUG
See Also:
Constant Field Values

ADDRESS_BITS_PER_UNIT

protected static final long ADDRESS_BITS_PER_UNIT
See Also:
Constant Field Values

BIT_INDEX_MASK

protected static final long BIT_INDEX_MASK
See Also:
Constant Field Values
Constructor Detail

BloomFilter64bit

public BloomFilter64bit(long n,
                        int d)
Creates a new Bloom filter with given number of hash functions and expected number of elements.

Parameters:
n - the expected number of elements.
d - the number of hash functions; if the filter add not more than n elements, false positives will happen with probability 2-d.

BloomFilter64bit

public BloomFilter64bit(long n,
                        int d,
                        boolean roundUp)

BloomFilter64bit

public BloomFilter64bit(long n,
                        int d,
                        java.util.Random weightsGenerator,
                        boolean roundUp)
Creates a new Bloom filter with given number of hash functions and expected number of elements.

Parameters:
n - the expected number of elements.
d - the number of hash functions; if the filter add not more than n elements, false positives will happen with probability 2-d.
Random - weightsGenerator may provide a seeded Random for reproducible internal universal hash function weighting
roundUp - if true, round bit size up to next-nearest-power-of-2
Method Detail

size

public int size()
The number of character sequences in the filter.

Specified by:
size in interface BloomFilter
Returns:
the number of character sequences in the filter (but see contains(CharSequence)).

hash

protected long hash(java.lang.CharSequence s,
                    int l,
                    int k)
Hashes the given sequence with the given hash function.

Parameters:
s - a character sequence.
l - the length of s.
k - a hash function index (smaller than d).
Returns:
the position in the filter corresponding to s for the hash function k.

bitIndexesFor

public long[] bitIndexesFor(java.lang.CharSequence s)

contains

public boolean contains(java.lang.CharSequence s)
Checks whether the given character sequence is in this filter.

Note that this method may return true on a character sequence that is has not been added to the filter. This will happen with probability 2-d, where d is the number of hash functions specified at creation time, if the number of the elements in the filter is less than n, the number of expected elements specified at creation time.

Specified by:
contains in interface BloomFilter
Parameters:
s - a character sequence.
Returns:
true if the sequence is in the filter (or if a sequence with the same hash sequence is in the filter).

add

public boolean add(java.lang.CharSequence s)
Adds a character sequence to the filter.

Specified by:
add in interface BloomFilter
Parameters:
s - a character sequence.
Returns:
true if the character sequence was not in the filter (but see contains(CharSequence)).

getBit

public boolean getBit(long bitIndex)
Returns from the local bitvector the value of the bit with the specified index. The value is true if the bit with the index bitIndex is currently set; otherwise, returns false. (adapted from cern.colt.bitvector.QuickBitVector)

Specified by:
getBit in interface BloomFilter
Parameters:
bitIndex - the bit index.
Returns:
the value of the bit with the specified index.

setBit

protected void setBit(long bitIndex)
Changes the bit with index bitIndex in local bitvector. (adapted from cern.colt.bitvector.QuickBitVector)

Parameters:
bitIndex - the index of the bit to be set.

setGetBit

protected boolean setGetBit(long bitIndex)
Sets the bit with index bitIndex in local bitvector -- returning the old value. (adapted from cern.colt.bitvector.QuickBitVector)

Parameters:
bitIndex - the index of the bit to be set.

getSizeBytes

public long getSizeBytes()
Description copied from interface: BloomFilter
The amount of memory in bytes consumed by the bloom bitfield.

Specified by:
getSizeBytes in interface BloomFilter
Returns:
memory used by bloom bitfield, in bytes

getExpectedInserts

public long getExpectedInserts()
Description copied from interface: BloomFilter
Report the number of expected inserts used at instantiation time to calculate the bitfield size.

Specified by:
getExpectedInserts in interface BloomFilter
Returns:
long number of inserts expected at instantiation

getHashCount

public long getHashCount()
Description copied from interface: BloomFilter
Report the number of internal independent hash function (and thus the number of bits set/checked for each item presented).

Specified by:
getHashCount in interface BloomFilter
Returns:
long count of hash functions


Copyright © 2003-2011 Internet Archive. All Rights Reserved.