I recently went down this path as well; though it sounds like my application was slightly different. I was interested in approximating set operations on a large number of strings.
You do make the key observation that a fast bit vector is required. Depending on what you want to put in your bloom filter, you may also need to give some thought to the speed of the hashing algorithm(s) used. You might find this library useful. You may also want to tinker with the random number technique used below that only hashes your key a single time.
In terms of non-Java bit array implementations:
- Boost has dynamic_bitset
- Java has the built in BitSet
I built my bloom filter using BitVector. I spent some time profiling and optimizing the library and contributing back my patches to Avi. Go to that BitVector link and scroll down to acknowledgments in v1.5 to see details. In the end, I realized that performance was not a goal of this project and decided against using it.
Here's some code I had lying around. I may put this up on google code at python-bloom. Suggestions welcome.
from BitVector import BitVector
from random import Random
# get hashes from http://www.partow.net/programming/hashfunctions/index.html
from hashes import RSHash, JSHash, PJWHash, ELFHash, DJBHash
#
# ryan.a.cox@gmail.com / www.asciiarmor.com
#
# copyright (c) 2008, ryan cox
# all rights reserved
# BSD license: http://www.opensource.org/licenses/bsd-license.php
#
class BloomFilter(object):
def __init__(self, n=None, m=None, k=None, p=None, bits=None ):
self.m = m
if k > 4 or k < 1:
raise Exception('Must specify value of k between 1 and 4')
self.k = k
if bits:
self.bits = bits
else:
self.bits = BitVector( size=m )
self.rand = Random()
self.hashes = []
self.hashes.append(RSHash)
self.hashes.append(JSHash)
self.hashes.append(PJWHash)
self.hashes.append(DJBHash)
# switch between hashing techniques
self._indexes = self._rand_indexes
#self._indexes = self._hash_indexes
def __contains__(self, key):
for i in self._indexes(key):
if not self.bits[i]:
return False
return True
def add(self, key):
dupe = True
bits = []
for i in self._indexes(key):
if dupe and not self.bits[i]:
dupe = False
self.bits[i] = 1
bits.append(i)
return dupe
def __and__(self, filter):
if (self.k != filter.k) or (self.m != filter.m):
raise Exception('Must use bloom filters created with equal k / m paramters for bitwise AND')
return BloomFilter(m=self.m,k=self.k,bits=(self.bits & filter.bits))
def __or__(self, filter):
if (self.k != filter.k) or (self.m != filter.m):
raise Exception('Must use bloom filters created with equal k / m paramters for bitwise OR')
return BloomFilter(m=self.m,k=self.k,bits=(self.bits | filter.bits))
def _hash_indexes(self,key):
ret = []
for i in range(self.k):
ret.append(self.hashes[i](key) % self.m)
return ret
def _rand_indexes(self,key):
self.rand.seed(hash(key))
ret = []
for i in range(self.k):
ret.append(self.rand.randint(0,self.m-1))
return ret
if __name__ == '__main__':
e = BloomFilter(m=100, k=4)
e.add('one')
e.add('two')
e.add('three')
e.add('four')
e.add('five')
f = BloomFilter(m=100, k=4)
f.add('three')
f.add('four')
f.add('five')
f.add('six')
f.add('seven')
f.add('eight')
f.add('nine')
f.add("ten")
# test check for dupe on add
assert not f.add('eleven')
assert f.add('eleven')
# test membership operations
assert 'ten' in f
assert 'one' in e
assert 'ten' not in e
assert 'one' not in f
# test set based operations
union = f | e
intersection = f & e
assert 'ten' in union
assert 'one' in union
assert 'three' in intersection
assert 'ten' not in intersection
assert 'one' not in intersection
Also, in my case I found it useful to have a faster count_bits function for BitVector. Drop this code into BitVector 1.5 and it should give you a more performant bit counting method:
def fast_count_bits( self, v ):
bits = (
0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8 )
return bits[v & 0xff] + bits[(v >> 8) & 0xff] + bits[(v >> 16) & 0xff] + bits[v >> 24]