A bloom filter uses a hash function (or many) to generate a value between 0 and m given an input string X. My question is how to you use a hash function to generate a value in this way, for example an MD5 hash is typically represented by a 32 length hex
string, how would I use an MD5 hashing algorithm to generate a value between 0 and m where I can specify m? I'm using Java at the moment so an example of to do this with the MessageDigest functionality it offers would be great, though just a generic description of how to do about it would be fine too.
Thanks
You should first convert the hash output to an unsigned integer, then reduce it modulo m. This looks like this:
MessageDigest md = MessageDigest.getInstance("MD5");
// hash data...
byte[] hashValue = md.digest();
BigInteger n = new BigInteger(1, hashValue);
n = n.mod(m);
// at that point, n has a value between 0 and m-1 (inclusive)
I have assumed that m is a BigInteger
instance. If necessary, use BigInteger.valueOf()
. Similarly, use n.intValue()
or n.longValue()
to get the value of n as one of the primitive types of Java.
The modular reduction is somewhat biased, but the bias is very small if m is substantially smaller than 2^128.
Simplest way would probably be to just convert the hash output (as a byte sequence) to a single binary number and take that modulo m.