Using hash functions with Bloom filters

2020-02-29 11:29发布

问题:

A bloom filter uses a hash function (or many) to generate a value between 0 and m given an input string X. My question is how to you use a hash function to generate a value in this way, for example an MD5 hash is typically represented by a 32 length hex string, how would I use an MD5 hashing algorithm to generate a value between 0 and m where I can specify m? I'm using Java at the moment so an example of to do this with the MessageDigest functionality it offers would be great, though just a generic description of how to do about it would be fine too.

Thanks

回答1:

You should first convert the hash output to an unsigned integer, then reduce it modulo m. This looks like this:

MessageDigest md = MessageDigest.getInstance("MD5");
// hash data...
byte[] hashValue = md.digest();
BigInteger n = new BigInteger(1, hashValue);
n = n.mod(m);
// at that point, n has a value between 0 and m-1 (inclusive)

I have assumed that m is a BigInteger instance. If necessary, use BigInteger.valueOf(). Similarly, use n.intValue() or n.longValue() to get the value of n as one of the primitive types of Java.

The modular reduction is somewhat biased, but the bias is very small if m is substantially smaller than 2^128.



回答2:

Simplest way would probably be to just convert the hash output (as a byte sequence) to a single binary number and take that modulo m.