I'm trying to implement a Count-Min Sketch algorithm in Scala, and so I need to generate k pairwise independent hash functions.
This is a lower-level than anything I've ever programmed before, and I don't know much about hash functions except from Algorithms classes, so my question is: how do I generate these k pairwise independent hash functions?
Am I supposed to use a hash function like MD5 or MurmurHash? Do I just generate k hash functions of the form f(x) = ax + b (mod p)
, where p is a prime and a and b are random integers? (i.e., the universal hashing family everyone learns in algorithms 101)
I'm looking more for simplicity than raw speed (e.g., I'll take something 5x slower if it's simpler to implement).
Probably the simplest approach is to take some cryptographic hash function and "seed" it with different sequences of bytes. For most practical purposes, the results should be independent, as this is one of the key properties a cryptographic hash function should have (if you replace any part of a message, the hash should be completely different).
I'd do something like:
Edit: I don't know the precise requirements of the Count-Min Sketch, maybe a simple has function would suffice, but it doesn't seem to be the simplest solution.
I suggested a cryptographic hash function, because there you have quite strong guarantees that the resulting hash functions will be very different, and it's easy to implement, just use the standard libraries.
On the other hand, if you have two hash functions of the form
f1(x) = ax + b (mod p)
andf2(x) = cx + d (mod p)
, then you can compute one using another (without knowingx
) using a simple linear formulaf2(x) = c / a * (f1(x) - b) + d (mod p)
, which suggests that they aren't very independent. So you could run into unexpected problems here.Scala already has
MurmurHash
implemented (it'sscala.util.MurmurHash
). It's very fast and very good at distributing values. A cryptographic hash is overkill--you'll just take tens or hundreds of times longer than you need to. Just pickk
different seeds to start with and, since it's nearly cryptographic in quality, you'll getk
largely independent hash codes. (In 2.10, you should probably switch to usingscala.util.hashing.MurmurHash3
; the usage is rather different but you can still do the same thing with mixing.)If you only need near values to be mapped to randomly far values this will work; if you want to avoid collisions (i.e. if A and B collide using hash 1 they will probably not also collide using hash 2), then you'll need to go at least one more step and hash not the whole object but subcomponents of it so there's an opportunity for the hashes to start out different.