knuth multiplicative hash

2019-03-14 04:19发布

Is this a correct implementation of the Knuth multiplicative hash.

int hash(int v)
{
    v *= 2654435761;
    return v >> 32;
}

Does overflow in the multiplication affects the algorithm?

How to improve the performance of this method?

4条回答
成全新的幸福
2楼-- · 2019-03-14 04:57

Knuth multiplicative hash is used to compute an hash value in {0, 1, 2, ..., 2^p - 1} from an integer k.

Suppose that p is in between 0 and 32, the algorithm goes like this:

  • Compute alpha as the closest integer to 2^32 (-1 + sqrt(5)) / 2. We get alpha = 2 654 435 769.

  • Compute k * alpha and reduce the result modulo 2^32:

    k * alpha = n0 * 2^32 + n1 with 0 <= n1 < 2^32

  • Keep the highest p bits of n1:

    n1 = m1 * 2^(32-p) + m2 with 0 <= m2 < 2^(32 - p)

So, a correct implementation of Knuth multiplicative algorithm in C++ is:

std::uint32_t knuth(int x, int p) {
    assert(p >= 0 && p <= 32);

    const std::uint32_t knuth = 2654435769;
    const std::uint32_t y = x;
    return (y * knuth) >> (32 - p);
}

Forgetting to shift the result by (32 - p) is a major mistake. As you would lost all the good properties of the hash. It would transform an even sequence into an even sequence which would be very bad as all the odd slots would stay unoccupied. That's like taking a good wine and mixing it with Coke. By the way, the web is full of people misquoting Knuth and using a multiplication by 2 654 435 761 without taking the higher bits. I just opened the Knuth and he never said such a thing. It looks like some guy who decided he was "smart" decided to take a prime number close to 2 654 435 769.

Bare in mind that most hash tables implementations don't allow this kind of signature in their interface, as they only allow

uint32_t hash(int x);

and reduce hash(x) modulo 2^p to compute the hash value for x. Those hash tables cannot accept the Knuth multiplicative hash. This might be a reason why so many people completely ruined the algorithm by forgetting to take the higher p bits. So you can't use the Knuth multiplicative hash with std::unordered_map or std::unordered_set. But I think that those hash tables use a prime number as a size, so the Knuth multiplicative hash is not useful in this case. Using hash(x) = x would be a good fit for those tables.

Source: "Introduction to Algorithms, third edition", Cormen et al., 13.3.2 p:263

Source: "The Art of Computer Programming, Volume 3, Sorting and Searching", D.E. Knuth, 6.4 p:516

查看更多
甜甜的少女心
3楼-- · 2019-03-14 04:59

If the input argument is a pointer then I use this

#include <inttypes.h>

uint32_t knuth_mul_hash(void* k) {
  ptrdiff_t v = (ptrdiff_t)k * UINT32_C(2654435761);
  v >>= ((sizeof(ptrdiff_t) - sizeof(uint32_t)) * 8); // Right-shift v by the size difference between a pointer and a 32-bit integer (0 for x86, 32 for x64)
  return (uint32_t)(v & UINT32_MAX);
}

I usually use this as the default fallback hashing function in hashmap implementations, dictionaries, sets, etc...

查看更多
爷、活的狠高调
4楼-- · 2019-03-14 05:09

Ok, I looked it up in TAOCP volume 3 (2nd edition), section 6.4, page 516.

This implementation is not correct, though as I mentioned in the comments it may give the correct result anyway.

A correct way (I think - feel free to read the relevant chapter of TAOCP and verify this) is something like this: (important: yes, you must shift the result right to reduce it, not use bitwise AND. However, that is not the responsibility of this function - range reduction is not properly part of hashing itself)

uint32_t hash(uint32_t v)
{
    return v * UINT32_C(2654435761);
    // do not comment about the lack of right shift. I'm not ignoring it. read on.
}

Note the uint32_t's (as opposed to int's) - they make sure the multiplication overflows modulo 2^32, as it is supposed to do if you choose 32 as the word size. There is also no right shift by k here, because there is no reason to give responsibility for range-reduction to the basic hashing function and it is actually more useful to get the full result. The constant 2654435761 is from the question, the actual suggested constant is 2654435769, but that's a small difference that as far as I know does not affect the quality of the hash.

Other valid implementations shift the result right by some amount (not the full word size though, that doesn't make sense and C++ doesn't like it), depending on how many bits of hash you need. Or they may use an other constant (subject to certain conditions) or an other word size. Reducing the hash modulo something is not a valid implementation, but a common mistake, likely it is a de-facto standard way to do range-reduction on a hash. The bottom bits of a multiplicative hash are the worst-quality bits (they depend on less of the input), you only want to use them if you really need more bits, while reducing the hash modulo a power of two would return only the worst bits. Indeed that is equivalent to throwing away most of the input bits too. Reducing modulo a non-power-of-two is not so bad since it does mix in the higher bits, but it's not how the multiplicative hash was defined.

So to be clear, yes there is a right shift, but that is range reduction not hashing and can only be the responsibility of the hash table, since it depends on its internal size.

The type should be unsigned, otherwise the overflow is unspecified (thus possibly wrong, not just on non-2's-complement architectures but also on overly clever compilers) and the optional right shift would be a signed shift (wrong).

On the page I mention at the top, there is this formula:

knuth formula

Here we have A = 2654435761 (or 2654435769), w = 232 and M = 232. Calculating AK/w gives a fixed-point result with the format Q32.32, the mod 1 step takes only the 32 fraction bits. But that's just the same thing as doing a modular multiplication and then saying that the result is the fraction bits. Of course when multiplied by M, all the fraction bits become integer bits because of how M was chosen, and so it simplifies to just a plain old modular multiplication. When M is a lower power of two, that just right-shifts the result, as mentioned.

查看更多
迷人小祖宗
5楼-- · 2019-03-14 05:11

Might be late, but heres a Java Implementation of Knuth's Method :

For a hashtable of Size N :

public long hash(int key) {
    long l = 2654435769L;
    return (key * l >> 32) % N ;
}
查看更多
登录 后发表回答