C++ - Why is boost::hash_combine the best way to combine hash-values?

Published 2019-01-31 17:10

Question:

I've read in other posts that this seems to be the best way to combine hash-values. Could somebody please break this down and explain why this is the best way to do it?

template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}

Edit: The other question only asks about the magic number, but I'd like to understand the whole function, not just that part.

Answer 1:

It being the "best" is arguable.

It being "good", or even "very good", at least superficially, is easy.

seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);

We'll presume seed is a previous result of hasher or this algorithm.

^= means that the bits on the left and the bits on the right all affect the bits of the result.

hasher(v) is presumed to be a decent hash on v. But the rest is defence in case it isn't a decent hash.

0x9e3779b9 is a 32-bit value (it could arguably be extended to 64 bits when size_t is 64 bits) with half its bits 0 and half 1. It is basically a random-looking series of 0s and 1s, produced by writing a particular irrational constant, the reciprocal of the golden ratio, as a base-2 fixed-point value. This helps ensure that even if the hasher returns bad values, we still get a smear of 1s and 0s in our output.
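As a quick sanity check (a minimal standalone sketch, not part of the original answer), you can reproduce the constant by writing the reciprocal of the golden ratio in 32-bit fixed point:

#include <cmath>
#include <cstdint>
#include <cstdio>

int main(){
  const double phi = (1.0 + std::sqrt(5.0)) / 2.0; // golden ratio, ~1.6180339887
  // 2^32 divided by phi, truncated to an integer, yields the magic constant
  const uint32_t magic = static_cast<uint32_t>(4294967296.0 / phi);
  std::printf("0x%08x\n", magic); // prints 0x9e3779b9
}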

(seed<<6) + (seed>>2) is a bit shuffle of the incoming seed.

Imagine the 0x constant was missing, and imagine the hasher returns the constant 0x00001000 for almost every v passed in. Now, each bit of the seed is spread out over the next iteration of the hash, during which it is again spread out.

With seed ^= (seed<<6) + (seed>>2), the value 0x00001000 becomes 0x00041400 after one iteration, then 0x01021100 after another. As you repeat the operation, any set bits are "smeared out" over the output bits. Eventually the right-shifted and left-shifted copies collide, and the carry from the addition moves set bits from "even locations" to "odd locations".
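To watch the smearing happen, here is a minimal sketch that iterates just that shuffle (with hasher(v) and the constant omitted, as in the thought experiment) on the lone bit:

#include <cstdint>
#include <cstdio>

int main(){
  uint32_t seed = 0x00001000; // a "boring" value with a single set bit
  for(int i = 1; i <= 5; ++i){
    seed ^= (seed << 6) + (seed >> 2); // the shuffle from hash_combine
    std::printf("after iteration %d: 0x%08x\n", i, seed);
  }
}

Within a few iterations the single input bit influences bits across the whole word.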

The number of bits that depend on each bit of the input seed grows relatively fast and in complex ways as the combine operation recurses on the seed. Adding causes carries, which smear things even more. The 0x constant adds a bunch of pseudo-random bits that make boring hash values occupy more than a few bits of the hash space after being combined.

It is asymmetric in seed and hasher(v), so order matters (combining the hashes of "dog" and "god" gives different results). It handles boring hash values (such as mapping characters to their ASCII values, which only twiddles a handful of bits). And it is reasonably fast.
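To see the order-dependence concretely, this sketch folds the exact hash_combine from the question over the characters of "dog" and "god" (hash_string is a helper introduced here for illustration):

#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
  std::hash<T> hasher;
  seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}

std::size_t hash_string(const std::string& s){
  std::size_t seed = 0;
  for(char c : s) hash_combine(seed, c); // combine one character at a time
  return seed;
}

int main(){
  std::cout << "dog: " << hash_string("dog") << '\n'
            << "god: " << hash_string("god") << '\n'; // expect two different values
}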

Slower hash combines that are cryptographically strong can be better in other situations. I would naively presume that making the shifts a mix of even and odd amounts might be a good idea (but maybe the addition, whose carries move bits from even to odd positions, makes that less of a problem: after 3 iterations, incoming lone seed bits collide, add, and cause a carry).

The downside to this kind of analysis is that it only takes one mistake to make a hash function really bad, so pointing out all the good things only helps so much. Another thing in its favour is that it is reasonably famous, lives in an open-source repository, and I haven't heard anyone point out why it is bad.



Answer 2:

It's not the best; surprisingly to me, it's not even particularly good.

Figure 1: The entropic effect of a single bit change in one of two random 32-bit numbers being combined to a single 32-bit number using boost::hash_combine

Figure 2: The effect of a single bit change in one of two random 32-bit numbers on the result of boost::hash_combine

The entropic effect of a single bit change in either of the values being combined needs to be at least log(2) [the black line in the figures]. As can be seen in figure 1, this is not the case for the highest bit of the seed value, and it is a little tight for the second-highest bit as well. This means that, statistically, the high bits of the seed are being lost.

Using bit rotations instead of bit shifts, and XOR or addition-with-carry instead of simple addition, one could easily create a similar hash_combine that preserves entropy better. Also, when hash and seed are both of low entropy, a hash_combine that spreads bits more widely would be preferable. The rotation that maximizes this spread is the golden section, if the number of hashes to be combined is not known in advance or is large.

Using these ideas, I propose the following hash_combine. It uses 6 operations, just like boost's, but achieves more chaotic hash behaviour and preserves the input bits better. Of course, one can always go further and win the contest by adding in just a single multiplication by an odd integer; that would distribute the hashes very well.

Figure 3: The entropic effect of a single bit change in one of two random 32-bit numbers being combined to a single 32-bit number using the proposed hash_combine alternative

Figure 4: The effect of a single bit change in one of two random 32-bit numbers on the result of the proposed hash_combine alternative

#include <cstdint>
#include <iostream>
#include <limits>
#include <random>
#include <iomanip>
#include <type_traits>

using std::cout;
using std::endl;
using std::setw;

// boost::hash_combine reduced to 32-bit values: a is the seed, b the incoming hash
constexpr uint32_t hash_combine_boost(const uint32_t& a, const uint32_t& b){
  return a^( b + 0x9e3779b9 + (a<<6) + (a>>2) );
}

template <typename T,typename S>
typename std::enable_if<std::is_unsigned<T>::value,T>::type
constexpr rol(const T n, const S i){
  const T m = (std::numeric_limits<T>::digits-1);
  const T c = i&m;
  return (n<<c)|(n>>((-c)&m)); // this is usually recognized by the compiler to mean rotation, try it with godbolt
}

template <typename T,typename S>
typename std::enable_if<std::is_unsigned<T>::value,T>::type
constexpr ror(const T n, const S i){
  const T m = (std::numeric_limits<T>::digits-1);
  const T c = i&m;
  return (n>>c)|(n<<((-c)&m)); // this is usually recognized by the compiler to mean rotation, try it with godbolt
}

// addition with end-around carry: the carry-out of a+b is added back into
// the low bit, so no information is lost to overflow
template <typename T>
typename std::enable_if<std::is_unsigned<T>::value,T>::type
constexpr circadd(const T& a,const T& b){
  const T t = a+b;
  return t+(t<a);
}

// subtraction with end-around borrow, the counterpart of circadd
template <typename T>
typename std::enable_if<std::is_unsigned<T>::value,T>::type
constexpr circdff(const T& a,const T& b){
  const T t = a-b;
  return t-(t>a);
}

// rotate the carry-preserving sum by 19 (roughly 32 divided by the golden
// ratio, the golden-section rotation mentioned above) and mix in the
// borrow-preserving difference
constexpr uint32_t hash_combine_proposed(const uint32_t&seed, const uint32_t& v){
  return rol(circadd(seed,v),19)^circdff(seed,v);
}

int main(){
  size_t boost_similarity[32*64]    = {0};
  size_t proposed_similarity[32*64] = {0};
  std::random_device urand;
  std::mt19937 mt(urand());
  std::uniform_int_distribution<uint32_t> bit(0,63); // which of the 64 input bits to flip
  std::uniform_int_distribution<uint32_t> rnd;
  const size_t N = 1ull<<24;
  uint32_t a,b,c;
  for(size_t i=0;i!=N;++i){
    const size_t n = bit(mt);
    uint32_t t0 = rnd(mt);
    uint32_t t1 = rnd(mt);
    uint32_t t2 = t0;
    uint32_t t3 = t1;
    if (n>31){ // flip bit n-32 of the first input ...
      t2^=1ul<<(n-32);
    }else{     // ... or bit n of the second input
      t3^=1ul<<n;
    }
    a = hash_combine_boost(t0,t1);
    b = hash_combine_boost(t2,t3);
    c = a^b; // output bits that flipped in response
    for (size_t k=0;k!=32;++k) boost_similarity[n*32+k]+=(0!=(c&(1ul<<k)));
    a = hash_combine_proposed(t0,t1);
    b = hash_combine_proposed(t2,t3);
    c = a^b;
    for (size_t k=0;k!=32;++k) proposed_similarity[n*32+k]+=(0!=(c&(1ul<<k)));
  }

  // one row per flipped input bit, one column per output bit; counts near
  // N/2 mean the output bit flips with probability 1/2 (good avalanche)
  for (size_t j=0;j!=64;++j){
    for (size_t i=0;i!=32;++i){
      cout << setw(12) << boost_similarity[j*32+i] << " ";
    }
    cout << endl;
  }

  for (size_t j=0;j!=64;++j){
    for (size_t i=0;i!=32;++i){
      cout << setw(12) << proposed_similarity[j*32+i] << " ";
    }
    cout << endl;
  }
}