Hashing floating point values

2019-06-15 00:51发布

Recently, I was curious how hash algorithms for floating points worked, so I looked at the source code for boost::hash_value. It turns out to be fairly complicated. The actual implementation loops over each digit in the radix and accumulates a hash value. Compared to the integer hash functions, it's much more involved.

My question is: why should a floating-point hash algorithm be any more complicated? Why not just hash the binary representation of the floating point value as if it was an integer?

Like:

std::size_t hash_value(float f)
{
  return hash_value(*(reinterpret_cast<int*>(&f)));
}

I realize that float is not guaranteed to be the same size as int on all systems, but that sort of thing could be handled with a few template meta-programs to deduce an integral type that is the same size as float. So what is the advantage of introducing an entirely different hash function that specifically operates on floating point types?

4条回答
我命由我不由天
2楼-- · 2019-06-15 01:04

Take a look at https://svn.boost.org/trac/boost/ticket/4038

In essence it boils down to two things:

  • Portability: when you take the binary representation of a float, then on some platform it could be possible that a float with a same value has multiple representations in binary. I don't know if there is actually a platform where such an issue exists, but with the complication of denormelized numbers, I'm not sure if this might actually happen.

  • the second issue is what you proposed, it might be that sizeof(float) does not equal sizeof(int).

I did not find anyone mentioning that the boost hash indeed avoids fewer collisions. Although I assume that separating the mantissa from the exponent might help, but the above link does not suggest that this was the driving design decision.

查看更多
叼着烟拽天下
3楼-- · 2019-06-15 01:07

Why are you wanting to hash floating point values? For the same reason that comparing floating point values for equality has a number of pitfalls, hashing them can have similar (negative) consequences.

However given that you really do want to do this, I suspect that the boost algorithm is complicated because when you take into account denormalized numbers different bit patterns can represent the same number (and should probably have the same hash). In IEEE 754 there are also both positive and negative 0 values that compare equal but have different bit patterns.

This probably wouldn't come up in the hashing if it wouldn't have come up otherwise in your algorithm but you still need to take care about signaling NaN values.

Additionally what would be the meaning of hashing +/- infinity and/or NaN? Specifically NaN can have many representations, should they all result in the same hash? Infinity seems to have just two representations so it seems like it would work out ok.

查看更多
成全新的幸福
4楼-- · 2019-06-15 01:16

One reason not to just use the bit pattern is that some different bit patterns must be considered equals and thus have the same hashcode, namely

  • positive and negative zero
  • possibly denormalized numbers (I don't think this can occur with IEEE 754, but C allows other float representations).
  • possibly NANs (there are many, at least in IEEE 754. It actually requires NAN patterns to be considered unequal to themselves, which arguably means the cannot be meaningfully used in a hashtable)
查看更多
等我变得足够好
5楼-- · 2019-06-15 01:23

I imagine it's so that two machines with incompatible floating-point formats hash to the same value.

查看更多
登录 后发表回答