I'm currently implementing a hash table in C++ and I'm trying to make a hash function for floats...
I was going to treat floats as integers by padding out the decimal digits, but then I realized that I would probably run into overflow with big numbers...
Is there a good way to hash floats?
You don't have to give me the function directly, but I'd like to see/understand different concepts...
Notes:
I don't need it to be really fast, just evenly distributed if possible.
I've read that floats should not be hashed because of the speed of computation. Can someone confirm/explain this and give me other reasons why floats should not be hashed? I don't really understand why (besides the speed).
You can of course represent a float as an int type of the same size to hash it, however this naive approach has some pitfalls you need to be careful of... Simply converting to a binary representation is error prone, since values which are equal won't necessarily have the same binary representation.

An obvious case: -0.0 won't match 0.0, for example.*

Further, simply converting to an int of the same size won't give a very even distribution, which is often important (implementing a hash/set that uses buckets, for example).

Suggested steps for implementation (a rough sketch follows the footnote below):

- Handle the special cases (nan, inf) and (0.0, -0.0); whether you need to do this explicitly or not depends on the method used.
- Convert the float to an int of the same size (that is - use a union for example to represent the float as an int, not simply cast to an int).

*: You may want to check for (nan and -nan) too. How to handle those exactly depends on your use case (you may want to ignore sign for all nan's as CPython does).
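Putting those steps together, a minimal sketch in C++ might look like the following (this is my own illustration, not the answer's code; the bit-mixing step and its constants are assumptions added to address the distribution concern above):

```cpp
#include <cmath>    // std::isnan
#include <cstdint>  // std::uint32_t
#include <cstring>  // std::memcpy
#include <limits>   // std::numeric_limits

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes a 32-bit float");

// Hypothetical sketch of the suggested steps, not production code.
std::uint32_t hash_float(float f)
{
    // Canonicalize values that compare equal but have different bit patterns.
    if (f == 0.0f)
        f = 0.0f;  // folds -0.0f onto +0.0f
    if (std::isnan(f))
        f = std::numeric_limits<float>::quiet_NaN();  // ignore NaN sign/payload

    // Represent the float as an int of the same size; memcpy avoids the
    // aliasing problems of a plain pointer cast.
    std::uint32_t u = 0;
    std::memcpy(&u, &f, sizeof u);

    // Illustrative xorshift-multiply mixing so that nearby floats spread
    // more evenly across buckets (constants chosen only for illustration).
    u ^= u >> 16;
    u *= 0x7feb352dU;
    u ^= u >> 15;
    u *= 0x846ca68bU;
    u ^= u >> 16;
    return u;
}
```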
Python's _Py_HashDouble is a good reference for how you might hash a float in production code (ignore the -1 check at the end, since that's a special value for Python).

You can use the std hash, it's not bad:
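For example (assuming C++11 or later), a minimal use of the standard library's hash for float:

```cpp
#include <functional>  // std::hash
#include <iostream>

int main()
{
    std::hash<float> hasher;             // standard specialization for float
    std::cout << hasher(3.14f) << '\n';
    // Typical implementations fold -0.0f onto 0.0f before hashing,
    // so the two zeros (which compare equal) hash identically.
    std::cout << (hasher(0.0f) == hasher(-0.0f)) << '\n';
}
```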
If you're interested, I just made a hash function that uses floating point and can hash floats. It also passes SMHasher (which is the main bias test for non-crypto hash functions). It's a lot slower than normal non-cryptographic hash functions due to the float calculations.
I'm not sure if tifuhash will become useful for all applications, but it's interesting to see a simple floating point function pass both PractRand and SMHasher.
The main state update function is very simple, and looks like:
Anyway, you can get it on npm, or you can check out the GitHub repo.
Using it is simple:
There's a demo of some hashes on runkit here https://runkit.com/593a239c56ebfd0012d15fc9/593e4d7014d66100120ecdb9
Side note: I think that in the future, using floating point, possibly big arrays of floating point calculations, could be a useful way to make more computationally-demanding hash functions. A weird side effect I discovered of using floating point is that the hashes are target dependent, and I surmise they could perhaps be used to fingerprint the platforms they were calculated on.
If your hash function did the following, you'd get some degree of fuzziness on the hash lookup:
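The answer's original snippet isn't shown in this extract; a sketch of the masking idea, with the function name and the use of std::hash as my own assumptions, might look like:

```cpp
#include <cstddef>     // std::size_t
#include <cstdint>     // std::uint64_t
#include <cstring>     // std::memcpy
#include <functional>  // std::hash

static_assert(sizeof(double) == sizeof(std::uint64_t), "sketch assumes a 64-bit double");

// Hypothetical sketch: hash a double after clearing its low mantissa bits.
std::size_t fuzzy_hash(double d)
{
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);      // reinterpret the double's bit pattern

    bits &= ~std::uint64_t{0xFFF};            // zero the 12 least significant bits

    return std::hash<std::uint64_t>{}(bits);  // hash the masked bits
}
```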
This way you'll mask off the 12 least significant bits, allowing for a degree of uncertainty... It really depends on your application, however.
Technically undefined behavior, but most compilers support this. Alternative solution:
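Neither of the answer's snippets appears in this extract; as a hedged sketch, union-based type punning and a cast-based alternative typically look something like this (function names are mine):

```cpp
#include <cstdint>  // std::uint32_t

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes a 32-bit float");

// Union-based type punning: technically undefined behavior in C++,
// but widely supported by mainstream compilers.
std::uint32_t float_bits_union(float x)
{
    union { float f; std::uint32_t u; } pun;
    pun.f = x;
    return pun.u;
}

// Cast-based alternative in the same spirit (also relies on compiler
// behavior rather than on what the standard guarantees).
std::uint32_t float_bits_cast(float x)
{
    return *reinterpret_cast<std::uint32_t*>(&x);
}
```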
Both solutions depend on the endianness of your machine, so for example on x86 and SPARC, they will produce different results. If that doesn't bother you, just use one of these solutions.
It depends on the application, but most of the time floats should not be hashed, because hashing is used for fast lookup of exact matches and most floats are the result of calculations that produce a float which is only an approximation to the correct answer. The usual way to check for floating-point equality is to check whether it is within some delta (in absolute value) of the correct answer. This type of check does not lend itself to hashed lookup tables.
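As a concrete illustration of that delta check (the helper name and the default delta are my own choices, not from the answer):

```cpp
#include <cmath>  // std::fabs

// Hypothetical helper: "equal" here means within delta in absolute value.
bool approx_equal(double a, double b, double delta = 1e-9)
{
    return std::fabs(a - b) < delta;
}
```

Such a tolerance-based comparison is not transitive, which is one reason it cannot serve as the equality predicate behind a hash table.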
EDIT:
Normally, because of rounding errors and inherent limitations of floating point arithmetic, if you expect that floating point numbers a and b should be equal to each other because the math says so, you need to pick some relatively small delta > 0, and then you declare a and b to be equal if abs(a-b) < delta, where abs is the absolute value function. For more detail, see this article.

Here is a small example that demonstrates the problem:
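The original code isn't reproduced in this extract; a minimal stand-in that shows the same effect is below (the particular values x = 1.0 and y = 49.0 are my choice; 1/49 has no exact binary floating point representation):

```cpp
#include <iostream>

int main()
{
    // Illustrative values, not from the original answer.
    double x = 1.0;
    double y = 49.0;
    double z = x / y;  // rounded: z is only an approximation of 1/49

    if (z * y != x)    // multiplying back does not recover x exactly
        std::cout << "ooops..." << std::endl;

    return 0;
}
```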
Depending on your platform, compiler and optimization levels, this may print ooops... to your screen, meaning that the mathematical equation x / y * y = x does not necessarily hold on your computer.

There are cases where floating point arithmetic produces exact results, e.g. reasonably sized integers and rationals with power-of-2 denominators.