Hash function for floats

2020-01-31 05:28发布

I'm currently implementing a hash table in C++ and I'm trying to make a hash function for floats...

I was going to treat floats as integers by padding the decimal numbers, but then I realized that I would probably reach the overflow with big numbers...

Is there a good way to hash floats?

You don't have to give me the function directly, but I'd like to see/understand different concepts...

Notes:

  1. I don't need it to be really fast, just evenly distributed if possible.

  2. I've read that floats should not be hashed because of the speed of computation, can someone confirm/explain this and give me other reasons why floats should not be hashed? I don't really understand why (besides the speed)

6条回答
够拽才男人
2楼-- · 2020-01-31 06:02

You can of course represent a float as an int type of the same size to hash it, however this naive approach has some pitfalls you need to be careful of...

Simply converting to a binary representation is error prone since values which are equal wont necessarily have the same binary representation.

An obvious case: -0.0 wont match 0.0 for example. *

Further, simply converting to an int of the same size wont give very even distribution, which is often important (implementing a hash/set that uses buckets for example).

Suggested steps for implementation:

  • filter out non-finite cases (nan, inf) and (0.0, -0.0 whether you need to do this explicitly or not depends on the method used).
  • convert to an int of the same size
    (that is - use a union for example to represent the float as an int, not simply cast to an int).
  • re-distribute the bits, (intentionally vague here!), this is basically a speed vs quality tradeoff. But if you have many values in a small range you probably don't want them to in a similar range too.

*: You may wan't to check for (nan and -nan) too. How to handle those exactly depends on your use case (you may want to ignore sign for all nan's as CPython does).

Python's _Py_HashDouble is a good reference for how you might hash a float, in production code (ignore the -1 check at the end, since that's a special value for Python).

查看更多
疯言疯语
3楼-- · 2020-01-31 06:14

You can use the std hash, it's not bad:

 std::size_t myHash = std::cout << std::hash<float>{}(myFloat);
查看更多
Ridiculous、
4楼-- · 2020-01-31 06:15

If you're interested, I just made a hash function that uses floating point and can hash floats. It also passes SMHasher ( which is the main bias-test for non-crypto hash functions ). It's a lot slower than normal non-cryptographic hash functions due to the float calculations.

I'm not sure if tifuhash will become useful for all applications, but it's interesting to see a simple floating point function pass both PractRand and SMHasher.

The main state update function is very simple, and looks like:

function q( state, val, numerator, denominator ) {
  // Continued Fraction mixed with Egyptian fraction "Continued Egyptian Fraction"
  // with denominator = val + pos / state[1]
  state[0] += numerator / denominator;
  state[0] = 1.0 / state[0];

  // Standard Continued Fraction with a_i = val, b_i = (a_i-1) + i + 1
  state[1] += val;
  state[1] = numerator / state[1];
}

Anyway, you can get it on npm Or you can check out the github

Using is simple:

const tifu = require('tifuhash');

const message = 'The medium is the message.';
const number = 333333333;
const float = Math.PI;

console.log( tifu.hash( message ), 
  tifu.hash( number ),
  tifu.hash( float ),
tifu.hash( ) );

There's a demo of some hashes on runkit here https://runkit.com/593a239c56ebfd0012d15fc9/593e4d7014d66100120ecdb9

Side note: I think that in future using floating point,possibly big arrays of floating point calculations, could be a useful way to make more computationally-demanding hash functions in future. A weird side effect I discovered of using floating point is that the hashes are target dependent, and I surmise maybe they could be use to fingerprint the platforms they were calculated on.

查看更多
爷的心禁止访问
5楼-- · 2020-01-31 06:18

If your hash function did the following you'd get some degree of fuzziness on the hash lookup

unsigned int Hash( float f )
{
    unsigned int ui;
    memcpy( &ui, &f, sizeof( float ) );
    return ui & 0xfffff000;
}

This way you'll mask off the 12 least significant bits allowing for a degree of uncertainty ... It really depends on yout application however.

查看更多
劳资没心,怎么记你
6楼-- · 2020-01-31 06:18
unsigned hash(float x)
{
    union
    {
        float f;
        unsigned u;
    };
    f = x;
    return u;
}

Technically undefined behavior, but most compilers support this. Alternative solution:

unsigned hash(float x)
{
    return (unsigned&)x;
}

Both solutions depend on the endianness of your machine, so for example on x86 and SPARC, they will produce different results. If that doesn't bother you, just use one of these solutions.

查看更多
太酷不给撩
7楼-- · 2020-01-31 06:25

It depends on the application but most of time floats should not be hashed because hashing is used for fast lookup for exact matches and most floats are the result of calculations that produce a float which is only an approximation to the correct answer. The usually way to check for floating equality is to check if it is within some delta (in absolute value) of the correct answer. This type of check does not lend itself to hashed lookup tables.

EDIT:

Normally, because of rounding errors and inherent limitations of floating point arithmetic, if you expect that floating point numbers a and b should be equal to each other because the math says so, you need to pick some relatively small delta > 0, and then you declare a and b to be equal if abs(a-b) < delta, where abs is the absolute value function. For more detail, see this article.

Here is a small example that demonstrates the problem:

float x = 1.0f;
x = x / 41;
x = x * 41;
if (x != 1.0f)
{
    std::cout << "ooops...\n";
}

Depending on your platform, compiler and optimization levels, this may print ooops... to your screen, meaning that the mathematical equation x / y * y = x does not necessarily hold on your computer.

There are cases where floating point arithmetic produces exact results, e.g. reasonably sized integers and rationals with power-of-2 denominators.

查看更多
登录 后发表回答