String to Integer Hashing Function with Precision

2019-02-24 02:21发布

问题:

I want to hash a char array in to an int or a long. The resulting value has to adhere to a given precision value. The function I've been using is given below:

int GetHash(const char* zKey, int iPrecision /*= 6*/)
{
        /////FROM : http://courses.cs.vt.edu/~cs2604/spring02/Projects/4/elfhash.cpp

        unsigned long h = 0;
        long M = pow(10, iPrecision);

        while(*zKey)
        {
                h = (h << 4) + *zKey++;
                unsigned long g = h & 0xF0000000L;
                if (g) h ^= g >> 24;
                h &= ~g;
        }            

        return (int) (h % M);
}

The string to be hashed is similar to "SAEUI1210.00000010_1".

However, this produces duplicate values in some cases. Are there any good alternatives which wouldn't duplicate the same hash for different string values.

回答1:

The very definition of a hash is that it produces duplicate values for some values, due to hash value range being smaller than the space of the hashed data.

In theory, a 32-bit hash has enough range to hash all ~6 character strings (A-Z,a-z,0-9 only), without causing a collision. In practice, hashes are not a perfect permutation of the input. Given a 32-bit hash, you can expect to get hash collisions after hashing ~16 bit of random inputs, due to the birthday paradox.

Given a static set of data values, it's always possible to construct a hash function designed specifically for them, which will never collide with itself (of course, size of its output will be at least log(|data set|). However, it requires you to know all the possible data values ahead of time. This is called perfect hashing.

That being said, here are a few alternatives which should get you started (they are designed to minimize collisions)



回答2:

Every hash will have collisions. Period. That's called a Birthday Problem.

You may want to check cryptographic has functions like MD5 (relatively fast and you don't care that it's insecure) but it also will have collisions.



回答3:

Hashes generate the same value for different inputs -- that's what they do. All you can do is create a hash function with sufficient distribution or bit depth (or both) to minimize those collisions. Since you have this additional constraint of precision (0-5 ?) then you are going to hit collisions far more often.



回答4:

MD5 or SHA. There are many open implementations, and the outcome is very unlikely to produce a duplicate result.



标签: c++ hash