I want to hash a char array in to an int or a long. The resulting value has to adhere to a given precision value.
The function I've been using is given below:
int GetHash(const char* zKey, int iPrecision /*= 6*/)
{
/////FROM : http://courses.cs.vt.edu/~cs2604/spring02/Projects/4/elfhash.cpp
unsigned long h = 0;
long M = pow(10, iPrecision);
while(*zKey)
{
h = (h << 4) + *zKey++;
unsigned long g = h & 0xF0000000L;
if (g) h ^= g >> 24;
h &= ~g;
}
return (int) (h % M);
}
The string to be hashed is similar to "SAEUI1210.00000010_1".
However, this produces duplicate values in some cases.
Are there any good alternatives which wouldn't duplicate the same hash for different string values.
The very definition of a hash is that it produces duplicate values for some values, due to hash value range being smaller than the space of the hashed data.
In theory, a 32-bit hash has enough range to hash all ~6 character strings (A-Z,a-z,0-9 only), without causing a collision. In practice, hashes are not a perfect permutation of the input. Given a 32-bit hash, you can expect to get hash collisions after hashing ~16 bit of random inputs, due to the birthday paradox.
Given a static set of data values, it's always possible to construct a hash function designed specifically for them, which will never collide with itself (of course, size of its output will be at least log(|data set|)
. However, it requires you to know all the possible data values ahead of time. This is called perfect hashing.
That being said, here are a few alternatives which should get you started (they are designed to minimize collisions)
Every hash will have collisions. Period. That's called a Birthday Problem.
You may want to check cryptographic has functions like MD5 (relatively fast and you don't care that it's insecure) but it also will have collisions.
Hashes generate the same value for different inputs -- that's what they do. All you can do is create a hash function with sufficient distribution or bit depth (or both) to minimize those collisions. Since you have this additional constraint of precision (0-5 ?) then you are going to hit collisions far more often.
MD5 or SHA. There are many open implementations, and the outcome is very unlikely to produce a duplicate result.