Hashing function for four unsigned integers (C++)

2020-05-21 04:21发布

I'm writing a program right now which produces four unsigned 32-bit integers as output from a certain function. I'm wanting to hash these four integers, so I can compare the output of this function to future outputs.

I'm having trouble writing a decent hashing function though. When I originally wrote this code, I threw in a simple addition of each of the four integers, which I knew would not suffice. I've tried several other techniques, such as shifting and adding, to no avail. I get a hash, but it's of poor quality, and the function generate a ton of collisions.

The hash output can be either a 32-bit or 64-bit integer. The function in question generates many billions of hashes, so collisions are a real problem here, and I'm willing to use a larger variable to ensure that there are as few collisions as possible.

Can anyone help me figure out how to write a quality hash function?

标签: c++ hash integer
7条回答
啃猪蹄的小仙女
2楼-- · 2020-05-21 04:55

Try using CRC or FNV. FNV is nice because it is fast and has a defined method of folding bits to get "smaller" hash values (i.e. 12-bit / 24-bit / etc).

Also the benefit of generating a 64-bit hash from a 128-bit (4 X 32-bit) number is a bit questionable because as other people have suggested, you could just use the original value as a key in a set. You really want the number of bits in the hash to represent the number of values you originally have. For example, if your dataset has 100,000 4X32-bit values, you probably want a 17-bit or 18-bit hash value, not a 64-bit hash.

查看更多
爷、活的狠高调
3楼-- · 2020-05-21 04:59

Here's a fairly reasonable hash function from 4 integers to 1 integer:

unsigned int hash = in[0];
hash *= 37;
hash += in[1];
hash *= 37;
hash += in[2];
hash *= 37;
hash += in[3];

With uniformly-distributed input it gives uniformly-distributed output. All bits of the input participate in the output, and every input value (although not every input bit) can affect every output bit. Chances are it's faster than the function which produces the output, in which case no performance concerns.

There are other hashes with other characteristics, but accumulate-with-multiplication-by-prime is a good start until proven otherwise. You could try accumulating with xor instead of addition if you like. Either way, it's easy to generate collisions (for example {1, 0, a, b} collides with {0, 37, a, b} for all a, b), so you might want to pick a prime which you think has nothing to do with any plausible implementation bug in your function. So if your function has a lot of modulo-37 arithmetic in it, maybe use 1000003 instead.

查看更多
Deceive 欺骗
4楼-- · 2020-05-21 05:09

Why a hash? It seems like a std::set or std::multi set would be better suited to store this kind of output. All you'd need to do is wrap the four integers up in a struct and write a simple compare function.

查看更多
成全新的幸福
5楼-- · 2020-05-21 05:10

I fully agree with Vinko - just compare them all. If you still want a good hashing function, you need to analyse the distribution of your 4 unsinged integers. Then you have to craft your hashing function in a way, that the result will be even distributed over the whole range of the 32 bit hashing value.

A simple example - let's just assume that most of the time, the result from each function is in the range from 0 to 255. Then you could easily blend the lower 8 bits from each function into your hash. Most of the time, you'd finde the result directly, just sometimes (when one function returns a larger result) you'd have a collision.

To sum it up - without information how the results of the 4 functions are distributed, we can't help you with a good hashing function.

查看更多
混吃等死
6楼-- · 2020-05-21 05:11

Might be a bit overkill, but consider Boost.Hash. Generates very simple code and good values.

查看更多
该账号已被封号
7楼-- · 2020-05-21 05:17

Why don't you store the four integers in a suitable data structure and compare them all? The benefit of hashing them in this case appears dubious to me, unless storage is a problem.

If storage is the issue, you can use one of the hash functions analyzed here.

查看更多
登录 后发表回答