How to pick prime numbers to calculate the hash code

Posted 2020-02-26 00:37

This question follows on from the answer given by Jon Skeet to the question "What is the best algorithm for an overridden System.Object.GetHashCode?". There, the following algorithm is used to calculate the hash code:

public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = 17;
        // Suitable nullity checks etc, of course :)
        hash = hash * 23 + field1.GetHashCode();
        hash = hash * 23 + field2.GetHashCode();
        hash = hash * 23 + field3.GetHashCode();
        return hash;
    }
}

I don't understand why the numbers 17 and 23 are chosen. Why don't we pick 3 and 5? Those are prime numbers as well. Can somebody explain which primes are best to pick, and why?

1 Answer
何必那么认真 · 2020-02-26 01:14

The comments on the answer you link to already briefly try to explain why 17 and 23 are not good primes to use here.

A lot of .NET classes that make use of hash codes store elements in buckets. Suppose there are three buckets. Then all objects with hash code 0, 3, 6, 9, ... get stored in bucket 0. All objects with hash code 1, 4, 7, 10, ... get stored in bucket 1. All objects with hash code 2, 5, 8, 11, ... get stored in bucket 2.

Now suppose that your GetHashCode() uses hash = hash * 3 + field3.GetHashCode();. Unless hash is large enough for the multiplication to wrap around, hash * 3 is always divisible by 3, so in a hash set with three buckets, which bucket an object ends up in depends only on field3.

With an uneven distribution of objects across buckets, HashSet<T> cannot give good performance.
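As a rough illustration of that degeneracy (the three-field hash and the bucket count below are hypothetical, not any real container's internals), every combination of the first two fields lands in the same bucket:

using System;

class BucketDemo
{
    // Same shape as the pattern above, but with the bad factor 3.
    static int BadHash(int field1, int field2, int field3)
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 3 + field1;   // int.GetHashCode() returns the int itself
            hash = hash * 3 + field2;
            hash = hash * 3 + field3;
            return hash;
        }
    }

    static void Main()
    {
        const int buckets = 3;
        // Vary field1 and field2 freely; the bucket never changes.
        for (int f1 = 0; f1 < 3; f1++)
            for (int f2 = 0; f2 < 3; f2++)
            {
                int bucket = Math.Abs(BadHash(f1, f2, 5)) % buckets;
                Console.WriteLine($"({f1}, {f2}, 5) -> bucket {bucket}");
            }
        // Every line prints "bucket 2": the hash is 17*27 + 9*f1 + 3*f2 + 5,
        // and 9*f1 + 3*f2 is always divisible by 3, so only field3 matters.
    }
}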

You want a factor that is co-prime to all possible numbers of buckets. The number of buckets itself will typically be prime, for the same reasons, so if your factor is prime, the only risk is that it is equal to the number of buckets.

.NET uses a fixed list of allowed numbers of buckets:

public static readonly int[] primes = {
    3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107, 131, 163, 197, 239, 293, 353, 431, 521, 631, 761, 919,
    1103, 1327, 1597, 1931, 2333, 2801, 3371, 4049, 4861, 5839, 7013, 8419, 10103, 12143, 14591,
    17519, 21023, 25229, 30293, 36353, 43627, 52361, 62851, 75431, 90523, 108631, 130363, 156437,
    187751, 225307, 270371, 324449, 389357, 467237, 560689, 672827, 807403, 968897, 1162687, 1395263,
    1674319, 2009191, 2411033, 2893249, 3471899, 4166287, 4999559, 5999471, 7199369};
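For illustration only (GetPrimeCapacity is a hypothetical helper, not .NET's actual API), a hash table built on that list would pick the first listed prime at least as large as the requested capacity:

static int GetPrimeCapacity(int min)
{
    // Scan the primes table above for the first entry >= min.
    foreach (int p in primes)
    {
        if (p >= min)
            return p;
    }
    throw new ArgumentOutOfRangeException(nameof(min), "larger than the table supports");
}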

Your factor should be one that .NET doesn't use, and that other custom implementations are equally unlikely to use. This means 23 is a bad factor. 31 could be okay with .NET's own containers, but could be equally bad with custom implementations.

At the same time, it should not be so low that it gives lots of collisions for common uses. This is a risk with 3 and 5: suppose you have a custom Tuple<int, int> implementation with lots of small integers. Keep in mind that int.GetHashCode() just returns that int itself. Suppose your multiplication factor is 3. That means that (0, 9), (1, 6), (2, 3) and (3, 0) all give the same hash codes.
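To make that collision concrete, here is a small check (TupleHash is a hypothetical custom implementation with factor 3, not the real Tuple<int, int>):

static int TupleHash(int a, int b)
{
    unchecked
    {
        int hash = 17;
        hash = hash * 3 + a;   // int.GetHashCode() returns the int itself
        hash = hash * 3 + b;
        return hash;
    }
}

// TupleHash(0, 9), TupleHash(1, 6), TupleHash(2, 3) and TupleHash(3, 0)
// all evaluate to 17*9 + 3*a + b = 153 + 9 = 162: four distinct tuples,
// one hash code.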

Both problems can be avoided by using sufficiently large primes, as pointed out in a comment that Jon Skeet later incorporated into his answer:

EDIT: As noted in comments, you may find it's better to pick a large prime to multiply by instead. Apparently 486187739 is good...
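Following that advice, a minimal sketch of the same pattern with a large prime might look like this (486187739 comes from the quoted comment; the seed 17 is simply carried over from the original example):

public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = 17;
        hash = hash * 486187739 + field1.GetHashCode();
        hash = hash * 486187739 + field2.GetHashCode();
        hash = hash * 486187739 + field3.GetHashCode();
        return hash;
    }
}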

Once upon a time, large primes may have been a bad choice because multiplication by large integers was slow enough for the difference to be noticeable. Multiplication by 31 was attractive in that case because it can be implemented as x * 31 => x * 32 - x => (x << 5) - x. Nowadays, though, the multiplication is far less likely to cause any performance problems, and then, generally speaking, the bigger the better.
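The shift-and-subtract trick is easy to sanity-check; in wrapping (unchecked) arithmetic the identity holds for every int:

using System;

class Mul31Check
{
    static void Main()
    {
        unchecked
        {
            int x = 123456789;
            // x * 31 and (x << 5) - x wrap identically modulo 2^32.
            Console.WriteLine(x * 31 == (x << 5) - x);   // True
        }
    }
}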
