Many books and tutorials say that the size of a hash table must be a prime to evenly distribute the keys in all the buckets. But Java's HashMap
always uses a size that is a power of two. Shouldn't it be using a prime? What's better, a "prime" or a "power of two" as the hash table size?
问题:
回答1:
Using a power of two effectively masks out top bits of the hash code. Thus a poor-quality hash function might perform particularly badly in this scenario.
Java's HashMap
mitigates this by mistrusting the object's hashCode()
implementation and applying a second level of hashing to its result:
Applies a supplemental hash function to a given hashCode, which defends against poor quality hash functions. This is critical because HashMap uses power-of-two length hash tables, that otherwise encounter collisions for hashCodes that do not differ in lower bits.
If you have a good hash function, or do something similar to what HashMap
does, it does not matter whether you use prime numbers etc as the table size.
If, on the other hand, the hash function is of unknown or poor quality, then using a prime number would be a safer bet. It will, however, make dynamically-sized tables tricker to implement, since all of a sudden you need to be able to produce prime numbers instead of just multiplying the size by a constant factor.
回答2:
The standard HashMap implementation has a hash
method which rehashes your object's hashcode to avoid that pitfall. The comment before the hash()
method reads:
/**
* Retrieve object hash code and applies a supplemental hash function to the
* result hash, which defends against poor quality hash functions. This is
* critical because HashMap uses power-of-two length hash tables, that
* otherwise encounter collisions for hashCodes that do not differ
* in lower bits. Note: Null keys always map to hash 0, thus index 0.
*/
回答3:
The only way to know which is better between prime and power-of-two is to benchmark it.
Many years ago, when writing an assembler whose performance depended strongly on symbol talbe lookup, I tested this using a large block of generated identifiers. Even with a naive mapping, I found that power-of-two, as expected, had less even distribution and longer chains than a similar sized prime number of buckets. It still ran faster, because of the speed of bucket selection by bit masking.
I strongly suspect the java.util developers would not have resorted to the extra hashing and power-of-two without benchmarking it against using a prime number of buckets. It is a really obvious thing to do when designing a hashed data structure.
For that reason, I'm sure the rehash and power-of-two size gives better performance for typical Java hash maps than a prime number of buckets.
回答4:
From a performance/calculation time point of view power-of-two sizes can be calculated with just bit masking which is faster than integer division which would be required otherwise.
回答5:
You probably should use prime sized hash tables if you use quadratic probing for collision resolution. If you have a prime sized table, quadratic probing will hit half of the entries, less if it is not a prime. So you might not find a suitable place to store you entry even if your hash table is less than half full. Since Java hash maps don't use quadratic probing, there is no need to use primes as size.