Size of the hash table

Posted 2019-01-28 13:07

Question:

Let the size of the hash table be static (I set it once). I want to set it according to the number of entries. Searching suggested that the size should be a prime number and equal to 2*N (the closest prime number I guess), where N is the number of entries.

For simplicity, assume that the hash table will not accept any new entries and won't delete any.

The number of entries will be 200, 2000, 20000 and 2000000.

However, setting the size to 2*N seems too much to me. Isn't it? Why? If it is, which size should I pick?

I understand that we would like to avoid collisions. I also understand that maybe there is no such thing as an ideal size for the hash table, but I am looking for a starting point.

I am using C and I want to build my own structure, to educate myself.

Answer 1:

the size should be a prime number and equal to 2*N (the closest prime number I guess), where N is the number of entries.

It certainly shouldn't. This recommendation probably implies that a load factor of 0.5 is a good tradeoff, at least by default.

As for the primality of the size, it depends on the collision resolution algorithm you choose. Some algorithms require a prime table size (double hashing, quadratic probing); others don't, and they can benefit from a power-of-2 table size, because it allows very cheap modulo operations. However, when adjacent "available table sizes" differ by a factor of 2, the memory usage of the hash table can be unpredictable. So, even with linear probing or separate chaining, you may choose a non-power-of-2 size. In that case, in turn, it is worth choosing a prime size specifically, because:

If you pick a prime table size (either because the algorithm requires it, or because you are not satisfied with the unpredictable memory usage implied by a power-of-2 size), the table slot computation (modulo by the table size) can be combined with hashing. See this answer for more.
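The difference between the two slot computations can be sketched as follows. This is only an illustration under assumed names (`round_up_pow2`, `slot_pow2` and `slot_mod` are hypothetical helpers, not part of any library), with a 32-bit hash value assumed as input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest power of two >= n (n must be at least 1 and must not overflow). */
size_t round_up_pow2(size_t n)
{
    assert(n >= 1 && n <= SIZE_MAX / 2 + 1);
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Slot for a power-of-two table: the modulo reduces to a bitwise AND. */
size_t slot_pow2(uint32_t hash, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0); /* must be a power of two */
    return (size_t)hash & (size - 1);
}

/* Slot for any other size (for example a prime): needs an integer division. */
size_t slot_mod(uint32_t hash, size_t size)
{
    assert(size != 0);
    return (size_t)hash % size;
}
```

The memory concern shows up directly in the rounding: for N = 20000, a target of 2*N = 40000 slots rounds up to 65536, and for N = 2000000 a target of 4000000 rounds up to 4194304, whereas a prime size can stay close to the target.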

The point that a power-of-2 table size is undesirable when the hash function's distribution is bad (from the answer by Neil Coffey) is impractical: even if you have a bad hash function, avalanching it and still using a power-of-2 size would be faster than switching to a prime table size, because a single integer division is still slower on modern CPUs than the several multiplications and shift operations required by good avalanching functions, e.g. the one from MurmurHash3.
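As a rough sketch of what "avalanching and then masking" looks like, the block below uses the 32-bit finalizer from MurmurHash3 (usually called fmix32). The wrapper `slot_mixed` and its parameter names are hypothetical, introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit finalizer from MurmurHash3: only xor-shifts and multiplications. */
uint32_t fmix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* Slot in a power-of-two table for a possibly weak hash:
   mix the bits first, then mask instead of dividing. */
size_t slot_mixed(uint32_t weak_hash, size_t size_pow2)
{
    assert(size_pow2 != 0 && (size_pow2 & (size_pow2 - 1)) == 0);
    return (size_t)fmix32(weak_hash) & (size_pow2 - 1);
}
```

Without the mixing step, a weak hash whose values are all multiples of the table size would send every key to slot 0; the finalizer lets the high bits influence the masked result.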

The entries will be 200, 2000, 20000 and 2000000.

I don't understand what you mean by this.

However, setting the size to 2*N seems too much to me. Isn't it? Why? If it is, which size should I pick?

The general rule is called the space-time tradeoff: the more memory you allocate for the hash table, the faster its operations run. Here you can find some charts illustrating this. So, if you think that a table size of about 2 * N would waste memory, you are free to choose a smaller size, but be prepared for operations on the hash table to become slower on average.
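One way to turn this tradeoff into a concrete starting point is to pick a target load factor and derive the size from it. The sketch below assumes a prime size is wanted; `is_prime`, `next_prime` and `table_size_for` are hypothetical helpers written for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Trial-division primality test; fine for one-off sizing of a table. */
bool is_prime(size_t n)
{
    if (n < 2)
        return false;
    if (n % 2 == 0)
        return n == 2;
    for (size_t d = 3; d <= n / d; d += 2)
        if (n % d == 0)
            return false;
    return true;
}

/* Smallest prime >= n. */
size_t next_prime(size_t n)
{
    while (!is_prime(n))
        n++;
    return n;
}

/* Table size for `entries` keys at a target load factor (entries / size). */
size_t table_size_for(size_t entries, double load_factor)
{
    assert(load_factor > 0.0 && load_factor <= 1.0);
    size_t wanted = (size_t)((double)entries / load_factor);
    if ((double)wanted * load_factor < (double)entries)
        wanted++;   /* round up so the real load never exceeds the target */
    return next_prime(wanted);
}
```

For 200 entries, `table_size_for(200, 0.5)` gives 401 and `table_size_for(200, 0.75)` gives 269, so raising the load factor from 0.5 to 0.75 saves about a third of the slots at the cost of longer probe sequences or chains.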

I understand that we would like to avoid collisions. I also understand that maybe there is no such thing as an ideal size for the hash table, but I am looking for a starting point.

It's impossible to avoid collisions completely (remember the birthday paradox? :) A certain ratio of collisions is perfectly ordinary. This ratio only affects the average operation speed; see the previous section.
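The birthday-paradox argument can be made concrete with a small calculation. The sketch below assumes an idealized hash function that places each key uniformly at random; `collision_probability` is a hypothetical helper for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Probability that n keys, hashed uniformly at random into m slots,
   produce at least one collision: 1 - prod_{i=0}^{n-1} (1 - i/m). */
double collision_probability(size_t n, size_t m)
{
    assert(m > 0);
    if (n > m)
        return 1.0;          /* pigeonhole: more keys than slots */
    double p_none = 1.0;
    for (size_t i = 0; i < n; i++)
        p_none *= 1.0 - (double)i / (double)m;
    return 1.0 - p_none;
}
```

With 23 keys and 365 slots the result is already just above 0.5 (the classic birthday figure), and with 200 keys in a 401-slot table a collision is practically certain, which is why the table size only controls how many collisions occur, not whether they occur.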



Answer 2:

The answer to your question depends somewhat on the quality of your hash function. If you have a good quality hash function (i.e. one where on average, the bits of the hash code will be "distributed evenly"), then:

  • the necessity to have a prime number of buckets disappears;
  • you can expect the number of items per bucket to be Poisson distributed.

So firstly, the advice to use a prime number of buckets is essentially a kludge to help alleviate situations where you have a poor hash function. Provided that you have a good quality hash function, it's not clear that there are really any constraints per se on the number of buckets, and one common choice is to use a power of two so that the modulo is just a bitwise AND (though either way, it's not crucial nowadays). A good hash table implementation will include a secondary hash to try to alleviate the situation where the original hash function is of poor quality; see the source code of Java's HashMap for an example.
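A secondary hash can be very small. The sketch below is modeled on the supplemental step in Java 8's `java.util.HashMap`, which XORs the high 16 bits of the hash code into the low 16 bits before masking; the C names `spread` and `bucket_index` are hypothetical and chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold the high 16 bits into the low 16 bits so that they influence
   the bucket index even when the table is small. */
uint32_t spread(uint32_t h)
{
    return h ^ (h >> 16);
}

/* Bucket index in a power-of-two table: spread, then bitwise AND. */
size_t bucket_index(uint32_t h, size_t size_pow2)
{
    assert(size_pow2 != 0 && (size_pow2 & (size_pow2 - 1)) == 0);
    return (size_t)spread(h) & (size_pow2 - 1);
}
```

Hash codes 0x00010000 and 0x00020000 differ only in their high bits, so a plain mask with a 16-bucket table would put both in bucket 0; with the spreading step they land in buckets 1 and 2.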

A common load factor is 0.75 (i.e. you have 100 buckets for every 75 entries). Under the Poisson model, this translates to roughly 47% of buckets being empty and roughly 35% holding exactly one entry, so it's good performance-wise, though of course it also wastes some space. What the "correct" load factor is for you depends on the time/space tradeoff that you want to make.
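The occupancy figures for a given load factor can be checked numerically. The sketch below computes the exact binomial probability that a bucket holds k entries, assuming every entry picks a bucket uniformly at random; for large tables this is very close to a Poisson distribution with mean n/m. The function name `bucket_occupancy` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Probability that a given bucket holds exactly k of n entries when each
   entry independently picks one of m buckets uniformly at random. */
double bucket_occupancy(size_t n, size_t m, size_t k)
{
    assert(m >= 2 && k <= n);
    double p = 1.0;
    for (size_t i = 0; i < n; i++)      /* P(0) = (1 - 1/m)^n */
        p *= 1.0 - 1.0 / (double)m;
    for (size_t j = 0; j < k; j++)      /* P(j+1) = P(j) * (n-j) / ((j+1)(m-1)) */
        p *= (double)(n - j) / ((double)(j + 1) * (double)(m - 1));
    return p;
}
```

For 75000 entries in 100000 buckets (load factor 0.75), `bucket_occupancy(75000, 100000, 0)` is about 0.47 and `bucket_occupancy(75000, 100000, 1)` is about 0.35; rerunning it with other ratios shows how quickly the share of crowded buckets grows as the table shrinks.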

In very high-performance applications, a potential design consideration is also how you actually organise the structure/buckets in memory to maximise CPU cache performance. (The answer to what is the "best" structure is essentially "the one that performs best in your experiments with your data".)
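One layout often tried for cache friendliness is a single flat array with linear probing, so that a probe sequence walks adjacent memory instead of chasing pointers. The block below is a minimal sketch of such a table for the fixed-size, insert-only case described in the question. All names (`FlatTable`, `flat_put`, `flat_get`, and so on) are hypothetical, the keys are assumed to be 32-bit integers, and the multiplicative mixing step is an arbitrary choice for this example; whether this layout wins for a given workload has to be measured.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t key;
    uint32_t value;
    bool     used;
} Slot;

typedef struct {
    Slot  *slots;   /* one contiguous array: probing touches adjacent memory */
    size_t size;    /* fixed at creation, must be a power of two */
    size_t count;   /* number of stored entries */
} FlatTable;

bool flat_init(FlatTable *t, size_t size_pow2)
{
    assert(size_pow2 != 0 && (size_pow2 & (size_pow2 - 1)) == 0);
    t->slots = calloc(size_pow2, sizeof *t->slots);  /* zeroed: all unused */
    t->size = size_pow2;
    t->count = 0;
    return t->slots != NULL;
}

void flat_free(FlatTable *t)
{
    free(t->slots);
    t->slots = NULL;
}

/* Home slot of a key: multiply, fold the high bits down, then mask. */
static size_t flat_home(const FlatTable *t, uint32_t key)
{
    uint32_t h = key * 2654435761u;
    h ^= h >> 16;
    return (size_t)h & (t->size - 1);
}

/* Insert or overwrite. Returns false only if the table is full. */
bool flat_put(FlatTable *t, uint32_t key, uint32_t value)
{
    size_t i = flat_home(t, key);
    for (size_t n = 0; n < t->size; n++, i = (i + 1) & (t->size - 1)) {
        if (!t->slots[i].used) {
            t->slots[i] = (Slot){ key, value, true };
            t->count++;
            return true;
        }
        if (t->slots[i].key == key) {
            t->slots[i].value = value;
            return true;
        }
    }
    return false;
}

/* Look up a key; stores the value in *out and returns true if found. */
bool flat_get(const FlatTable *t, uint32_t key, uint32_t *out)
{
    size_t i = flat_home(t, key);
    for (size_t n = 0; n < t->size; n++, i = (i + 1) & (t->size - 1)) {
        if (!t->slots[i].used)
            return false;        /* an empty slot ends the probe run */
        if (t->slots[i].key == key) {
            *out = t->slots[i].value;
            return true;
        }
    }
    return false;
}
```

For the 200-entry case, calling `flat_init(&t, 512)` keeps the load factor below 0.5. Because nothing is ever deleted, an empty slot reliably terminates a lookup and no tombstones are needed.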



Tags: c hash hashtable