Expected worst-case time complexity of chained has

2019-05-11 11:10发布

问题:

When implementing a hash table using a good hash function (one where the probability of any two elements colliding is 1 / m, where m is the number of buckets), it is well-known that the average-case running time for looking up an element is Θ(1 + α), where α is the load factor. The worst-case running time is O(n), though, if all the elements end up put into the same bucket.

I was recently doing some reading on hash tables and found this article which claims (on page 3) that if α = 1, the expected worst-case complexity is Θ(log n / log log n). By "expected worst-case complexity," I mean, on expectation, the maximum amount of work you'll have to do if the elements are distributed by a uniform hash function. This is different from the actual worst-case, since the worst-case behavior (all elements in the same bucket) is extremely unlikely to actually occur.

My question is the following - the author seems to suggest that differing the value of α can change the expected worst-case complexity of a lookup. Does anyone know of a formula, table, or article somewhere that discusses how changing α changes the expected worst-case runtime?

回答1:

For fixed α, the expected worst time is always Θ(log n / log log n). However if you make α a function of n, then the expected worst time can change. For instance if α = O(n) then the expected worst time is O(n) (that's the case where you have a fixed number of hash buckets).

In general the distribution of items into buckets is approximately a Poisson distribution, the odds of a random bucket having i items is αi e / i!. The worst case is just the m'th worst out of m close to independent observations. (Not entirely independent, but fairly close to it.) The m'th worst out of m observations tends to be something whose odds of happening are about 1/m times. (More precisely the distribution is given by a Β distribution, but for our analysis 1/m is good enough.)

As you head into the tail of the Poisson distribution the growth of the i! term dominates everything else, so the cumulative probability of everything above a given i is smaller than the probability of selecting i itself. So to a good approximation you can figure out the expected value by solving for:

αi e-α / i! = 1/m = 1/(n/α) = α/n

Take logs of both sides and we get:

i log(α) - α - (i log(i) - i + O(log(i)) = log(α) - log(n)
log(n) - α = i log(i) - i - i log(α) + O(log(i))

If we hold α constant then this is:

log(n) = i log(i) + O(i)

Can this work if i has the form k log(n) / log(log(n)) with k = Θ(1)? Let's try it:

log(n) = (k log(n) / log(log(n))) (log(k) + log(log(n)) - log(log(log(n)))) + O(log(log(n)))
       = k (log(n) + o(log(n)) + o(log(n))

And then we get the sharper estimate that, for any fixed load average α, the expected worst time is (1 + o(1)) log(n) / log(log(n))



回答2:

After some searching, I came across this research paper that gives a complete analysis of the expected worst-case behavior of a whole bunch of different types of hash tables, including chained hash tables. The author gives as an answer that the expected length is approximately Γ-1(m), where m is the number of buckets and Γ is the Gamma function. Assuming that α is a constant, this is approximately ln m / ln ln m.

Hope this helps!