When implementing a hash table using a good hash function (one where the probability of any two elements colliding is 1/m, where m is the number of buckets), it is well-known that the average-case running time for looking up an element is Θ(1 + α), where α is the load factor. The worst-case running time is O(n), though, if all the elements end up in the same bucket.
I was recently doing some reading on hash tables and found this article, which claims (on page 3) that if α = 1, the expected worst-case complexity is Θ(log n / log log n). By "expected worst-case complexity" I mean the expected value of the maximum amount of work you'll have to do if the elements are distributed by a uniform hash function. This is different from the actual worst case, since the worst-case behavior (all elements in the same bucket) is extremely unlikely to actually occur.
My question is the following: the author seems to suggest that varying the value of α can change the expected worst-case complexity of a lookup. Does anyone know of a formula, table, or article somewhere that discusses how changing α changes the expected worst-case runtime?
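To make "expected worst-case" concrete, here is a rough Python sketch of the quantity I mean (the helper names are just for illustration): it estimates the expected length of the fullest bucket at α = 1 by averaging over random trials.

```python
import random

def max_bucket_length(n, m):
    """Throw n keys into m buckets uniformly at random and
    return the size of the fullest bucket."""
    counts = [0] * m
    for _ in range(n):
        counts[random.randrange(m)] += 1
    return max(counts)

def expected_max_bucket_length(n, m, trials=100):
    """Monte Carlo estimate of the expected worst-case chain length."""
    return sum(max_bucket_length(n, m) for _ in range(trials)) / trials

# alpha = n/m = 1: the expected fullest bucket grows very slowly with n
for n in [10**3, 10**4, 10**5]:
    print(n, expected_max_bucket_length(n, n))
```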
For fixed α, the expected worst time is always Θ(log n / log log n). However, if you make α a function of n, then the expected worst time can change. For instance, if α = Θ(n), then the expected worst time is Θ(n) (that's the case where you have a fixed number of hash buckets).
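As a quick illustration of the two regimes (a minimal sketch, assuming uniform random hashing), the following compares m = n, so α = 1, against a fixed m = 64, so α = Θ(n):

```python
import math
import random

def max_bucket_length(n, m):
    """Size of the fullest of m buckets after n uniform random insertions."""
    counts = [0] * m
    for _ in range(n):
        counts[random.randrange(m)] += 1
    return max(counts)

for n in [10**4, 10**5, 10**6]:
    fixed_alpha = max_bucket_length(n, n)   # m = n, so alpha = 1
    fixed_m = max_bucket_length(n, 64)      # m = 64, so alpha = n/64 = Theta(n)
    predicted = math.log(n) / math.log(math.log(n))
    print(f"n={n:>7}  alpha=1: max={fixed_alpha} (log n/log log n ~ {predicted:.1f})"
          f"  m=64: max={fixed_m}")
```

Expect the α = 1 column to grow very slowly (a small constant multiple of log n / log log n, since the asymptotic constant converges slowly) and the m = 64 column to grow linearly in n.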
In general the distribution of items into buckets is approximately a Poisson distribution: the odds of a random bucket having exactly i items are α^i e^(-α) / i!. The worst case is just the m'th worst out of m close-to-independent observations. (Not entirely independent, but fairly close to it.) The m'th worst out of m observations tends to be something whose odds of happening are about 1/m. (More precisely, the distribution is given by a Beta distribution, but for our analysis 1/m is good enough.)
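The Poisson approximation is easy to check numerically; here is a sketch (α = 1, names illustrative) comparing the observed fraction of buckets holding exactly i items against α^i e^(-α) / i!:

```python
import math
import random
from collections import Counter

n = m = 10**6  # alpha = n/m = 1
counts = [0] * m
for _ in range(n):
    counts[random.randrange(m)] += 1

sizes = Counter(counts)  # sizes[i] = number of buckets holding exactly i items
alpha = n / m
for i in range(8):
    observed = sizes[i] / m
    poisson = alpha**i * math.exp(-alpha) / math.factorial(i)
    print(f"i={i}  observed={observed:.5f}  Poisson={poisson:.5f}")
```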
As you head into the tail of the Poisson distribution, the growth of the i! term dominates everything else, so the cumulative probability of everything above a given i is smaller than the probability of selecting i itself. So to a good approximation you can figure out the expected value by solving:

α^i e^(-α) / i! = 1/m = 1/(n/α) = α/n
Take logs of both sides (using Stirling's approximation log(i!) = i log(i) - i + O(log(i))) and we get:

i log(α) - α - (i log(i) - i + O(log(i))) = log(α) - log(n)

log(n) - α = i log(i) - i - i log(α) + O(log(i))
If we hold α constant then this is:

log(n) = i log(i) + O(i)
Can this work if i has the form k log(n) / log(log(n)) with k = Θ(1)? Let's try it:

log(n) = (k log(n) / log(log(n))) (log(k) + log(log(n)) - log(log(log(n)))) + O(log(n) / log(log(n)))

= k log(n) + o(log(n))

which forces k = 1 + o(1).
And then we get the sharper estimate that, for any fixed load factor α, the expected worst time is (1 + o(1)) log(n) / log(log(n)).
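As a numeric sanity check of the derivation (a sketch, assuming α = 1; expect the ratio to close in on 1 only very slowly, since the o(1) term decays slowly), the following finds the smallest i whose Poisson probability drops below α/n and compares it with log n / log log n:

```python
import math

def solve_for_i(n, alpha=1.0):
    """Smallest i with alpha^i e^-alpha / i! <= alpha/n, by direct search."""
    target = alpha / n
    i, p = 0, math.exp(-alpha)       # p = Poisson pmf at i
    while p > target:
        i += 1
        p *= alpha / i               # pmf recurrence: p(i) = p(i-1) * alpha / i
    return i

for exponent in [4, 6, 9, 12]:
    n = 10**exponent
    predicted = math.log(n) / math.log(math.log(n))
    print(f"n=1e{exponent}  i={solve_for_i(n)}  log n/log log n={predicted:.2f}")
```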
After some searching, I came across this research paper that gives a complete analysis of the expected worst-case behavior of a whole bunch of different types of hash tables, including chained hash tables. The author gives as an answer that the expected length is approximately Γ⁻¹(m), where m is the number of buckets and Γ is the Gamma function. Assuming that α is a constant, this is approximately ln m / ln ln m.
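To sanity-check that last approximation (a sketch; it inverts the Gamma function by bisection on math.lgamma, which is monotone for x ≥ 2), one can compare Γ⁻¹(m) with ln m / ln ln m; the ratio approaches 1 only very slowly:

```python
import math

def gamma_inverse(y):
    """Solve Gamma(x) = y for x >= 2 by bisection on log Gamma."""
    target = math.log(y)
    hi = 2.0
    while math.lgamma(hi) < target:  # grow an upper bracket
        hi *= 2
    lo = hi / 2 if hi > 2.0 else 2.0
    for _ in range(60):              # bisect; lgamma is increasing on [2, inf)
        mid = (lo + hi) / 2
        if math.lgamma(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for exponent in [3, 6, 9, 12]:
    m = 10**exponent
    approx = math.log(m) / math.log(math.log(m))
    print(f"m=1e{exponent}  Gamma^-1(m)={gamma_inverse(m):.2f}"
          f"  ln m/ln ln m={approx:.2f}")
```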
Hope this helps!