I'm wondering what is the complexity of the function most_common
provided by the collections.Counter
object in python 2.7.
More specifically, is the Counter
keeping some kind of sorted list while it is being updated, allowing to perform the most_common
operation faster than O(n)
when n
is the number of (unique) items added to the counter? For you information, I am processing some large amount of text data trying to find the n-th most frequent tokens.
I checked the official documentation (https://docs.python.org/2/library/collections.html#collections.Counter), the CPython wiki (https://wiki.python.org/moin/TimeComplexity) but I could not find the answer. Thank you in advance!
Romain.
The source show exactly what happens:
From the source code of collections.py, we see that if we don't specify a number of returned elements,
most_common
returns a sorted list of the counts. This is anO(n log n)
algorithm.If we use
most_common
to returnk > 1
elements, then we use nlargest method ofheapq
. This is anO(k) + O((n - k) log k) + O(k log k)
algorithm, which is very good for a small constantk
, since it's essentialy linear. TheO(k)
part comes from heapifying the initialk
counts, the second part fromn - k
calls toheappushpop
method and the third part from sorting the final heap ofk
elements. Sincek <= n
we can conclude that the complexity is:If
k = 1
then it's easy to show that the complexity is: