I was looking at this PyCon talk (34:30), and the speaker says that getting the t largest elements of a list of n elements can be done in O(t + n).

How is that possible? My understanding is that creating the heap will be O(n), but what's the complexity of nlargest itself? Is it O(n + t) or O(t) (and what's the actual algorithm)?
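For concreteness, this is the call I'm asking about, with a toy list (values chosen just for illustration):

```python
import heapq

data = [5, 1, 9, 3, 7, 2, 8]    # n = 7 elements
t = 3
print(heapq.nlargest(t, data))  # [9, 8, 7] -- the t largest, in descending order
```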
The speaker is wrong in this case. The actual cost is O(n * log(t)). Heapify is called only on the first t elements of the iterable. That's O(t), but is insignificant if t is much smaller than n. Then all the remaining elements are added to this "little heap" via heappushpop, one at a time. That takes O(log(t)) time per invocation of heappushpop. The length of the heap remains t throughout. At the very end, the heap is sorted, which costs O(t * log(t)), but that's also insignificant if t is much smaller than n.
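In code, that approach looks roughly like this. It's a simplified sketch, not CPython's exact implementation: it ignores the key= argument, the tie-breaking needed for stability, and the special-case fast paths.

```python
import heapq
import itertools

def nlargest_sketch(t, iterable):
    it = iter(iterable)
    # Min-heap of the t largest elements seen so far: heapify is O(t).
    heap = list(itertools.islice(it, t))
    heapq.heapify(heap)
    # Push each remaining element and pop the current minimum: O(log t) per
    # call, so O(n * log t) overall. The heap never grows past t items.
    for elem in it:
        heapq.heappushpop(heap, elem)
    # Sort the t survivors, largest first: O(t * log t).
    return sorted(heap, reverse=True)
```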
Fun with Theory ;-)

There are reasonably easy ways to find the t-th largest element in expected O(n) time; for example, see here. There are harder ways to do it in worst-case O(n) time. Then, in another pass over the input, you could output the t elements >= the t-th largest (with tedious complications in case of duplicates). So the whole job can be done in O(n) time.
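Here's a sketch of that selection-based idea, using quickselect for the expected-O(n) selection step (the helper and its names are just for illustration; the duplicate handling is the "tedious complication" mentioned above, and it assumes 1 <= t <= n):

```python
import random

def t_largest_via_selection(t, data):
    data = list(data)                # needs all n elements in memory

    def kth_largest(items, k):
        # Quickselect: expected O(len(items)) to find the k-th largest (1-based).
        pivot = random.choice(items)
        greater = [x for x in items if x > pivot]
        if k <= len(greater):
            return kth_largest(greater, k)
        equal_count = sum(1 for x in items if x == pivot)
        if k <= len(greater) + equal_count:
            return pivot
        smaller = [x for x in items if x < pivot]
        return kth_largest(smaller, k - len(greater) - equal_count)

    cutoff = kth_largest(data, t)
    # Second pass: keep everything strictly above the cutoff, then pad with
    # copies of the cutoff value until we have exactly t elements.
    result = [x for x in data if x > cutoff]
    result += [cutoff] * (t - len(result))
    return result
```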
But those ways require O(n) memory too. Python doesn't use them. An advantage of what's actually implemented is that the worst-case "extra" memory burden is O(t), and that can be very significant when the input is, for example, a generator producing a great many values.
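For example, this never materializes all ten million squares (the numbers are just for illustration): nlargest only ever keeps a heap of t = 10 candidates, so the extra memory stays O(t).

```python
import heapq

# The generator is consumed lazily, one value at a time.
top_ten = heapq.nlargest(10, (x * x for x in range(10_000_000)))
```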