After reading Guido's "Sorting a million 32-bit integers in 2MB of RAM using Python", I discovered the heapq module, but the concept is still pretty abstract to me.
One reason is that I don't understand the concept of a heap completely, but I do understand how Guido used it.
Now, besides his kinda crazy example, what would you use the heapq module for?
Must it always be related to sorting or finding a minimum value? Is it only something you use because it's faster than other approaches? Or can you do really elegant things that you couldn't do without it?
The heapq module is commonly used to implement priority queues.
You see priority queues in event schedulers that are constantly adding new events and need to use a heap to efficiently locate the next scheduled event. Some examples include:
- Python's own sched module: http://hg.python.org/cpython/file/2.7/Lib/sched.py#l106
- The Tornado web server: https://github.com/facebook/tornado/blob/master/tornado/ioloop.py#L260
- Twisted internet servers: http://twistedmatrix.com/trac/browser/trunk/twisted/internet/base.py#L712
The heapq docs include priority queue implementation notes which address the common use cases.
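As a rough illustration of the scheduler use case, here is a minimal sketch of a priority queue built on heapq. The names `schedule` and `run_next` and the event descriptions are made up for this example; the real schedulers linked above are more involved. The sequence counter is an assumption borrowed from the usual tie-breaking trick, so that events with equal times come out in insertion order.

```python
import heapq
import itertools

# Hypothetical mini-scheduler; schedule() and run_next() are illustrative names.
events = []                   # the heap is just a plain list
counter = itertools.count()   # tie-breaker for events scheduled at the same time

def schedule(when, action):
    """Add an event; entries are ordered by (when, sequence number)."""
    heapq.heappush(events, (when, next(counter), action))

def run_next():
    """Pop the earliest scheduled event and return (when, action)."""
    when, _, action = heapq.heappop(events)
    return when, action

schedule(5.0, "send heartbeat")
schedule(1.5, "close idle socket")
schedule(3.0, "retry request")
schedule(1.5, "flush log")

# Drain the queue: events come out in time order, ties in insertion order.
order = [run_next() for _ in range(4)]
print(order)
```

Both the push and the pop cost O(log N), which is why this pattern suits a scheduler that keeps adding events while repeatedly asking for the next one.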
In addition, heaps are great for implementing partial sorts. For example, heapq.nsmallest and heapq.nlargest can be much more memory efficient and do many fewer comparisons than a full sort followed by a slice:
>>> from heapq import nlargest
>>> from random import random
>>> nlargest(5, (random() for i in range(1000000)))  # use xrange on Python 2
[0.9999995650034837, 0.9999985756262746, 0.9999971934450994, 0.9999960394998497, 0.9999949126363714]
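To check that the partial sort really gives the same answer as sorting everything and slicing, a small sketch like the following can help. The data size, the fixed seed, and the word list are arbitrary choices for the example.

```python
import heapq
import random

random.seed(0)  # fixed seed only so the run is repeatable
data = [random.random() for _ in range(100000)]

# Partial sort: keeps only the 5 best candidates while scanning the data.
top_heap = heapq.nlargest(5, data)

# Full sort followed by a slice: same result, but sorts all 100000 items.
top_sort = sorted(data, reverse=True)[:5]

# nlargest/nsmallest also accept a key function, like sorted().
words = ["pear", "fig", "banana", "kiwi", "watermelon"]
longest = heapq.nlargest(2, words, key=len)
print(longest)
```

The two approaches agree on the result; the difference is that nlargest never needs the whole input in sorted order, so it can also consume a generator without materializing it.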
Comparing it to a self-balancing binary tree, a heap doesn't seem to gain you much if you just look at complexity:
- Insertion: O(log N) for both
- Remove max element: O(log N) for both
- Build structure from an array of elements: O(N) for heap, O(N log N) for binary tree.
But whereas a binary tree tends to need each node pointing to its children for efficiency, a heap stores its data packed tightly into an array. This allows you to store much more data in a fixed amount of memory.
So for the cases when you only need insertion and max-removal, a heap is perfect and can often use half as much memory as a self-balancing binary tree (and it is much easier to implement if you have to write one yourself). The standard use-case is a priority queue.
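The array layout can be seen directly in Python, where a heap is just a list. The sketch below assumes the usual 0-based layout in which the parent of index i sits at (i - 1) // 2; `is_min_heap` is a helper written for this example, not part of heapq. Note that heapq is a min-heap, so removing the maximum is simulated here by negating the values.

```python
import heapq

values = [9, 4, 7, 1, 8, 2, 6]

# Build the heap in place from an existing list, in linear time.
data = list(values)
heapq.heapify(data)

def is_min_heap(a):
    """Check the heap invariant: every parent is <= its children."""
    return all(a[(i - 1) // 2] <= a[i] for i in range(1, len(a)))

ok_before = is_min_heap(data)
smallest = heapq.heappop(data)   # removes and returns the minimum
ok_after = is_min_heap(data)

# Max-removal with a min-heap: store negated values.
neg = [-x for x in values]
heapq.heapify(neg)
largest = -heapq.heappop(neg)

print(smallest, largest)
```

No per-node child pointers are stored anywhere; the parent/child relationships are implied by the list indices, which is where the memory saving comes from.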
I discovered this by accident while trying to work out how to implement collections.Counter in Python 2.6. Have a look at the implementation and usage of collections.Counter: the Counter itself is a dict subclass, but its most_common method is implemented with heapq (it uses heapq.nlargest when you ask for the top n items).
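To make the heapq connection in collections.Counter concrete, here is a small sketch. It assumes CPython's behaviour, where most_common(n) with an explicit n selects the top items via heapq.nlargest keyed on the count; the sample string is arbitrary.

```python
import heapq
from collections import Counter
from operator import itemgetter

counts = Counter("abracadabra")

# Top 2 most frequent letters via the Counter API.
top2 = counts.most_common(2)

# Roughly the same selection done by hand with heapq:
# pick the 2 (letter, count) pairs with the largest count.
by_hand = heapq.nlargest(2, counts.items(), key=itemgetter(1))

print(top2)
```

So this is another partial-sort use: the counter never has to sort all of its entries just to report the few most frequent ones.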