In heapsort
, the data is stored in something called a "heap". Almost all the implementations I've seen use a flat list for the data structure.
Can someone explain to me why this is?
Why not use nested arrays or an instance of a binary Tree? Isn't explicit better than implicit?
Is it because of implementation difficulties like traversing the structure, or something else?
If you want to see how heapsort can be implemented in Python then look no further than the standard library module heapq
. Python has both C and Python implementations of heapsort and the heapq
module defines the Python ones then overwrites them (if available) with the C ones. That means you can read and understand the Python implementation but get the benefit of the C version if you actually use it.
A quick example of using the module is given at the end:
heap = []
data = [1, 3, 5, 7, 9, 2, 4, 6, 8, 0]
for item in data:
heappush(heap, item)
sort = []
while heap:
sort.append(heappop(heap))
print sort
A heap is represented by a partially sorted list which has the constraint that for every element at index n in the list, the relationship holds that heap[n] <= heap[n*2+1] and heap[n] <= heap[n*2+2]
(ignoring elements that don't exist). This is a simple way to collapse a binary tree down to a simple list for ease of storage.
heappush()
puts a new element into the list maintaining that invariant, heappop()
removes the smallest element. heapify(somelist)
reorders the list in-place to satisfy the invariant.
heapsort is very useful when you want to only sort part of the list (give me the smallest k items), or where you want to process the smallest items while continually receiving new items that go into the list. A good example of the latter would be an operating system task scheduler where you keep a heap of runnable threads in priority order and can quickly pop the highest priority runnable thread off the heap whenever you need to schedule a thread to run.
Edit: There are several reasons why a list/array is more appropriate for heap storage than an explicit tree structure. The most obvious ones are that the explicit tree has a greater memory overhead (either involving pointers within each object or a separate object to be allocated for each object in the heap) and is also slower as any time an object moves within the heap you have to update multiple pointers to children and possibly parents.
Slightly less obvious is that you need to be able to easily get at the last element which is easy in a list but would mean you also need to store and update sibling pointers on each element. The reason why you need to be able to get at the last element easily is that to add an element you make it the last element and then reorder it with respect to its parent and sibling (an O(log n) operation) or to remove the smallest you simply put the current last element in its place and reorder downwards. If you don't have O(1) access to the final element of the tree then both of these operations take a bad performance hit.
You should take a look at it you will find a lot of important thing about the sorting algorithms:
sorting
On the other hand heap is a tree-based data structure with special properties. In case of maximum heap the greatest element is in the root node and if B is a child node of A, then key(A) ≥ key(B).
Heaps are crucial in several efficient graph algorithms such as Dijkstra's algorithm, and in the sorting algorithm heapsort.
You should also check out this about heap sort.
[EDIT] Python-related implementation of heap sort can be found here.