Can someone help explain how can building a heap be O(n) complexity?
Inserting an item into a heap is O(log n)
, and the insert is repeated n/2 times (the remainder are leaves, and can't violate the heap property). So, this means the complexity should be O(n log n)
, I would think.
In other words, for each item we "heapify", it has the potential to have to filter down once for each level for the heap so far (which is log n levels).
What am I missing?
think you're making a mistake. Take a look at this: http://golang.org/pkg/container/heap/ Building a heap isn'y O(n). However, inserting is O(lg(n). I'm assuming initialization is O(n) if you set a heap size b/c the heap needs to allocate space and set up the data structure. If you have n items to put into the heap then yes, each insert is lg(n) and there are n items, so you get n*lg(n) as u stated
Intuitively:
Not quite. Your logic does not produce a tight bound -- it over estimates the complexity of each heapify. If built from the bottom up, insertion (heapify) can be much less than
O(log(n))
. The process is as follows:( Step 1 ) The first
n/2
elements go on the bottom row of the heap.h=0
, so heapify is not needed.( Step 2 ) The next
n/22
elements go on the row 1 up from the bottom.h=1
, heapify filters 1 level down.( Step i ) The next
n/2i
elements go in rowi
up from the bottom.h=i
, heapify filtersi
levels down.( Step log(n) ) The last
n/2log2(n) = 1
element goes in rowlog(n)
up from the bottom.h=log(n)
, heapify filterslog(n)
levels down.NOTICE: that after step one,
1/2
of the elements(n/2)
are already in the heap, and we didn't even need to call heapify once. Also, notice that only a single element, the root, actually incurs the fulllog(n)
complexity.Theoretically:
The Total steps
N
to build a heap of sizen
, can be written out mathematically.At height
i
, we've shown (above) that there will ben/2i+1
elements that need to call heapify, and we know heapify at heighti
isO(i)
. This gives:The solution to the last summation can be found by taking the derivative of both sides of the well known geometric series equation:
Finally, plugging in
x = 1/2
into the above equation yields2
. Plugging this into the first equation gives:Thus, the total number of steps is of size
O(n)
Successive insertions can be described by:
By starling approximation,
n! =~ O(n^(n + O(1)))
, thereforeT =~ O(nlog(n))
Hope this helps, the optimal way
O(n)
is using the build heap algorithm for a given set (ordering doesn't matter).I think there are several questions buried in this topic:
Often, answers to these questions focus on the difference between
siftUp
andsiftDown
. Making the correct choice betweensiftUp
andsiftDown
is critical to get O(n) performance forbuildHeap
, but does nothing to help one understand the difference betweenbuildHeap
andheapSort
in general. Indeed, proper implementations of bothbuildHeap
andheapSort
will only usesiftDown
. ThesiftUp
operation is only needed to perform inserts into an existing heap, so it would be used to implement a priority queue using a binary heap, for example.I've written this to describe how a max heap works. This is the type of heap typically used for heap sort or for a priority queue where higher values indicate higher priority. A min heap is also useful; for example, when retrieving items with integer keys in ascending order or strings in alphabetical order. The principles are exactly the same; simply switch the sort order.
The heap property specifies that each node in a binary heap must be at least as large as both of its children. In particular, this implies that the largest item in the heap is at the root. Sifting down and sifting up are essentially the same operation in opposite directions: move an offending node until it satisfies the heap property:
siftDown
swaps a node that is too small with its largest child (thereby moving it down) until it is at least as large as both nodes below it.siftUp
swaps a node that is too large with its parent (thereby moving it up) until it is no larger than the node above it.The number of operations required for
siftDown
andsiftUp
is proportional to the distance the node may have to move. ForsiftDown
, it is the distance from the bottom of the tree, sosiftDown
is expensive for nodes at the top of the tree. WithsiftUp
, the work is proportional to the distance from the top of the tree, sosiftUp
is expensive for nodes at the bottom of the tree. Although both operations are O(log n) in the worst case, in a heap, only one node is at the top whereas half the nodes lie in the bottom layer. So it shouldn't be too surprising that if we have to apply an operation to every node, we would prefersiftDown
oversiftUp
.The
buildHeap
function takes an array of unsorted items and moves them until they all satisfy the heap property, thereby producing a valid heap. There are two approaches one might take forbuildHeap
using thesiftUp
andsiftDown
operations we've described.Start at the top of the heap (the beginning of the array) and call
siftUp
on each item. At each step, the previously sifted items (the items before the current item in the array) form a valid heap, and sifting the next item up places it into a valid position in the heap. After sifting up each node, all items satisfy the heap property.Or, go in the opposite direction: start at the end of the array and move backwards towards the front. At each iteration, you sift an item down until it is in the correct location.
Both of these solutions will produce a valid heap. The question is: which implementation for
buildHeap
is more efficient? Unsurprisingly, it is the second operation that usessiftDown
.Let h = log n represent the height of the heap. The work required for the
siftDown
approach is given by the sumEach term in the sum has the maximum distance a node at the given height will have to move (zero for the bottom layer, h for the root) multiplied by the number of nodes at that height. In contrast, the sum for calling
siftUp
on each node isIt should be clear that the second sum is larger. The first term alone is hn/2 = 1/2 n log n, so this approach has complexity at best O(n log n). But how do we prove that the sum for the
siftDown
approach is indeed O(n)? One method (there are other analyses that also work) is to turn the finite sum into an infinite series and then use Taylor series. We may ignore the first term, which is zero:If you aren't sure why each of those steps works, here is a justification for the process in words:
Since the infinite sum is exactly n, we conclude that the finite sum is no larger, and is therefore, O(n).
The next question is: if it is possible to run
buildHeap
in linear time, why does heap sort require O(n log n) time? Well, heap sort consists of two stages. First, we callbuildHeap
on the array, which requires O(n) time if implemented optimally. The next stage is to repeatedly delete the largest item in the heap and put it at the end of the array. Because we delete an item from the heap, there is always an open spot just after the end of the heap where we can store the item. So heap sort achieves a sorted order by successively removing the next largest item and putting it into the array starting at the last position and moving towards the front. It is the complexity of this last part that dominates in heap sort. The loop looks likes this:Clearly, the loop runs O(n) times (n - 1 to be precise, the last item is already in place). The complexity of
deleteMax
for a heap is O(log n). It is typically implemented by removing the root (the largest item left in the heap) and replacing it with the last item in the heap, which is a leaf, and therefore one of the smallest items. This new root will almost certainly violate the heap property, so you have to callsiftDown
until you move it back into an acceptable position. This also has the effect of moving the next largest item up to the root. Notice that, in contrast tobuildHeap
where for most of the nodes we are callingsiftDown
from the bottom of the tree, we are now callingsiftDown
from the top of the tree on each iteration! Although the tree is shrinking, it doesn't shrink fast enough: The height of the tree stays constant until you have removed the first half of the nodes (when you clear out the bottom layer completely). Then for the next quarter, the height is h - 1. So the total work for this second stage isNotice the switch: now the zero work case corresponds to a single node and the h work case corresponds to half the nodes. This sum is O(n log n) just like the inefficient version of
buildHeap
that is implemented using siftUp. But in this case, we have no choice since we are trying to sort and we require the next largest item be removed next.In summary, the work for heap sort is the sum of the two stages: O(n) time for buildHeap and O(n log n) to remove each node in order, so the complexity is O(n log n). You can prove (using some ideas from information theory) that for a comparison-based sort, O(n log n) is the best you could hope for anyway, so there's no reason to be disappointed by this or expect heap sort to achieve the O(n) time bound that
buildHeap
does."The linear time bound of build Heap, can be shown by computing the sum of the heights of all the nodes in the heap, which is the maximum number of dashed lines. For the perfect binary tree of height h containing N = 2^(h+1) – 1 nodes, the sum of the heights of the nodes is N – H – 1. Thus it is O(N)."
As we know the height of a heap is log(n), where n is the total number of elements.Lets represent it as h
When we perform heapify operation, then the elements at last level(h) won't move even a single step.
The number of elements at second last level(h-1) is 2h-1 and they can move at max 1 level(during heapify).
Similarly, for the ith, level we have 2i elements which can move h-i positions.
Therefore total number of moves=S= 2h*0+2h-1*1+2h-2*2+...20*h
S=2h {1/2 + 2/22 + 3/23+ ... h/2h} -------------------------------------------------1
this is AGP series, to solve this divide both sides by 2
S/2=2h {1/22 + 2/23+ ... h/2h+1} -------------------------------------------------2
subtracting equation 2 from 1 gives
S/2=2h {1/2+1/22 + 1/23+ ...+1/2h+ h/2h+1}
S=2h+1 {1/2+1/22 + 1/23+ ...+1/2h+ h/2h+1}
now 1/2+1/22 + 1/23+ ...+1/2h is decreasing GP whose sum is less than 1 (when h tends to infinity, the sum tends to 1). In further analysis, let's take an upper bound on the sum which is 1.
This gives S=2h+1{1+h/2h+1}
=2h+1+h
~2h+h
as h=log(n), 2h=n
Therefore S=n+log(n)
T(C)=O(n)