Worst-case O(n) algorithm for doing k-selection

2019-01-22 05:12发布

问题:

Apart from the median-of-medians algorithm, is there any other way to do k-selection in worst-case O(n) time? Does implementing median-of-medians make sense; I mean, is the performance advantage good enough for practical purposes ?

回答1:

There is another algorithm for computing kth order statistics based on the soft heap data structure, which is a variant on a standard priority queue that is allowed to "corrupt" some number of the priorities it stores. The algorithm is described in more detail on the Wikipedia article, but the basic idea is to use the soft heap to efficiently (O(n) time) pick a pivot for the partition function that has a guarantee of a good split. In a sense, this is simply a modified version of the median-of-medians algorithm that uses an (arguably) more straightforward approach to choosing the pivot element.

Soft heaps are not particularly intuitive, but there is a pretty good description of them available in this paper ("A simpler implementation and analysis of Chazelle's soft heaps"), which includes a formal description and analysis if the data structure.

However, if you want a really fast, worst-case O(n) algorithm, consider looking into introselect. This algorithm is actually quite brilliant. It starts off by using the quickselect algorithm, which picks a pivot unintelligently and uses it to partition the data. This is extremely fast in practice, but has bad worst-case behavior. Introselect fixes this by keeping track of an internal counter that tracks its progress. If the algorithm ever looks like it's about to degrade to O(n2) time, it switches algorithms and uses something like median-of-medians to ensure the worst-case guarantee. Specifically, it watches how much of the array is discarded at each step, and if some constant number of steps occur before half the input is discarded, the algorithm switches to the median-of-medians algorithm to ensure that the next pivot is good before then restarting using quickselect. This guarantees worst-case O(n) time.

The advantage of this algorithm is that it's extremely fast on most inputs (since quickselect is very fast), but has great worst-case behavior. A description of this algorithm, along with the related sorting algorithm introsort, can be found in this paper ("Introspective Sorting and Selection Algorithms").

Hope this helps!



回答2:

I think that you should really test it and find out what the performance is, when you have N million elements in your container. This algorithm has already been implemented in the STL library (C++) as std::nth_element is guarantueed to be expected O(n). So if you used C++, you could easily run some tests and see if the performance is good enough for what you seek.

A notable exception is C++, which provides a templated nth_element method with a guarantee of expected linear time.



回答3:

It depends. If you're concerned about the worst case happening accidentally, I wouldn't bother. As the data grows large enough to care, the worst case becomes so unlikely that it's hardly worth protecting against.

If you're doing the selection in a situation where a client could provide the data in the worst-case order to do a denial of service on your server, then it's probably worth using a median of medians to assure that the worst-case order doesn't hurt performance to any significant degree.



回答4:

Updated:

There is a linear time algorithm, a modification to quick sort, suggest by quicksort's inventor Hoare himself. I suggest referring to the section 9.3 "Selection in worst-case linear time" in CLRS book. Here is the brief algorithm, assuming we have a method random_partition from quicksort (which uses a random pivot for partition):

FindKth(array, l, u, k)
{
   int m = random_partition(array, l, u);
   if m == k : return array[k] /*we have found the kth element*/
   if m > k: return FindKth(array, l, m-1, k); /* we have found element > kth largest, concentrate on the left partition */
   else: return FindKth(array, m+1, u, k-m); /* find the k-m th element in the right partition */
}

You can also refer to Donald Knuth's TAOCP Vol.3 Sorting and Searching p.633 The beauty of this method is that, the array need not be completely sorted! I think the STL nth_permutation uses this technique, you can refer to the notes section.