Which sort algorithm works best on mostly sorted data

Published 2019-01-01 11:53

倾城一夜雪

Question:

Which sorting algorithm works best on mostly sorted data?

Answer 1:

Based on the highly scientific method of watching animated GIFs, I would say insertion sort and bubble sort are good candidates.

Answer 2:

Only a few items => INSERTION SORT

Items are mostly sorted already => INSERTION SORT

Concerned about worst-case scenarios => HEAP SORT

Interested in a good average-case result => QUICKSORT

Items are drawn from a dense universe => BUCKET SORT

Desire to write as little code as possible => INSERTION SORT

Answer 3:

Timsort

Timsort is "an adaptive, stable, natural mergesort" with "supernatural performance on many kinds of partially ordered arrays (less than lg(N!) comparisons needed, and as few as N-1)". Python's built-in sort() has used this algorithm for some time, apparently with good results. It's specifically designed to detect and take advantage of partially sorted subsequences in the input, which often occur in real datasets. In the real world, comparisons are often much more expensive than swapping items in a list, since one typically just swaps pointers, and this very often makes Timsort an excellent choice. However, if you know that your comparisons are always very cheap (writing a toy program to sort 32-bit integers, for instance), other algorithms exist that are likely to perform better. The easiest way to take advantage of Timsort is of course to use Python, but since Python is open source you might also be able to borrow the code. Alternatively, the description above contains more than enough detail to write your own implementation.
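One way to see the adaptivity described above is to count comparisons. The following is a minimal sketch, assuming CPython, whose built-in `sorted()` uses a Timsort-derived algorithm. The helper name `count_comparisons` and the input sizes are made up for illustration.

```python
import random
from functools import cmp_to_key


def count_comparisons(items):
    """Sort a copy of items with the built-in sort; return (result, comparisons)."""
    calls = 0

    def compare(x, y):
        nonlocal calls
        calls += 1
        return (x > y) - (x < y)

    result = sorted(items, key=cmp_to_key(compare))
    return result, calls


n = 1000
already_sorted = list(range(n))

# Nearly sorted: a sorted list with two far-apart elements swapped.
nearly_sorted = list(range(n))
nearly_sorted[10], nearly_sorted[900] = nearly_sorted[900], nearly_sorted[10]

# Fully shuffled, with a fixed seed so runs are repeatable.
shuffled = list(range(n))
random.Random(0).shuffle(shuffled)

for name, data in [
    ("sorted", already_sorted),
    ("nearly sorted", nearly_sorted),
    ("shuffled", shuffled),
]:
    _, comparisons = count_comparisons(data)
    print(f"{name}: {comparisons} comparisons")
```

The exact counts depend on the interpreter version, so none are quoted here. The point is only that the sorted and nearly sorted inputs should need far fewer comparisons than the shuffled one.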

Answer 4:

Insertion sort with the following behavior:

For each element k in slots 1..n, first check whether el[k] >= el[k-1]. If so, go to next element. (Obviously skip the first element.)
If not, use binary-search in elements 1..k-1 to determine the insertion location, then scoot the elements over. (You might do this only if k>T where T is some threshold value; with small k this is overkill.)

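This method keeps comparisons to a minimum: one for each element that is already in place, and roughly log2(k) for each element that has to move. The element moves themselves are still linear in the distance moved.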
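The steps above can be sketched in Python roughly as follows. This is an illustration rather than a tuned implementation: it uses the standard `bisect` module for the binary search, it omits the threshold T mentioned above, and the function name `binary_insertion_sort` is made up.

```python
from bisect import bisect_right


def binary_insertion_sort(items):
    """Return a sorted copy; elements already in place cost one comparison."""
    a = list(items)
    for k in range(1, len(a)):
        if a[k] >= a[k - 1]:
            continue  # already in order relative to its predecessor
        value = a[k]
        # Binary search in the sorted prefix a[0:k]; bisect_right places the
        # value after any equal elements, which keeps the sort stable.
        pos = bisect_right(a, value, 0, k)
        # Shift a[pos:k] one slot to the right, then drop the value in.
        a[pos + 1:k + 1] = a[pos:k]
        a[pos] = value
    return a


print(binary_insertion_sort([1, 2, 4, 3, 5, 0]))
```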

Answer 5:

Try introspective sort. http://en.wikipedia.org/wiki/Introsort

It's quicksort based, but it avoids the worst-case behaviour that quicksort has for nearly sorted lists.

The trick is that the algorithm detects the cases where quicksort goes into worst-case mode (typically by tracking recursion depth) and switches to heapsort or merge sort. Nearly sorted partitions are handled by a non-naive partitioning method, and small partitions are handled using insertion sort.

You get the best of all the major sorting algorithms at the cost of more code and complexity. And you can be sure you'll never run into worst-case behaviour, no matter what your data looks like.

If you're a C++ programmer, check your std::sort algorithm. It may already use introspective sort internally.
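To make the idea concrete, here is a compact sketch of an introsort-style routine in Python. It is not the code of any particular standard library. The cutoff `SMALL = 16`, the depth limit of 2 * floor(log2(n)), the median-of-three pivot, and the use of `heapq` for the heapsort fallback are all assumptions chosen for illustration.

```python
import heapq
from math import log2

SMALL = 16  # assumed cutoff below which insertion sort takes over


def _insertion_sort(a, lo, hi):
    """Sort a[lo:hi] in place."""
    for k in range(lo + 1, hi):
        value = a[k]
        j = k - 1
        while j >= lo and a[j] > value:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = value


def _heapsort(a, lo, hi):
    """Sort a[lo:hi] via a binary heap: the O(n log n) fallback."""
    heap = a[lo:hi]
    heapq.heapify(heap)
    a[lo:hi] = [heapq.heappop(heap) for _ in range(len(heap))]


def _introsort(a, lo, hi, depth):
    while hi - lo > SMALL:
        if depth == 0:
            # Quicksort has recursed too deeply: bail out to heapsort.
            _heapsort(a, lo, hi)
            return
        depth -= 1
        # Median-of-three pivot, which behaves well on sorted input.
        pivot = sorted((a[lo], a[(lo + hi) // 2], a[hi - 1]))[1]
        i, j = lo, hi - 1
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        _introsort(a, lo, j + 1, depth)  # recurse on the left part
        lo = i                           # loop on the right part
    _insertion_sort(a, lo, hi)


def introsort(items):
    """Return a sorted copy of items."""
    a = list(items)
    if len(a) > 1:
        _introsort(a, 0, len(a), 2 * int(log2(len(a))))
    return a


print(introsort([5, 1, 4, 2, 3]))
```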

Answer 6:

Splaysort is an obscure sorting method based on splay trees, a type of adaptive binary tree. Splaysort is good not only for partially sorted data, but also for partially reverse-sorted data, or indeed any data that has some kind of pre-existing order. It is O(n log n) in the general case, and O(n) when the data is sorted in some way (forward, reverse, organ-pipe, etc.).

Its great advantage over insertion sort is that it doesn't revert to O(n^2) behaviour when the data isn't sorted at all, so you don't need to be absolutely sure that the data is partially sorted before using it.

Its disadvantage is the extra space overhead of the splay tree structure it needs, as well as the time required to build and destroy the splay tree. But depending on the size of the data and the amount of pre-sortedness you expect, the overhead may be worth it for the increase in speed.

A paper on splaysort was published in Software--Practice & Experience.

Answer 7:

Insertion sort or shell sort!

Answer 8:

Dijkstra's smoothsort is a great sort on already-sorted data. It's a heapsort variant that runs in O(n lg n) worst-case and O(n) best-case. I wrote an analysis of the algorithm, in case you're curious how it works.

Natural mergesort is another really good one for this. It's a bottom-up mergesort variant that works by treating the input as the concatenation of multiple sorted ranges, then using the merge algorithm to join them together. You repeat this process until the whole input range is sorted. This runs in O(n) time if the data is already sorted and O(n lg n) worst-case. It's very elegant, though in practice it isn't as good as some other adaptive sorts like Timsort or smoothsort.
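A minimal Python sketch of natural mergesort as described above. The function names are made up, and the version below builds new lists rather than merging in place, which keeps the code short at the cost of extra memory.

```python
def _merge(left, right):
    """Stable merge of two sorted lists into a new list."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            out.append(right[j])
            j += 1
        else:
            out.append(left[i])
            i += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def natural_merge_sort(items):
    """Return a sorted copy, exploiting runs that are already in order."""
    a = list(items)
    if not a:
        return a

    # Step 1: split the input into maximal non-decreasing runs.
    runs, start = [], 0
    for k in range(1, len(a)):
        if a[k] < a[k - 1]:
            runs.append(a[start:k])
            start = k
    runs.append(a[start:])

    # Step 2: merge neighbouring runs pairwise until one run remains.
    while len(runs) > 1:
        runs = [
            _merge(runs[i], runs[i + 1]) if i + 1 < len(runs) else runs[i]
            for i in range(0, len(runs), 2)
        ]
    return runs[0]


print(natural_merge_sort([1, 2, 3, 7, 8, 4, 5, 6, 9]))
```

On already-sorted input the run detection finds a single run, so the merge loop never executes and the work is one linear scan.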

Answer 9:

Insertion sort takes time O(n + the number of inversions).

An inversion is a pair (i, j) such that i < j && a[i] > a[j]. That is, an out-of-order pair.

One measure of being "almost sorted" is the number of inversions: one could take "almost sorted data" to mean data with few inversions. If one knows the number of inversions to be linear (for instance, you have just appended O(1) elements to a sorted list), insertion sort takes O(n) time.
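The relationship between insertion sort and inversions can be checked directly. In the sketch below (function names are illustrative), each element shift inside insertion sort removes exactly one inversion, so the shift count should match a brute-force inversion count.

```python
def insertion_sort_counting_shifts(items):
    """Return (sorted copy, number of element shifts performed)."""
    a = list(items)
    shifts = 0
    for k in range(1, len(a)):
        value = a[k]
        j = k - 1
        while j >= 0 and a[j] > value:
            a[j + 1] = a[j]  # each shift undoes one inversion
            j -= 1
            shifts += 1
        a[j + 1] = value
    return a, shifts


def count_inversions(a):
    """Brute-force O(n^2) count of pairs i < j with a[i] > a[j]."""
    return sum(
        1
        for i in range(len(a))
        for j in range(i + 1, len(a))
        if a[i] > a[j]
    )


# A sorted list with one element appended out of place.
data = list(range(20)) + [7]
result, shifts = insertion_sort_counting_shifts(data)
print(shifts, count_inversions(data))
```

The total work is the n - 1 outer iterations plus one shift per inversion, which is where the O(n + inversions) bound comes from.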

Answer 10:

If the elements are already sorted or there are only a few elements, it would be a perfect use case for insertion sort!

Answer 11:

As everyone else said, be careful of naive Quicksort: it can have O(N^2) performance on sorted or nearly sorted data. Nevertheless, with an appropriate algorithm for choosing the pivot (either random or median-of-three; see Choosing a Pivot for Quicksort), Quicksort will still work sanely.

In general, the difficulty with choosing algorithms such as insertion sort is in deciding when the data is sufficiently out of order that Quicksort really would be quicker.

Answer 12:

I'm not going to pretend to have all the answers here, because I think getting at the actual answers may require coding up the algorithms and profiling them against representative data samples. But I've been thinking about this question all evening, and here's what has occurred to me so far, along with some guesses about what works best where.

Let N be the total number of items and M the number that are out of order.

Bubble sort will have to make something like 2*M+1 passes through all N items. If M is very small (0, 1, 2?), I think this will be very hard to beat.

If M is small (say less than log N), insertion sort will have great average performance. However, unless there's a trick I'm not seeing, it will have very bad worst-case performance. (Right? If the last item in the order comes first, then you have to insert every single item, as far as I can see, which will kill the performance.) I'm guessing there's a more reliable sorting algorithm out there for this case, but I don't know what it is.

If M is bigger (say equal to or greater than log N), introspective sort is almost certainly best.

Exception to all of that: if you actually know ahead of time which elements are unsorted, then your best bet will be to pull those items out, sort them using introspective sort, and merge the two sorted lists together into one sorted list. If you could quickly figure out which items are out of order, this would be a good general solution as well, but I haven't been able to figure out a simple way to do this.
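As a rough illustration of the "pull out, sort, merge" idea, the sketch below uses a simple greedy rule to decide which items count as out of order: anything smaller than the last item kept so far. That rule is an assumption made here for brevity, not something this answer proposes, and it has a clear weakness: one large value early in the list makes everything after it look out of order. The extracted items are sorted with Python's built-in sort rather than introspective sort.

```python
from heapq import merge


def sort_by_extracting_outliers(items):
    """Return a sorted list by splitting off out-of-order items and merging."""
    kept, outliers = [], []
    for x in items:
        # Greedy rule (an assumption): keep x if it extends the sorted chain.
        if not kept or x >= kept[-1]:
            kept.append(x)
        else:
            outliers.append(x)
    outliers.sort()  # built-in sort stands in for introspective sort
    # heapq.merge lazily merges already-sorted inputs in linear time.
    return list(merge(kept, outliers))


print(sort_by_extracting_outliers([1, 2, 9, 3, 4, 5, 0, 6, 7]))
```

The scan costs about N comparisons, sorting the M extracted items costs O(M log M), and the final merge is linear.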

Further thoughts (overnight): If M+1 < N/M, then you can scan the list looking for a run of N/M items in a row which are sorted, and then expand that run in either direction to find the out-of-order items. That will take at most 2N comparisons. You can then sort the unsorted items, and do a sorted merge on the two lists. Total comparisons should be less than something like 4N+M log2(M), which is going to beat any non-specialized sorting routine, I think. (Even further thought: this is trickier than I was thinking, but I still think it's reasonably possible.)

Another interpretation of the question is that there may be many out-of-order items, but they are very close to where they should be in the list. (Imagine starting with a sorted list and swapping every other item with the one that comes after it.) In that case I think bubble sort performs very well: I think the number of passes will be proportional to the furthest any item is out of place. Insertion sort will work poorly, because every out-of-order item will trigger an insertion. I suspect introspective sort or something like that will work well, too.

Answer 13:

If you need a specific implementation of a sorting algorithm, a data structure, or anything related to the above, could I recommend the excellent "Data Structures and Algorithms" project on CodePlex?

It will have everything you need without reinventing the wheel.

Just my little grain of salt.

Answer 14:

The nice collection of sorting algorithms in the answers seems to lack gnome sort, which would also be suitable here and probably requires the least implementation effort.
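For reference, gnome sort fits in a few lines. A minimal Python sketch (the function name is illustrative):

```python
def gnome_sort(items):
    """Return a sorted copy using gnome sort."""
    a = list(items)
    i = 0
    while i < len(a):
        if i == 0 or a[i - 1] <= a[i]:
            i += 1  # this pair is in order: step forward
        else:
            # Out of order: swap and step back to re-check the previous pair.
            a[i - 1], a[i] = a[i], a[i - 1]
            i -= 1
    return a


print(gnome_sort([1, 2, 4, 3, 5, 6]))
```

On sorted input the loop only ever steps forward, so it finishes in a single linear pass. On unsorted input it is O(n^2), like insertion sort.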

Answer 15:

Insertion sort has a best case of O(n) on sorted input, and it stays very close to that on mostly sorted input (better than quicksort).

Answer 16:

Try heapsort. I believe it's the most consistent of the O(n lg n) sorts.

Answer 17:

Bubble sort (or, safer yet, bi-directional bubble sort) is likely ideal for mostly sorted lists, though I bet a tweaked comb sort (with a much lower initial gap size) would be a little faster when the list wasn't quite as perfectly sorted. Comb sort degrades to bubble sort.
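A short Python sketch of bi-directional bubble sort (also known as cocktail shaker sort), with an early exit as soon as a pass makes no swaps. The function name is illustrative, and the comb sort variant mentioned above is not shown.

```python
def cocktail_sort(items):
    """Return a sorted copy using bi-directional bubble sort."""
    a = list(items)
    lo, hi = 0, len(a) - 1
    swapped = True
    while swapped and lo < hi:
        # Forward pass: bubble the largest remaining item up to index hi.
        swapped = False
        for i in range(lo, hi):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                swapped = True
        hi -= 1
        if not swapped:
            break  # no swaps means the list is already sorted
        # Backward pass: bubble the smallest remaining item down to index lo.
        swapped = False
        for i in range(hi, lo, -1):
            if a[i - 1] > a[i]:
                a[i - 1], a[i] = a[i], a[i - 1]
                swapped = True
        lo += 1
    return a


print(cocktail_sort([2, 3, 4, 5, 1]))
```

The backward pass is what makes this safer than plain bubble sort: a small item stranded near the end of the list is carried to the front in one pass instead of moving one position per pass.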

Answer 18:

Well, it depends on the use case. If you know which elements have changed, removing and re-inserting them will be the best approach, as far as I am concerned.
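In Python, the remove-and-reinsert approach can be sketched with the standard `bisect` module. The helper name `update_sorted` is made up for illustration.

```python
from bisect import insort


def update_sorted(sorted_items, index, new_value):
    """Replace sorted_items[index] with new_value, keeping the list sorted.

    Modifies the list in place.
    """
    sorted_items.pop(index)          # remove the stale element
    insort(sorted_items, new_value)  # binary search, then insert


xs = [1, 3, 5, 7, 9]
update_sorted(xs, 1, 8)  # the element 3 changes to 8
print(xs)
```

Each update needs only O(log n) comparisons, although removing from and inserting into a Python list still moves O(n) elements.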

Answer 19:

Bubble sort is definitely the winner. The next one on the radar would be insertion sort.

Answer 20:

Keep away from naive QuickSort: with a simple first-or-last-element pivot it's very inefficient for pre-sorted data. Insertion sort handles almost sorted data well by moving as few values as possible.

Tags: algorithm, sorting