If I have a very large list stored in external memory that needs to be sorted. Asumimg this list is too large for internal memory, what major factors should be considered in designing an external sorting algorithm?
相关问题
- How to toggle on Order in ReactJS
- PHP Recursively File Folder Scan Sorted by Modific
- Finding k smallest elements in a min heap - worst-
- binary search tree path list
- Change sort order of strings includes with a speci
相关文章
- What are the problems associated to Best First Sea
- Coin change DP solution to keep track of coins
- Sort TreeView Automatically Upon Adding Nodes
- Algorithm for partially filling a polygonal mesh
- Robust polygon normal calculation
- Algorithm for maximizing coverage of rectangular a
- Why does array_unique sort the values?
- How to measure complexity of a string?
Before you go building your own external sort, you might look at the tools your operating system supplies. Windows has SORT.EXE, which works well enough on some text files, although it has ... idiosyncrasies. The GNU sort, too, works pretty well. You could give either of those a try on a subset of your data to see if they'll do what you need.
Otherwise . . .
The external sort is a pretty well-known algorithm. The general idea:
Assuming you have
n
items that are separated intok
blocks ofm
elements each (son = k*m
), the first part (steps 1-4) takes time proportional to k*(m log m).After completing steps 1-4, you have
k
sorted blocks ofm
items (or possiblyk-1
blocks ofm
items, and one block that has somewhat fewer items). Or, if you're sorting strings, you havek
blocks that are approximately the same size, but the number of strings in each block will vary.You now need to merge these sorted blocks. The typical way to do that is with a k-way merge.
You create a min-heap that contains the first item from each block. You then select the root item from the heap, which is the smallest item of all the blocks. You output that as the first item. Then, read the next item from the block that the smallest came from, and place it on the heap. That is:
This part of the algorithm is O(n log k), where
n
is the total number of items andk
is the number of blocks.As somebody else pointed out, one key to efficient external sorting is to reduce I/O. External storage is slow. The algorithm I described above does as little I/O as possible. Each item is read from external storage twice, and each item is written to external storage twice. Other algorithms that look simpler or faster at first blush end up being much slower when working with real data because they spend too much time doing I/O.
If you're interested in an implementation, I wrote a series of articles some time back about sorting a very large text file. The code is C#, but the description should allow you to translate to any language with little trouble.