Question:
I have a very large list stored in external memory that needs to be sorted. Assuming this list is too large for internal memory, what major factors should be considered in designing an external sorting algorithm?
Answer 1:
Before you go building your own external sort, you might look at the tools your operating system supplies. Windows has SORT.EXE, which works well enough on some text files, although it has ... idiosyncrasies. The GNU sort, too, works pretty well. You could give either of those a try on a subset of your data to see if they'll do what you need.
Otherwise . . .
The external sort is a pretty well-known algorithm. The general idea:
- Load as much of the data as possible into memory.
- Sort that block.
- Write that block to external memory.
- Repeat steps 1-3 until all blocks have been sorted and stored.
- Merge the sorted blocks.
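The run-generation phase (steps 1-4) can be sketched as below. This is a minimal Python illustration, not the C# implementation the answer refers to; the one-integer-per-line temporary-file format and the function names are assumptions made for the example.

```python
import os
import tempfile

def make_sorted_runs(items, run_size):
    """Split 'items' into chunks of at most run_size elements, sort each
    chunk in memory, and write each sorted chunk (a "run") to its own
    temporary file, one integer per line. Returns the list of file paths.
    (run_size stands in for "as much data as fits in memory".)"""
    paths = []
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == run_size:
            paths.append(_write_run(sorted(chunk)))
            chunk = []
    if chunk:  # the last run may have fewer than run_size items
        paths.append(_write_run(sorted(chunk)))
    return paths

def _write_run(sorted_chunk):
    """Write one sorted run to a temp file and return its path."""
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as f:
        for item in sorted_chunk:
            f.write(f"{item}\n")
    return path
```

With a 7-item input and run_size 3, this produces two full runs and one short one, matching the k-1 full blocks plus one smaller block described below.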
Assuming you have n items that are separated into k blocks of m elements each (so n = k*m), the first part (steps 1-4) takes time proportional to k*(m log m).
After completing steps 1-4, you have k sorted blocks of m items (or possibly k-1 blocks of m items, and one block that has somewhat fewer items). Or, if you're sorting strings, you have k blocks that are approximately the same size, but the number of strings in each block will vary.
You now need to merge these sorted blocks. The typical way to do that is with a k-way merge.
You create a min-heap that contains the first item from each block. You then select the root item from the heap, which is the smallest item of all the blocks. You output that as the first item. Then, read the next item from the block that the smallest came from, and place it on the heap. That is:
create heap
for each block
    read first item from block and add to heap
end for
while heap is not empty
    remove smallest item from heap
    write to output
    read next item from the block that contained the smallest item
    if that block is not exhausted
        add the item to the heap
    end if
end while
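The pseudocode above translates to a short heap-based merge. A minimal Python sketch follows, assuming the items are comparable and that None is never a valid item (it is used as the end-of-block sentinel); in real Python code, the standard library's heapq.merge already does exactly this.

```python
import heapq

def kway_merge(blocks):
    """Merge already-sorted iterables using a min-heap.
    The heap holds (value, block_index) pairs; the index breaks ties
    between equal values and says which block to refill from."""
    iters = [iter(b) for b in blocks]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)       # seed the heap with each block's head
        if first is not None:
            heapq.heappush(heap, (first, i))
    out = []
    while heap:
        value, i = heapq.heappop(heap)   # smallest remaining item overall
        out.append(value)
        nxt = next(iters[i], None)       # refill from the same block
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))
    return out
```

In the real external sort, `blocks` would be readers over the sorted run files and `out` would be the final output file rather than an in-memory list.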
This part of the algorithm is O(n log k), where n is the total number of items and k is the number of blocks.
As somebody else pointed out, one key to efficient external sorting is to reduce I/O. External storage is slow. The algorithm I described above does as little I/O as possible. Each item is read from external storage twice, and each item is written to external storage twice. Other algorithms that look simpler or faster at first blush end up being much slower when working with real data because they spend too much time doing I/O.
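One practical way to keep per-item I/O cheap during the merge is to read each run file in large chunks rather than one item at a time. The sketch below is illustrative only; the BufferedRunReader class, its buffer_lines parameter, and the one-integer-per-line run format are assumptions made for this example, not part of the original answer.

```python
import itertools

class BufferedRunReader:
    """Read a sorted run file in chunks of buffer_lines lines, so that
    refilling the merge heap does not hit the disk for every single item.
    Returns None from next_item() when the run is exhausted; do not call
    next_item() again after that."""
    def __init__(self, path, buffer_lines=1024):
        self._file = open(path)
        self._buffer_lines = buffer_lines
        self._buffer = []
        self._pos = 0

    def next_item(self):
        if self._pos >= len(self._buffer):
            # Refill: pull up to buffer_lines lines in one pass.
            self._buffer = [int(line) for line in
                            itertools.islice(self._file, self._buffer_lines)]
            self._pos = 0
            if not self._buffer:
                self._file.close()
                return None
        item = self._buffer[self._pos]
        self._pos += 1
        return item
```

The same idea applies on the output side: collect merged items in a buffer and write them in large batches.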
If you're interested in an implementation, I wrote a series of articles some time back about sorting a very large text file. The code is C#, but the description should allow you to translate to any language with little trouble.