Suppose I have X GB of RAM available and I need to sort a huge array of data that is much larger than all the available memory (it is stored on the hard drive). Could you give a hint as to how that could be accomplished?
What you are looking for is an external sort.
http://en.wikipedia.org/wiki/External_sorting
Let us assume you have a huge file H and a memory limit of M values.
I have two solutions.
Solution 1:
step 1: Read M values from file H into a temporary buffer.
step 2: Sort the values (use an in-place sorting algorithm such as quicksort or heapsort).
step 3: Create a temp file, write the buffer into it, and save the name of the temp file.
step 4: Repeat steps 1 to 3 until file H has been read completely and every chunk of M values has been sorted and saved to its own temp file.
step 5: Merge all the temp files into a new file: create the output file, open all the temp files, and put the file handles into an array.
step 6: Find the minimum of the numbers currently pointed to by the file read pointers. You can do this the straightforward way: compare every number and remember the smallest in a temp variable (time complexity O(n) per extraction). Or you can build a priority queue together with a map whose keys are the numbers and whose values are the corresponding file handles (time complexity O(log n) per extraction; this uses more memory but performs better). To save memory, you can replace the list of file handles with a bitmap packed into an int, because the temp file names are sequential.
step 7: Write that number to the new file.
step 8: Read the next number from the temp file that contained the minimum found in step 6.
step 9: Repeat steps 7 and 8 until all numbers from all the temp files have been processed.
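The steps above can be sketched in Python. This is a hedged sketch, not a definitive implementation: it assumes the input file holds one integer per line, and the function names are invented for illustration. `heapq.merge` plays the role of the priority queue from step 6.

```python
# Sketch of Solution 1: external merge sort.
# Assumption: input file has one integer per line.
import heapq
import os
import tempfile

def split_into_sorted_runs(input_path, max_numbers_in_memory):
    """Read chunks of at most max_numbers_in_memory values (steps 1-4):
    sort each chunk in memory and write it to its own temp file.
    Returns the list of temp file paths."""
    run_paths = []
    with open(input_path) as f:
        while True:
            # zip with a range caps how many lines we pull into memory
            chunk = [int(line) for _, line in zip(range(max_numbers_in_memory), f)]
            if not chunk:
                break
            chunk.sort()  # in-memory sort of one run (step 2)
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(f"{n}\n" for n in chunk)
            run_paths.append(path)
    return run_paths

def merge_runs(run_paths, output_path):
    """k-way merge of the sorted runs (steps 5-9) using a min-heap
    via heapq.merge, which streams lazily so memory stays bounded."""
    files = [open(p) for p in run_paths]
    try:
        streams = [(int(line) for line in f) for f in files]
        with open(output_path, "w") as out:
            out.writelines(f"{n}\n" for n in heapq.merge(*streams))
    finally:
        for f in files:
            f.close()
```

`heapq.merge` keeps only one value per run in the heap at a time, so the merge phase needs memory proportional to the number of runs, not to the data size.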
Solution 2:
step 1: Create N temp files, each covering a different range of values. (For example, temp_file_1 covers 0 to 1000000, the next temp file covers 1000001 to 2000000, and so on.)
step 2: Read numbers from file H and write each one into the temp file whose range contains it, until nothing is left to read from file H.
step 3: Sort every temp file.
step 4: Create a new file and append the sorted temp files to it one by one (since the ranges are disjoint, no merging is needed).
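Solution 2 can be sketched similarly, again assuming one integer per line; the bucket width matches the 1000000-wide ranges in the example above, and the function names are made up:

```python
# Sketch of Solution 2: partition by value range, then sort each bucket.
# Assumption: input file has one integer per line.
import os
import tempfile

BUCKET_WIDTH = 1_000_000  # matches the example ranges above

def partition_by_range(input_path):
    """Route each number to the temp file covering its range (steps 1-2).
    Buckets are created lazily as values for their range appear.
    Returns the temp file paths in ascending range order."""
    buckets = {}  # bucket index -> (path, open file handle)
    with open(input_path) as f:
        for line in f:
            n = int(line)
            idx = n // BUCKET_WIDTH
            if idx not in buckets:
                fd, path = tempfile.mkstemp(text=True)
                buckets[idx] = (path, os.fdopen(fd, "w"))
            buckets[idx][1].write(f"{n}\n")
    for _, fh in buckets.values():
        fh.close()
    return [path for _, (path, _) in sorted(buckets.items())]

def sort_and_concatenate(bucket_paths, output_path):
    """Sort each bucket in memory and append it to the output (steps 3-4).
    The ranges are disjoint, so plain concatenation yields a sorted file."""
    with open(output_path, "w") as out:
        for path in bucket_paths:
            with open(path) as f:
                nums = sorted(int(line) for line in f)
            out.writelines(f"{n}\n" for n in nums)
```

Note that this version assumes each bucket fits in memory for step 3; with a badly skewed distribution, an overfull bucket would itself need to be split or sorted externally.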
What is the difference between the solutions? In solution 1, every number is sorted once but compared many times during the merge (steps 5 to 9), while each number is read and written only twice. In solution 2 there is no merge phase, but every single number is read and written three times.
Practically, if you don't want to write too much code and disk usage is not an issue, put your data into a database with a proper index and then just
SELECT * FROM ... ORDER BY ...
from there.

You are looking for external sorting. The largest cost in these scenarios is often disk I/O, so the trick is to use an algorithm that minimises disk I/O. The usual approach is to read suitably large chunks into memory, sort those chunks, save them back to disk, and then merge the sorted chunks.
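As a rough illustration of the database shortcut, here is a sketch using Python's built-in sqlite3 module (the table and column names are invented). SQLite keeps the table on disk, and the index lets ORDER BY stream rows in order rather than sorting everything in memory:

```python
# Sketch of the "let the database sort it" approach with sqlite3.
# The schema (table "data", column "value") is made up for this example.
import sqlite3

def sorted_via_sqlite(numbers, db_path):
    """Load numbers into an indexed on-disk SQLite table,
    then stream them back in ascending order."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS data (value INTEGER)")
    con.executemany("INSERT INTO data (value) VALUES (?)",
                    ((n,) for n in numbers))
    con.execute("CREATE INDEX IF NOT EXISTS idx_value ON data (value)")
    con.commit()
    try:
        # The index on value lets SQLite walk rows in sorted order.
        for (value,) in con.execute("SELECT value FROM data ORDER BY value"):
            yield value
    finally:
        con.close()
```

For truly huge datasets you would load in batches rather than from a single in-memory iterable, but the division of labour is the same: the database handles the external sorting for you.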
A search for "external sorting" or "sort merge" and your technologies of choice should give some good results.
I would imagine you could construct some kind of index system into which you can separate your sorted data.
Imagine shelves in a library. Go through 1/x of the data, sorting the elements onto the shelves, and cache each shelf to an individual file on disk. Then sort the next 1/x of the data, write it back to disk, etc. Once you have all your data in shelves, go through and sort each shelf individually, and finally merge them into one nicely sorted file.
You can sort many huge files (the sorted result can be terabytes and bigger) with ZZZServer; it is free for non-commercial use:
After sorting, the result is saved in
Your input files must be encoded in UTF-8 or ASCII format!
ZZZServer uses about 1 MB of RAM when sorting big files!