The UNIX sort command can sort a very large file like this:
sort large_file
How is the sort algorithm implemented?
Why doesn't it consume an excessive amount of memory?
WARNING: This script starts one shell per chunk; for really large files, that could be hundreds of shells.
Here is a script I wrote for this purpose. On a 4-processor machine it improved sort performance by 100%!
See also: "Sorting large files faster with a shell script"
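The script itself is not reproduced above, but a minimal sketch of the chunk-per-shell approach it describes (the file names and chunk size here are placeholders, not the original code) would look something like this:

    #!/bin/sh
    # Sketch only: split the input, sort each chunk in its own background shell,
    # then merge the sorted chunks. One background sort process is started per
    # chunk, which is what the warning above refers to.
    split -l 1000000 large_file chunk_            # cut input into ~1M-line chunks
    for f in chunk_*; do
        sort "$f" -o "$f.sorted" &                # one shell/process per chunk
    done
    wait                                          # wait for every chunk sort to finish
    sort -m chunk_*.sorted -o large_file.sorted   # merge the already-sorted chunks
    rm -f chunk_*                                 # clean up intermediate files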
Look carefully at the options of sort to speed up performance and understand their impact on your machine and problem. Key parameters on Ubuntu (GNU sort) include --buffer-size (-S) and --temporary-directory (-T).
The questioner asks "Why no high memory usage?" The answer to that comes from history: older UNIX machines were small, so the default memory buffer is set small. Make this buffer as large as your workload allows to vastly improve sort performance. Point the temporary directory (where sort writes its work files) at a place on your fastest device that has enough space to hold at least 1.25 times the size of the file being sorted.
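For example (the buffer size and temporary path below are placeholders for your own available RAM and fastest disk):

    # give sort a 4 GiB in-memory buffer and put its spill files on a fast disk
    sort -S 4G -T /fast/tmp large_file -o large_file.sorted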
I'm not familiar with the program's internals, but I would guess it is done by means of external sorting: most of the data is held in temporary files, while a relatively small part of it is held in memory at a time. See Donald Knuth's The Art of Computer Programming, Vol. 3: Sorting and Searching, Section 5.4 for a very in-depth discussion of the subject.
The Algorithmic details of UNIX Sort command says Unix sort uses an external R-way merge sorting algorithm. The link goes into more detail, but in essence it divides the input into smaller portions that fit into memory, sorts each portion, and then merges the sorted portions together at the end.
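To make that concrete, the same external merge can be reproduced by hand with standard tools; the chunk size and file names below are only illustrative:

    split -l 500000 large_file part_            # 1) cut the input into pieces that fit in memory
    for f in part_*; do sort "$f" -o "$f"; done # 2) sort each piece on its own
    sort -m part_* -o large_file.sorted         # 3) R-way merge of the already-sorted pieces
    rm -f part_*                                # clean up the pieces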
Memory should not be a problem - sort already takes care of that. If you want to make optimal use of your multi-core CPU, I have implemented this in a small script (similar to some you might find on the net, but simpler/cleaner than most of those ;)).
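That script is not shown here, but a minimal sketch of the idea, assuming GNU split, xargs and sort and using placeholder file names, could be:

    #!/bin/sh
    # Sketch: bound the number of concurrent sorts to the number of CPU cores.
    N=$(nproc)
    split -n l/"$N" large_file piece_            # one piece per core, split on line boundaries
    printf '%s\n' piece_* | xargs -P "$N" -I{} sort {} -o {}.sorted   # sort the pieces concurrently
    sort -m piece_*.sorted -o large_file.sorted  # cheap sequential merge at the end
    rm -f piece_*                                # clean up intermediate files

Note that recent GNU coreutils versions of sort also have a built-in --parallel=N option, which gives you much of this benefit without a wrapper script.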