How can I get n random lines from very large files that can't fit in memory?
Also, it would be great if I could add filters before or after the randomization.
Update 1
In my case the specs are:
- > 100 million lines
- > 10GB files
- usual random batch size 10000-30000
- 512 MB RAM hosted Ubuntu Server 14.10
So losing a few lines from the file won't be much of a problem, as they have a 1 in 10000 chance anyway, but performance and resource consumption would be a problem.
Under such constraints, the following approach is a better fit.
For this you need a tool that can seek in files, for example perl.
Save the above into a file such as "randlines.pl" and use it as:
e.g.
The script does very low-level IO operations, i.e. it is VERY FAST (on my notebook, selecting 30k lines from 10M took half a second).
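To illustrate the seek idea without Perl, here is a rough shell sketch. It is not the script from this answer. The function name seek_randlines is made up, it assumes GNU shuf is available, and it relies on tail -c +K being able to jump to a byte offset in a regular file (which GNU tail does, as far as I know). It starts one tail and one sed per sampled line, so expect it to be much slower than a single Perl process, but it never holds more than a line or two in memory.

```shell
# Hypothetical helper: print up to N lines found at random byte offsets in FILE.
# usage: seek_randlines N FILE
seek_randlines() {
    n=$1 file=$2
    size=$(( $(wc -c < "$file") ))      # file size in bytes
    # shuf -i LO-HI -n N prints N distinct random integers from the range.
    shuf -i "1-$size" -n "$n" | while read -r offset; do
        # tail -c +K starts output at byte K. The first line seen is usually
        # a partial one, so sed skips it and prints only the next full line.
        tail -c "+$offset" "$file" | sed -n '2{p;q;}'
    done
}
```

Caveats of this sketch: the first line of the file can never be picked, a line that follows a long line is more likely to be picked, the same line can come out twice, and an offset that lands in the last line prints nothing. With the "losing a few lines is fine" requirement that may be acceptable.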
Here's a wee bash function for you. It grabs, as you say, a "batch" of lines, with a random start point within a file.
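A minimal sketch of what such a function could look like (not the original code; the name randblock and its argument order are made up for illustration):

```shell
# Hypothetical sketch: print COUNT consecutive lines of FILE,
# starting at a randomly chosen line.
# usage: randblock FILE COUNT
randblock() {
    local file=$1 count=$2 total start
    total=$(( $(wc -l < "$file") ))
    # Combine two 15-bit $RANDOM values so files longer than 32768 lines
    # can be covered (up to 32768^2 possible start points).
    start=$(( (RANDOM * 32768 + RANDOM) % total + 1 ))
    echo "total lines: $total" >&2
    echo "start line:  $start" >&2
    # If start is within COUNT lines of the end of the file, this prints
    # the last COUNT lines instead of a block beginning at start.
    head -n "$(( start + count - 1 ))" "$file" | tail -n "$count"
}
```

The echo lines write to stderr, so they don't end up in the sampled output when you redirect stdout.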
Edit the echo lines as required.
This solution has the advantage of fewer pipes, less resource-intensive pipes (i.e. no | sort ... |), and less platform dependence (i.e. no sort -R, which is GNU-sort-specific).
Note that this relies on Bash's $RANDOM variable, which may or may not actually be random. Also, it will miss lines if your source file contains more than 32768^2 lines, and there's a failure edge case if the number of lines you've specified (N) is >1 and the random start point is less than N lines from the beginning. Solving that is left as an exercise for the reader. :)
UPDATE #1:
mklement0 asks an excellent question in comments about potential performance issues with the head ... | tail ... approach. I honestly don't know the answer, but I would hope that both head and tail are optimized sufficiently that they wouldn't buffer ALL input prior to displaying their output.
On the off chance that my hope is unfulfilled, here's an alternative. It's an awk-based "sliding window" tail. I'll embed it in the earlier function I wrote so you can test it if you want.
The embedded awk script replaces the head ... | tail ... pipeline in the previous version of the function. It works as follows:
The result is that the awk process shouldn't grow its memory footprint, because the output array gets trimmed as fast as it's built.
NOTE: I haven't actually tested this with your data.
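For reference, a sliding-window tail in awk might look roughly like the sketch below. This is an illustration of the technique, not the script from this answer; window_tail and its arguments are hypothetical. It stores each line in an array keyed by line number, deletes the entry that just fell out of the window, and when it reaches the chosen end line it prints whatever is left in the window and exits.

```shell
# Hypothetical sketch: print the N lines of FILE that end at line END,
# keeping at most N lines in memory at any time.
# usage: window_tail END N FILE
window_tail() {
    awk -v end="$1" -v n="$2" '
        {
            buf[NR] = $0          # remember the current line
            delete buf[NR - n]    # drop the line that left the window
        }
        NR == end {
            for (i = NR - n + 1; i <= NR; i++)
                if (i in buf) print buf[i]
            exit                  # stop reading the rest of the file
        }
    ' "$3"
}
```

For example, window_tail 7 3 FILE prints lines 5, 6 and 7. If END is smaller than N, fewer than N lines come out, which is the edge case mentioned earlier.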
UPDATE #2:
Hrm, with the update to your question that you want N random lines rather than a block of lines starting at a random point, we need a different strategy. The system limitations you've imposed are pretty severe. The following might be an option, also using awk, with random numbers still from Bash:
This works by feeding a list of random line numbers into awk as a "first" file, then having awk print lines from the "second" file whose line numbers were included in the "first" file. It uses wc to determine the upper limit of the random numbers to generate. That means you'll be reading this file twice. If you have another source for the number of lines in the file (a database for example), do plug it in here. :)
A limiting factor might be the size of that first file, which must be loaded into memory. I believe that the 30000 random numbers should only take about 170KB of memory, but how the array gets represented in RAM depends on the implementation of awk you're using. (Usually, though, awk implementations, including Gawk in Ubuntu, are pretty good at keeping memory wastage to a minimum.)
Does this work for you?
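A sketch of the two-file awk idea described above, under my own assumptions (the name randlines_awk is hypothetical and this is not the original function): the random line numbers arrive on stdin as the "first" file, and the data file is the "second".

```shell
# Hypothetical sketch: print up to N randomly chosen lines of FILE.
# usage: randlines_awk FILE N
randlines_awk() {
    local file=$1 n=$2 total i
    total=$(( $(wc -l < "$file") ))     # first pass over the file
    i=0
    while [ "$i" -lt "$n" ]; do
        # Two 15-bit $RANDOM values combined into one line number.
        echo $(( (RANDOM * 32768 + RANDOM) % total + 1 ))
        i=$(( i + 1 ))
    done | awk '
        NR == FNR { want[$1]; next }    # "first" file: wanted line numbers
        (FNR in want)                   # "second" file: print wanted lines
    ' - "$file"
}
```

Two things to keep in mind with this sketch: duplicate random numbers collapse into one array key, so you can get slightly fewer than N lines, and the lines come out in file order rather than in random order.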
usage:
get a random sample of 1000 lines
line has numbers
no mike and jane
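To show where a filter can sit relative to the sampling step, here is a small sketch. These are not the commands from this answer: the function names are made up, and GNU shuf is swapped in as the sampler purely as an assumption for illustration.

```shell
# Hypothetical helpers, usage: <name> FILE N

# Filter first, then sample: up to N lines, all of which contain a digit.
sample_numbered() {
    grep '[0-9]' "$1" | shuf -n "$2"
}

# Sample first, then filter: drops sampled lines mentioning mike or jane,
# so the result may contain fewer than N lines.
sample_without_names() {
    shuf -n "$2" "$1" | grep -v -i -e 'mike' -e 'jane'
}
```

Filtering before sampling keeps the batch size predictable but pushes the whole file through grep; filtering after sampling is cheaper but shrinks the batch.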
I've used rl for line randomisation and found it to perform quite well. Not sure how it scales to your case (you'd simply do e.g. rl FILE | head -n NUM). You can get it here: http://arthurdejong.org/rl/
Simple (but slow) solution
or, if you want, save the following into a randlines script and use it as:
How it works:
The sort -R sorts the file by a random hash calculated for each line, so you get a randomised order of lines, and the first N lines are therefore random lines.
Because the hashing produces the same hash for identical lines, duplicate lines are not treated as different. It is possible to work around this by adding the line number to each line (with nl), so that sort never sees an exact duplicate. After the sort, the added line numbers are removed again.
example:
prints in subsequent runs:
demo with duplicate lines:
in subsequent runs print:
Finally, if you want, you could use sed instead of cut:
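As a rough sketch of the sort -R approach described in this answer (not the original script; the function names are hypothetical, and sort -R is assumed to be GNU sort or a compatible implementation):

```shell
# Hypothetical sketches, usage: <name> N FILE

# Simple variant: identical lines hash to the same key and end up adjacent.
randlines_simple() {
    sort -R "$2" | head -n "$1"
}

# Duplicate-safe variant: nl -ba numbers every line (including empty ones),
# so no two sort inputs are identical; cut then removes the number and tab.
randlines_nodup() {
    nl -ba "$2" | sort -R | head -n "$1" | cut -f2-
    # sed alternative for the last step:
    #   ... | sed 's/^ *[0-9]*[[:space:]]//'
}
```

Both variants push the whole file through sort, which is why this is the slow option for a 10GB input.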