I'm familiar with the algorithm for reading a single random line from a file without reading the whole file into memory. I wonder if this technique can be extended to N random lines?
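For reference, here's a minimal Perl sketch of that single-line algorithm: keep the i-th line read with probability 1/i, and whatever survives the last line is a uniform pick.

#!/usr/bin/perl -lw

my $Words = "/usr/share/dict/words";

open my $fh, "<", $Words or die $!;

my $pick;
while (my $line = <$fh>) {
    # Keep line i with probability 1/i ($. is Perl's current line
    # number).  When the loop ends, $pick is uniformly distributed
    # over all the lines.
    $pick = $line if rand($.) < 1;
}

chomp $pick;
print $pick;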
The use case is a password generator which concatenates N random words pulled out of a dictionary file, one word per line (like /usr/share/dict/words). You might come up with angela.ham.lewis.pathos. Right now it reads the whole dictionary file into an array and picks N random elements from that array. I would like to eliminate the array, or any other in-memory storage of the file, and only read the file once.
(No, this isn't a practical optimization exercise. I'm interested in the algorithm.)
Update: Thank you all for your answers.
Answers fell into three categories: modifications of the full-read algorithm, random seeking, or indexing the lines and seeking to them randomly.
The random seek is much faster, and constant with respect to file size, but it distributes by file position rather than by word: a word's chance of being picked is proportional to the length of the line before it. It also allows duplicates (they can be avoided, but then the worst-case running time is unbounded). Here's my reimplementation of my password generator using that algorithm. Because it reads forward from the seek point rather than backward, a seek that lands in the last line finds no next line to read; the version below handles that off-by-one case by simply retrying.
#!/usr/bin/perl -lw

my $Words      = "/usr/share/dict/words";
my $Max_Length = 8;
my $Num_Words  = 4;

my $size = -s $Words;

my @words;
open my $fh, "<", $Words or die $!;

for (1 .. $Num_Words) {
    # Seek to a random byte, toss the partial line, take the next one.
    seek $fh, int rand $size, 0 or die $!;
    <$fh>;
    my $word = <$fh>;

    # EOF: the seek landed in the last line, so there is no next
    # line to read.  Try again with a fresh seek.
    redo unless defined $word;

    chomp $word;
    redo if length $word > $Max_Length;
    push @words, $word;
}

print join ".", @words;
And then there's Guffa's answer, which was what I was looking for: an extension of the original algorithm (reservoir sampling) to N lines. It's slower, since it has to read the whole file, but it distributes by word, allows filtering without changing the efficiency of the algorithm, and has no duplicates: each line of the file can occupy at most one slot in the reservoir.
#!/usr/bin/perl -lw

my $Words      = "/usr/share/dict/words";
my $Max_Length = 8;
my $Num_Words  = 4;

my @words;
open my $fh, "<", $Words or die $!;

my $count = 0;
while (my $line = <$fh>) {
    chomp $line;
    $count++;
    if ($count <= $Num_Words) {
        # Fill the reservoir with the first N lines.
        $words[$count - 1] = $line;
    }
    elsif (rand($count) <= $Num_Words) {
        # With probability N/count, replace a random reservoir slot.
        $words[rand($Num_Words)] = $line;
    }
}

print join ".", @words;
Finally, the index-and-seek algorithm has the advantage of distributing by word rather than by file size. The disadvantage is that it reads the whole file, and its memory usage scales linearly with the number of words in the file. Since it costs as much I/O as Guffa's algorithm while using more memory, you might as well use Guffa's algorithm.
I'd say:

- Read the file and count the number of \n characters. That's the number of lines - let's call that L.
- Get two random numbers lower than L, fetch their offsets and you're done.

You'd use just a small array and read the whole file once + 2 lines afterwards.
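A minimal Perl sketch of one way to realize that, re-scanning for the chosen line numbers on a second pass rather than storing offsets (note that if two random picks collide, you get fewer than the requested number of words):

#!/usr/bin/perl -lw

my $Words     = "/usr/share/dict/words";
my $Num_Words = 4;

# Pass 1: count the lines; call that L.
open my $fh, "<", $Words or die $!;
my $L = 0;
$L++ while <$fh>;
close $fh;

# Pick random line numbers below L.  Hash keys collapse duplicates.
my %want;
$want{ int rand $L } = 1 for 1 .. $Num_Words;

# Pass 2: collect just those lines ($. resets on the reopen).
open $fh, "<", $Words or die $!;
my @words;
while (my $line = <$fh>) {
    next unless $want{ $. - 1 };
    chomp $line;
    push @words, $line;
}
print join ".", @words;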
You could do a two-pass algorithm. First get the position of each newline, pushing those positions into a vector. Then pick a random item in that vector; call this i. Read from the file at position v[i] to v[i+1] to get your line. During the first pass, read the file with a small buffer so that you never hold the whole thing in RAM at once.
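A rough Perl sketch of that, using a fixed-size read buffer for the first pass (duplicate picks are still possible here, since the same index can be drawn twice):

#!/usr/bin/perl -lw

my $Words     = "/usr/share/dict/words";
my $Num_Words = 4;

open my $fh, "<", $Words or die $!;

# Pass 1: scan in fixed-size chunks, recording the byte offset of
# every line start; the file itself never sits in memory.
my @offsets = (0);
my $pos = 0;
while (my $read = read $fh, my $buf, 64 * 1024) {
    push @offsets, $pos + pos($buf) while $buf =~ /\n/g;
    $pos += $read;
}
pop @offsets if $offsets[-1] == $pos;   # a trailing newline starts no line

# Pass 2: seek straight to randomly chosen line starts.
my @words;
for (1 .. $Num_Words) {
    seek $fh, $offsets[int rand @offsets], 0 or die $!;
    my $word = <$fh>;
    chomp $word;
    push @words, $word;
}
print join ".", @words;

The index costs one integer per line, which is the linear memory usage mentioned in the update above.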