How do I read N random lines out of a file without reading the whole file into memory?

Asked 2019-03-13 09:47

I'm familiar with the algorithm for reading a single random line from a file without reading the whole file into memory. I wonder if this technique can be extended to N random lines?
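
For reference, a minimal sketch of that single-line algorithm (the same one-liner shows up in the answers below):

# One-pass reservoir sample of size 1: after reading line $.,
# $line is uniform over all lines seen so far.
my $line;
while (<>) {
    $line = $_ if rand($.) < 1;
}
print $line;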

The use case is for a password generator which concatenates N random words pulled out of a dictionary file, one word per line (like /usr/share/dict/words). You might come up with angela.ham.lewis.pathos. Right now it reads the whole dictionary file into an array and picks N random elements from that array. I would like to eliminate the array, or any other in-memory storage of the file, and only read the file once.

(No, this isn't a practical optimization exercise. I'm interested in the algorithm.)

Update: Thank you all for your answers.

Answers fell into three categories: modifications of the full-read algorithm, random seeking, and indexing the lines and then seeking to them at random.

The random seek is much faster, and constant with respect to file size, but it distributes by byte position rather than by word, so longer words are more likely to be picked. It also allows duplicates (avoidable, but that makes the algorithm's worst case unbounded). Here's my reimplementation of my password generator using that algorithm. Because it reads forward from the seek point rather than backwards, a seek that lands in the last line runs off the end of the file; the version below simply retries in that case.

#!/usr/bin/perl -lw

my $Words       = "/usr/share/dict/words";
my $Max_Length  = 8;
my $Num_Words   = 4;

my $size = -s $Words;

my @words;
open my $fh, "<", $Words or die $!;

for(1..$Num_Words) {
    seek $fh, int rand $size, 0 or die $!;
    <$fh>;                      # discard the (probably partial) line we landed in
    my $word = <$fh>;           # take the next full line
    redo unless defined $word;  # the seek landed in the last line; try again
    chomp $word;
    redo if length $word > $Max_Length;
    push @words, $word;
}
print join ".", @words;
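
To also rule out duplicates (the avoidable-but-unbounded retry mentioned above), one possible tweak is to reject a word that has already been picked. A minimal sketch, assuming it goes right before the push:

redo if grep { $_ eq $word } @words;   # reject repeats; worst case is now unbounded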

And then there's Guffa's answer, which was what I was looking for: an extension of the original algorithm. It's slower, since it has to read the whole file, but it distributes by word, allows filtering without changing the efficiency of the algorithm, and (I think) produces no duplicates.

#!/usr/bin/perl -lw

my $Words       = "/usr/share/dict/words";
my $Max_Length  = 8;
my $Num_Words   = 4;

my @words;
open my $fh, "<", $Words or die $!;
my $count = 0;
while(my $line = <$fh>) {
    chomp $line;
    $count++;
    if( $count <= $Num_Words ) {
        # Fill the reservoir with the first N lines.
        $words[$count-1] = $line;
    }
    elsif( rand($count) <= $Num_Words ) {
        # Keep line $count with probability N/$count,
        # replacing a random slot in the reservoir.
        $words[int rand $Num_Words] = $line;
    }
}

print join ".", @words;

Finally, the index-and-seek algorithm has the advantage of distributing by word rather than by file size. The disadvantage is that it reads the whole file, and its memory usage scales linearly with the number of words in the file; might as well use Guffa's algorithm.
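
For completeness, here is a minimal sketch of that approach, reusing $Words and $Num_Words from the scripts above: one pass to record each line's byte offset, then N random seeks (duplicates are again possible unless used offsets are tracked).

open my $fh, "<", $Words or die $!;

# One pass to index the byte offset of every line.
my @offsets;
while (1) {
    push @offsets, tell $fh;
    last unless <$fh>;
}
pop @offsets;   # the final tell() is EOF, not the start of a line

# N random seeks into the index.
my @words;
for (1..$Num_Words) {
    seek $fh, $offsets[int rand @offsets], 0 or die $!;
    my $word = <$fh>;
    chomp $word;
    push @words, $word;
}
print join ".", @words;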

8 Answers

Answer 2 — 2019-03-13 10:10

If you don't need to do it within the scope of Perl, shuf is a really nice command-line utility for this. To do what you're looking to do:

$ shuf -n N file > newfile
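
For the password use case specifically, something along these lines joins the picks with dots (assuming GNU coreutils; shuf -n samples without repetition by default, so there are no duplicate words):

$ shuf -n 4 /usr/share/dict/words | paste -s -d .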

Answer 3 — 2019-03-13 10:12

This is just about the first time I have seen Perl code ... it is incredibly unreadable ... ;) But that should not matter. Why don't you just repeat the cryptic line N times?

If I had to write this, I would just seek to a random position in the file, read to the end of the line (the next newline), and then read one line up to the next newline. Add some error handling in case you just seeked into the last line, repeat all this N times, and you are done. I guess

srand;
rand($.) < 1 && ($line = $_) while <>;

is the Perl way to do such a single step. You could also read backwards from the initial position up to the previous newline or the beginning of the file, and then read a line forward again. But this doesn't really matter.

UPDATE

I have to admit that seeking somewhere into the file will not generate a perfectly uniform distribution, because of the differing line lengths. Whether this fluctuation matters depends on the usage scenario, of course.

If you need a perfectly uniform distribution, you need to read the whole file at least once to get the number of lines. In that case the algorithm given by Guffa is probably the cleverest solution, because it requires reading the file exactly once.

Answer 4 — 2019-03-13 10:14

Quick and dirty bash

function randomLine {
  numlines=$(wc -l < "$1")                        # line count, without the filename
  a=$(( ($(date +%s) + RANDOM) % numlines + 1 ))  # crude pseudo-random line number, 1..numlines
  RETURN=$(head -n "$a" "$1" | tail -n 1)         # pull out that line
  return 0
}

randomLine test.sh
echo "$RETURN"

Answer 5 — 2019-03-13 10:16

Now, my Perl is not what it used to be, but trusting the implicit claim in your reference (that the distribution of line numbers selected this way is uniform), it seems this should work:

srand;
(rand($.) < 1 && ($line1 = $_)), (rand($.) < 1 && ($line2 = $_)) while <>;

Just like the original algorithm, this is one-pass and constant memory.

Edit: I just realized you need N, and not 2. You can repeat the comma-separated expression N times if you know N in advance.
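
If N is only known at run time, the same idea can be written as a loop; a hedged sketch (each slot is an independent size-1 reservoir, so duplicate picks remain possible):

my $N = 4;                  # assumed value; use whatever N you need
my @lines;
while (<>) {
    for my $i (0 .. $N - 1) {
        $lines[$i] = $_ if rand($.) < 1;   # slot $i keeps line $. with probability 1/$.
    }
}
print @lines;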

Answer 6 — 2019-03-13 10:17

Pick a random point in the file, look backwards for the previous EOL, search forward for the next EOL, and return the line.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE_LENGTH 256

/* Returns a malloc'd copy of a pseudo-randomly chosen line (caller frees),
   or NULL on error. Longer lines are more likely to be picked. */
char *random_line(const char *path) {
    FILE *file = fopen(path, "rb");
    if (!file) return NULL;

    fseek(file, 0, SEEK_END);
    long fs = ftell(file);               /* file size */
    long ptr = rand() % fs;              /* random offset 0..fs-1; assumes srand() was called */

    long start = ptr - MAX_LINE_LENGTH;  /* window around the hit */
    if (start < 0) start = 0;
    long end = ptr + MAX_LINE_LENGTH;
    if (end > fs) end = fs;
    long bufsize = end - start;

    char *buf = malloc(bufsize + 1);
    fseek(file, start, SEEK_SET);
    fread(buf, 1, bufsize, file);
    fclose(file);

    /* Scan backwards from the hit to just past the previous newline... */
    char *startp = buf + (ptr - start);
    while (startp > buf && *startp != '\n')
        startp--;
    if (*startp == '\n')
        startp++;

    /* ...and forwards from just past the hit to the next newline, then cut. */
    char *finp = buf + (ptr - start) + 1;
    while (finp < buf + bufsize && *finp != '\n')
        finp++;
    *finp = '\0';

    char *line = strdup(startp);
    free(buf);
    return line;
}

There are probably still off-by-one edge cases lurking in there (a seek that lands exactly on a newline, or a line longer than MAX_LINE_LENGTH, for example), and the error handling is minimal. But you should get the idea.

Longer lines statistically have a higher chance of being selected than shorter lines, but the run time of this is effectively constant regardless of file size. If you have a lot of words of mostly similar length, the statisticians won't be happy (they never are anyway), but in practice it will be close enough.

Answer 7 (Guffa) — 2019-03-13 10:32

The algorithm is not implemented in a very good or clear way in that example... Some pseudo code that explains it better would be:

cnt = 0
while not end of file {
   read line
   cnt = cnt + 1
   if random(1 to cnt) = 1 {
      result = line
   }
}

As you see, the idea is that you read each line in the file and calculate the probability that it should be the chosen one. After reading the first line the probability is 100%, after reading the second line the probability is 50%, and so on. Each line ends up chosen with probability exactly 1/cnt: line i is kept with probability 1/i and survives each later line j with probability 1 - 1/j, so the product telescopes to 1/n for a file of n lines.

This can be expanded to picking N items by keeping an array of size N instead of a single variable, and calculating the probability that a line should replace one of the current entries in the array:

var result[1..N]
cnt = 0
while not end of file {
   read line
   cnt = cnt + 1
   if cnt <= N {
      result[cnt] = line
   } else if random(1 to cnt) <= N {
      result[random(1 to N)] = line
   }
}

Edit:
Here's the code implemented in C#:

public static List<string> GetRandomLines(string path, int count) {
    List<string> result = new List<string>();
    Random rnd = new Random();
    int cnt = 0;
    string line;
    using (StreamReader reader = new StreamReader(path)) {
        while ((line = reader.ReadLine()) != null) {
            cnt++;
            int pos = rnd.Next(cnt);   // one draw serves both decisions below
            if (cnt <= count) {
                // Insert the first N lines at random positions,
                // so the reservoir starts out in random order.
                result.Insert(pos, line);
            } else {
                // pos < count happens with probability N/cnt; when it
                // does, pos is uniform over the reservoir slots.
                if (pos < count) {
                    result[pos] = line;
                }
            }
        }
    }
    return result;
}

I made a test by running the method 100,000 times, picking 5 lines out of 20, and counting the occurrences of the lines. This is the result:

25105
24966
24808
24966
25279
24824
25068
24901
25145
24895
25087
25272
24971
24775
25024
25180
25027
25000
24900
24807

As you see, the distribution is as good as you could ever want. :)

(I moved the creation of the Random object out of the method when running the test, to avoid seeding problems as the seed is taken from the system clock.)

Note:
You might want to scramble the order of the resulting array if you want the lines to be randomly ordered. Since the first N lines are placed in the array in order, they are not randomly placed if they remain at the end. For example, if N is three or larger and the third line is picked, it will always be at the third position in the array.
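
(On the Perl side of the question, a final shuffle is a one-liner; List::Util is a core module:)

use List::Util qw(shuffle);
@words = shuffle @words;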

Edit 2:
I changed the code to use a List<string> instead of a string[]. That makes it easy to insert the first N items in a random order. I updated the test data from a new test run, so that you can see that the distribution is still good.
