Question:
I'm familiar with the algorithm for reading a single random line from a file without reading the whole file into memory. I wonder if this technique can be extended to N random lines?
The use case is for a password generator which concatenates N random words pulled out of a dictionary file, one word per line (like /usr/share/dict/words). You might come up with angela.ham.lewis.pathos. Right now it reads the whole dictionary file into an array and picks N random elements from that array. I would like to eliminate the array, or any other in-memory storage of the file, and only read the file once.
(No, this isn't a practical optimization exercise. I'm interested in the algorithm.)
Update:
Thank you all for your answers.
Answers fell into three categories: modifications of the full read algorithm, random seek, or index the lines and seek to them randomly.
The random seek is much faster and constant with respect to file size, but it distributes according to byte position in the file rather than uniformly over words. It also allows duplicates (that can be avoided, but it makes the algorithm O(inf)). Here's my reimplementation of my password generator using that algorithm. I realize that by reading forward from the seek point, rather than backwards, it has an off-by-one error should the seek fall in the last line. Correcting it is left as an exercise for the editor.
#!/usr/bin/perl -lw
my $Words = "/usr/share/dict/words";
my $Max_Length = 8;
my $Num_Words = 4;
my $size = -s $Words;
my @words;
open my $fh, "<", $Words or die $!;
for(1..$Num_Words) {
    seek $fh, int rand $size, 0 or die $!;   # jump to a random byte
    <$fh>;                                   # throw away the rest of that line
    my $word = <$fh>;                        # take the next full line
    chomp $word;
    redo if length $word > $Max_Length;
    push @words, $word;
}
print join ".", @words;
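One way to correct that, as a rough sketch, is to retry with a fresh seek whenever the second read returns undef, which means the seek landed inside the last line:

for(1..$Num_Words) {
    seek $fh, int rand $size, 0 or die $!;
    <$fh>;                              # throw away the partial line
    my $word = <$fh>;
    redo unless defined $word;          # seek fell in the last line; re-seek
    chomp $word;
    redo if length $word > $Max_Length;
    push @words, $word;
}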
And then there's Guffa's answer, which was what I was looking for: an extension of the original algorithm. It's slower, since it has to read the whole file, but it distributes by word, allows filtering without changing the efficiency of the algorithm, and (I think) produces no duplicates.
#!/usr/bin/perl -lw
my $Words = "/usr/share/dict/words";
my $Max_Length = 8;
my $Num_Words = 4;
my @words;
open my $fh, "<", $Words or die $!;
my $count = 0;
while(my $line = <$fh>) {
    chomp $line;
    $count++;
    if( $count <= $Num_Words ) {
        # fill the first N slots directly
        $words[$count-1] = $line;
    }
    elsif( rand($count) <= $Num_Words ) {
        # with probability N/count, replace a random slot
        $words[rand($Num_Words)] = $line;
    }
}
print join ".", @words;
Finally, the index and seek algorithm has the advantage of distributing by word rather than file size. The disadvantage is it reads the whole file and memory usage scales linearly with the number of words in the file. Might as well use Guffa's algorithm.
Answer 1:
The algorithm is not implemented in a very good and clear way in that example... Some pseudocode that explains it better would be:
cnt = 0
while not end of file {
    read line
    cnt = cnt + 1
    if random(1 to cnt) = 1 {
        result = line
    }
}
As you see, the idea is that you read each line in the file and calculate the probability that it should be the one chosen. After reading the first line the probability is 100%, after reading the second line it is 50%, and so on. Every line ends up selected with the same overall probability 1/cnt: line i is picked with probability 1/i and then survives each later line j with probability (j-1)/j, and the product telescopes to 1/cnt.
This can be expanded to picking N items by keeping an array with the size N instead of a single variable, and calculate the probability for a line to replace one of the current ones in the array:
var result[1..N]
cnt = 0
while not end of file {
    read line
    cnt = cnt + 1
    if cnt <= N {
        result[cnt] = line
    } else if random(1 to cnt) <= N {
        result[random(1 to N)] = line
    }
}
Edit:
Here's the code implemented in C#:
public static List<string> GetRandomLines(string path, int count) {
    List<string> result = new List<string>();
    Random rnd = new Random();
    int cnt = 0;
    string line;
    using (StreamReader reader = new StreamReader(path)) {
        while ((line = reader.ReadLine()) != null) {
            cnt++;
            int pos = rnd.Next(cnt);          // 0 .. cnt-1
            if (cnt <= count) {
                result.Insert(pos, line);     // first N lines: insert at a random position
            } else {
                if (pos < count) {            // i.e. with probability count/cnt
                    result[pos] = line;
                }
            }
        }
    }
    return result;
}
I tested by running the method 100,000 times, picking 5 lines out of 20, and counting the occurrences of each line. This is the result:
25105
24966
24808
24966
25279
24824
25068
24901
25145
24895
25087
25272
24971
24775
25024
25180
25027
25000
24900
24807
As you see, the distribution is as good as you could ever want. :)
(I moved the creation of the Random object out of the method when running the test, to avoid seeding problems, as the seed is taken from the system clock.)
Note:
You might want to scramble the order of the resulting array if you want the lines to be in random order. Since the first N lines are placed in the array in order, they are not randomly placed if they remain at the end. For example, if N is three or larger and the third line is picked, it will always be at the third position in the array.
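In Perl, for example, that final scramble is a one-liner (a sketch, not part of this answer's code; the array name mirrors the pseudocode above):

use List::Util qw(shuffle);
@result = shuffle @result;   # randomize the final order of the N kept lines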
Edit 2:
I changed the code to use a List<string> instead of a string[]. That makes it easy to insert the first N items in a random order. I updated the test data from a new test run, so that you can see that the distribution is still good.
Answer 2:
Now my Perl is not what it used to be, but trusting the implicit claim in your reference (that the distribution of line numbers thus selected is uniform), it seems this should work:
srand;
(rand($.) < 1 && ($line1 = $_)) || (rand($.) <1 && ($line2 = $_)) while <>;
Just like the original algorithm, this is one-pass and constant memory.
Edit
I just realized you need N, not 2. You can repeat the OR-ed expression N times if you know N in advance.
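A sketch of that generalization (not from the original answer), using one independent single-line sample per slot instead of chaining with ||, so each slot is uniform on its own; note that two slots can end up holding the same line:

my $N = 4;              # assumed to be known in advance
my @lines;
while (<>) {
    for my $slot (0 .. $N - 1) {
        $lines[$slot] = $_ if rand($.) < 1;   # classic one-line reservoir, per slot
    }
}
print @lines;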
Answer 3:
This is just about the first time I have seen Perl code ... it is incredibly unreadable ... ;) But that should not matter. Why don't you just repeat the cryptic line N times?
If I had to write this, I would just seek to a random position in the file, read to the end of the current line (the next newline), and then read one full line up to the following newline. Add some error handling for the case where you seek into the last line, repeat all of this N times, and you are done. I guess
srand;
rand($.) < 1 && ($line = $_) while <>;
is the Perl way to do such a single step. You could also read backwards from the initial position up to the previous newline or the beginning of the file and then read a line forward again. But this doesn't really matter.
UPDATE
I have to admit that seeking somewhere into the file will not generate a perfectly uniform distribution, because of the differing line lengths. Whether this fluctuation matters depends on the usage scenario, of course.
If you need a perfectly uniform distribution, you need to read the whole file at least once to get the number of lines. In that case the algorithm given by Guffa is probably the cleverest solution, because it requires reading the file exactly once.
Answer 4:
If you don't need to do it within the scope of Perl, shuf is a really nice command-line utility for this. To do what you're looking to do:
$ shuf -n N file > newfile
Answer 5:
Quick and dirty bash
function randomLine {
    # crude: line count via wc, "random" index from the clock plus $RANDOM
    numlines=`wc -l $1 | awk '{print $1}'`
    t=`date +%s`
    t=`expr $t + $RANDOM`
    a=`expr $t % $numlines + 1`
    RETURN=`head -n $a $1 | tail -n 1`
    return 0
}
randomLine test.sh
echo $RETURN
Answer 6:
Pick a random point in the file, look backwards for previous EOL, search forward for current EOL, and return the line.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE_LENGTH 256

/* Pick a random byte offset, scan back to the previous newline and
   forward to the next one, and return that line (caller frees it). */
char *random_line(const char *path) {
    FILE *file = fopen(path, "rb");
    if (!file) return NULL;
    fseek(file, 0, SEEK_END);
    long fs = ftell(file);              /* file size in bytes */
    long ptr = rand() % fs;             /* 0 to fs-1 */
    long start = ptr - MAX_LINE_LENGTH;
    if (start < 0) start = 0;
    long end = ptr + MAX_LINE_LENGTH;
    if (end > fs) end = fs;
    long bufsize = end - start;
    char *buf = malloc(bufsize + 1);    /* +1 for the terminator */
    fseek(file, start, SEEK_SET);
    fread(buf, 1, bufsize, file);
    fclose(file);
    char *startp = buf + (ptr - start); /* walk back to the start of the line */
    while (startp > buf && *(startp - 1) != '\n')
        startp--;
    char *finp = startp;                /* walk forward to the end of the line */
    while (finp < buf + bufsize && *finp != '\n')
        finp++;
    *finp = '\0';
    memmove(buf, startp, strlen(startp) + 1);
    return buf;                         /* caller frees */
}
There may still be off-by-one errors and sloppy memory management in there, and other horrors. If it compiles on the first try, you get a nickel. (Please send a self-addressed stamped envelope and $5 handling to receive your free nickel.)
But you should get the idea.
Longer lines statistically have a higher chance of being selected than shorter ones, but the run time of this is effectively constant regardless of file size. If your words are mostly of similar length, the statisticians still won't be happy (they never are anyway), but in practice it will be close enough.
Answer 7:
I'd say:
- Read the file and count the \n characters. That's the number of lines; let's call that L.
- Store their positions in a small array in memory.
- Get two random line numbers lower than L, fetch their offsets, and you're done.
You'd use just a small array and read the whole file once + 2 lines afterwards.
Answer 8:
You could do a two-pass algorithm. First get the position of each newline, pushing those positions into a vector. Then pick random items from that vector; call such an index i.
Read from the file between positions v[i] and v[i+1] to get your line.
During the first pass, read the file with a small buffer so as not to pull it all into RAM at once.
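A minimal Perl sketch of that two-pass idea (an added illustration, not code from the answer): index the line-start offsets on the first pass, then seek to N of them chosen at random.

#!/usr/bin/perl -lw
use List::Util qw(shuffle);

my $Words = "/usr/share/dict/words";
my $Num_Words = 4;

open my $fh, "<", $Words or die $!;

# Pass 1: index the byte offset at which each line starts.
my @offsets = (0);
while (<$fh>) {
    push @offsets, tell $fh;     # offset of the next line's first byte
}
pop @offsets;                    # the final tell() is EOF, not a line start

# Pass 2: seek straight to N distinct, randomly chosen line starts.
my @words;
for my $off ((shuffle @offsets)[0 .. $Num_Words - 1]) {
    seek $fh, $off, 0 or die $!;
    my $word = <$fh>;
    chomp $word;
    push @words, $word;
}
print join ".", @words;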