What's the best way to return a random line in

2019-01-04 11:34发布

What's the best way to return a random line in a text file using C? It has to use the standard I/O library (<stdio.h>) because it's for Nintendo DS homebrew.

Clarifications:

  • Using a header in the file to store the number of lines won't work for what I want to do.
  • I want it to be as random as possible (the best being if each line has an equal probability of being chosen as every other line.)
  • The file will never change while the program is being run. (It's the DS, so no multi-tasking.)

8条回答
孤傲高冷的网名
2楼-- · 2019-01-04 11:42
  1. Get the length of the file.
  2. Pick a random position in the file.
  3. Seek to that position.
  4. Iterate forward until you find a newline character.
  5. If you don't find a newline character, seek back to the beginning.
  6. Use gets() to read the line.
查看更多
一纸荒年 Trace。
3楼-- · 2019-01-04 11:42

All you need do is generate one unscaled random number per line, while maintaining the maximum value for all the random numbers you generate. Whenever you update the maximum value, you overwrite the selected line with the current line.

At the end you get the line associated with the highest number rand() spat out, which should be equally probable among all your lines.

查看更多
Ridiculous、
4楼-- · 2019-01-04 11:43

Use a combination of Adam's random offset into the file approach and Mark's probability approach. Adam's method can get you randomly to a section of the file. Then you use Mark's approach to avoid preferring the larger strings. Mark's algorithm will prefer the first few strings from wherever it starts,

查看更多
爷的心禁止访问
5楼-- · 2019-01-04 11:48

Just a quick note about Mark Ransom's way of avoiding integer overflow: the DS has no FPU, so floating point division will be emulated in software and very slow. You'll want to avoid typecasting/promotion to float or double at all costs, if speed is a concern.

Here's a different way of avoiding integer overflow that avoids any floating point math:

if(rand() <= RAND_MAX / count)

The probabilities may be slightly skewed due to integer division, but this should certainly run much faster on a DS.

查看更多
Juvenile、少年°
6楼-- · 2019-01-04 11:56

Mark's answer is almost correct except for two issues:

  1. If a line is longer than length - 1 characters (including the newline), then the while loop will increment count at least twice for the same line: once for the first length - 1 characters, another for the next length - 1 characters, etc.
  2. The calculation of rand() * count can cause an integer overflow.

To solve the first problem, you can call fgets into a trash buffer until it returns NULL (indicating an I/O error or EOF with no data read) or the trash buffer contains a newline:

count = 0;
while (fgets(line, length, stream) != NULL)
{
    char *p = strchr(line, '\n');
    if (p != NULL) {
        assert(*p == '\n');
        *p = '\0'; // trim the newline
    }
    else { // haven't reached EOL yet. Read & discard the rest of the line.
#define TRASH_LENGTH 1024
        char trash[TRASH_LENGTH];
        while((p = fgets(trash, TRASH_LENGTH, stream)) != NULL) {
            if ((p = strchr(trash, '\n')) != NULL) // reached EOL
                break;
        }
    }
    assert(strchr(line, '\n') == NULL); // `line` does not contain a newline
    count++;
    // ...

The second problem can be solved with @tvanfosson's suggestion if floating-point arithmetic is not available:

int one_chance_in(size_t n)
{
    if (rand() % n == 0) // `rand` returns an integer in [0, `RAND_MAX`]
        return 1;
    else
        return 0;
}

But note that rand() % n is not a uniform, discrete random variable even if rand() is assumed to be one because the probability that rand() % n == 0 can be as much as 1/RAND_MAX higher than the desired probability 1/n. On my machine, RAND_MAX is 2147483647, so the difference is 4.66 × 10-10, but the C standard only requires that RAND_MAX be at least 32767 (3.05 × 10-5 difference).

Also, for anyone left wondering why this scheme works (as I was), it might be helpful to work through the calculation of the probability that the first line remains in keptline if there are m lines and generalize: In the first iteration of the loop, the probability that the first line is copied to keptline is 1/1. In the second iteration of the loop, the probability that the second line does not overwrite the first line is 1/2. In the third iteration, the probability that the third line does not overwrite the first line is 2/3. Continuing, the probability that the last line does not overwrite the first line is (m - 1)/m. Thus, the probability that the first line remains in keptline after iterating over all lines is:

1/1 × 1/2 × 2/3 × 3/4 × ... × (m - 2)/(m - 1) × (m - 1)/m = 1/m

The probability that the second line remains in keptline is:

1/2 × 2/3 × 3/4 × ... × (m - 2)/(m - 1) × (m - 1)/m = 1/m

The probability that the third line remains in keptline is:

1/3 × 3/4 × ... × (m - 2)/(m - 1) × (m - 1)/m = 1/m

Etc. They're all 1/m.

查看更多
太酷不给撩
7楼-- · 2019-01-04 11:57

I have an alternative solution. Since the platform is the DS, you'd probably not want to try to hold the file in memory. This reads the file twice. Once to count the lines and the 2nd time to find the line it wants. This would run slower than the other solutions suggested so far but it uses hardly any memory. I even wrote it in C for you (I omitted error handling):

main(int argc, char **argv)
{
    FILE *f;
    int nLines = 0;
    char line[1024];
    int randLine;
    int i;

    srand(time(0));
    f = fopen(argv[1], "r");

/* 1st pass - count the lines. */
    while(!feof(f))
    {
        fgets(line, 1024, f);
        nLines++;
    }

    randLine = rand() % nLines;
    printf("Chose %d of %d lines\n", randLine, nLines);

/* 2nd pass - find the line we want. */
    fseek(f, 0, SEEK_SET);
    for(i = 0; !feof(f) && i <= randLine; i++)
        fgets(line, 1024, f);

    printf("%s", line);
}

UPDATE: Oops, I should have read Brian R. Bondy's answer before I posted this but I was kinda obsessing over writing the code and didn't notice. This is almost the same except it doesn't store the line positions in an array. You could do it either way depending on how big the file is and whether speed is more important than saving memory.

查看更多
登录 后发表回答