What's the best way to return a random line in a text file using C? It has to use the standard I/O library (<stdio.h>
) because it's for Nintendo DS homebrew.
Clarifications:
- Using a header in the file to store the number of lines won't work for what I want to do.
- I want it to be as random as possible (the best being if each line has an equal probability of being chosen as every other line.)
- The file will never change while the program is being run. (It's the DS, so no multi-tasking.)
All you need do is generate one unscaled random number per line, while maintaining the maximum value for all the random numbers you generate. Whenever you update the maximum value, you overwrite the selected line with the current line.
At the end you get the line associated with the highest number rand() spat out, which should be equally probable among all your lines.
Use a combination of Adam's random offset into the file approach and Mark's probability approach. Adam's method can get you randomly to a section of the file. Then you use Mark's approach to avoid preferring the larger strings. Mark's algorithm will prefer the first few strings from wherever it starts,
Just a quick note about Mark Ransom's way of avoiding integer overflow: the DS has no FPU, so floating point division will be emulated in software and very slow. You'll want to avoid typecasting/promotion to float or double at all costs, if speed is a concern.
Here's a different way of avoiding integer overflow that avoids any floating point math:
The probabilities may be slightly skewed due to integer division, but this should certainly run much faster on a DS.
Mark's answer is almost correct except for two issues:
length - 1
characters (including the newline), then thewhile
loop will incrementcount
at least twice for the same line: once for the firstlength - 1
characters, another for the nextlength - 1
characters, etc.rand() * count
can cause an integer overflow.To solve the first problem, you can call
fgets
into a trash buffer until it returnsNULL
(indicating an I/O error or EOF with no data read) or the trash buffer contains a newline:The second problem can be solved with @tvanfosson's suggestion if floating-point arithmetic is not available:
But note that
rand() % n
is not a uniform, discrete random variable even ifrand()
is assumed to be one because the probability thatrand() % n == 0
can be as much as 1/RAND_MAX
higher than the desired probability 1/n
. On my machine,RAND_MAX
is 2147483647, so the difference is 4.66 × 10-10, but the C standard only requires thatRAND_MAX
be at least 32767 (3.05 × 10-5 difference).Also, for anyone left wondering why this scheme works (as I was), it might be helpful to work through the calculation of the probability that the first line remains in
keptline
if there are m lines and generalize: In the first iteration of the loop, the probability that the first line is copied tokeptline
is 1/1. In the second iteration of the loop, the probability that the second line does not overwrite the first line is 1/2. In the third iteration, the probability that the third line does not overwrite the first line is 2/3. Continuing, the probability that the last line does not overwrite the first line is (m - 1)/m. Thus, the probability that the first line remains inkeptline
after iterating over all lines is:The probability that the second line remains in
keptline
is:The probability that the third line remains in
keptline
is:Etc. They're all 1/m.
I have an alternative solution. Since the platform is the DS, you'd probably not want to try to hold the file in memory. This reads the file twice. Once to count the lines and the 2nd time to find the line it wants. This would run slower than the other solutions suggested so far but it uses hardly any memory. I even wrote it in C for you (I omitted error handling):
UPDATE: Oops, I should have read Brian R. Bondy's answer before I posted this but I was kinda obsessing over writing the code and didn't notice. This is almost the same except it doesn't store the line positions in an array. You could do it either way depending on how big the file is and whether speed is more important than saving memory.