What's the best way to return a random line in a text file using C? It has to use the standard I/O library (<stdio.h>
) because it's for Nintendo DS homebrew.
Clarifications:
- Using a header in the file to store the number of lines won't work for what I want to do.
- I want it to be as random as possible (the best being if each line has an equal probability of being chosen as every other line.)
- The file will never change while the program is being run. (It's the DS, so no multi-tasking.)
Read each line, and use a random number to choose whether to keep that line or ignore it. For the first line, you want odds of 1:1 to keep; for the second, you want odds of 1:2, etc.
I haven't verified that this has the proper random qualities, but it seems right at first glance.
It has been pointed out that integer overflow would quickly become a problem with the way the comparison is coded, and I had independently reached the same conclusion myself. There are probably many ways to fix it, but this is the first that comes to mind:
This method is good because:
i) You can keep generating random lines at no big cost
ii) You only have to read the file a total of 1 time + 1 line at a time per random line you want. The excess read data is only equal to the size of the file.
iii) It gives each line a fair chance no matter what its position is in the file.
iv) It gives each line a fair chance no matter what its length is in the file.
The suggestion:
I would suggest a 2 pass algorithm. Well really it's a 1 pass + N lines. Where N is the number of random lines you want.
The first pass you would use to calculate how many lines and the start positions of each line.
You then take a random number from 0 to the number of lines minus 1. Use that random number, which is your line index, get the start position for that line index. Seek to that position.
You then have only 1 more read needed, and you know the exact size. (until the start index of the next line)
How to store the number of lines and the index of each line:
To store the number of lines, you can obviously just use an int.
If you can use a vector then you can add each line index into the vector. If not you can just create an array of ints with the max number of lines you think there will be. Then index into that array.
Other answers:
Another answer mentioned that you can pick a random number from 1 to the size of the file, and then use the closest newline. But this won't work. For example you might have 1 line that is really long and the others that are not that long. In that case you would have an uneven distribution.