So I'm testing and calculating the probabilities of certain dice rolls, for a game. The base case is rolling one 10-sided die.
I did a million samples of this, and ended up with the following proportions:
Result
0 0.000000000000000%
1 10.038789961210000%
2 10.043589956410000%
3 9.994890005110000%
4 10.025289974710000%
5 9.948090051909950%
6 9.965590034409970%
7 9.990190009809990%
8 9.985490014509990%
9 9.980390019609980%
10 10.027589972410000%
These should of course all be 10%. There is a standard deviation of 0.0323207% in these results. That, to me, seems rather high. Is it just coincidence? As I understand it, the random module provides proper pseudo-random numbers, i.e. ones from a method that passes the statistical tests for randomness. Or are these pseudo-pseudo-random number generators?
Should I be using cryptographic pseudo-random number generators? I'm fairly sure I don't need a true random number generator (see http://www.random.org/, http://en.wikipedia.org/wiki/Hardware_random_number_generator).
I am currently regenerating all my results with 1 billion samples (cos why not, I have a crunchy server at my disposal, and some sleep to do).
I reran the OP's exercise with one billion iterations:
Here's the (reformatted) result:
See the other answers to this question for their excellent analysis.
From the random module documentation:
From the Wikipedia article on the Mersenne Twister:
If you have an OS-specific randomness source, available through os.urandom(), then you can use the random.SystemRandom() class instead. Most of the random module functions are available as methods on that class. It would perhaps be more suitable for cryptographic purposes, quoting the docs again:
Python 3.6 adds a secrets module with convenience methods to produce random data suitable for cryptographic purposes:
Martijn's answer is a pretty succinct review of the random number generators that Python has access to.
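As a minimal sketch of the generators mentioned above, here is how a single 10-sided roll could be drawn from each of them. The function names (roll_mt, roll_system, roll_secrets) are made up for illustration; random.randint, random.SystemRandom and secrets.randbelow are the standard-library calls.

```python
import random
import secrets

# One shared SystemRandom instance; it draws from os.urandom() underneath.
_system_rng = random.SystemRandom()

def roll_mt(sides=10):
    """Roll with the default generator (Mersenne Twister)."""
    return random.randint(1, sides)

def roll_system(sides=10):
    """Roll using the OS randomness source via random.SystemRandom."""
    return _system_rng.randint(1, sides)

def roll_secrets(sides=10):
    """Roll using the secrets module (Python 3.6+)."""
    # randbelow(n) returns an int in 0..n-1, so shift it to 1..n.
    return secrets.randbelow(sides) + 1

if __name__ == "__main__":
    print(roll_mt(), roll_system(), roll_secrets())
```

For a dice game the first variant is normally enough; the other two cannot be seeded, so results drawn from them are not reproducible.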
If you want to check out the properties of the generated pseudo-random data, download random.zip from http://www.fourmilab.ch/random/, and run it on a big sample of random data. The χ² (chi-squared) test in particular is very sensitive to departures from randomness. For a sequence to be considered really random, the percentage from the χ² test should be between 10% and 90%.
For a game I'd guess that the Mersenne Twister that Python uses internally should be sufficiently random (unless you're building an online casino :-).
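The same kind of χ² check can be sketched directly on the dice proportions posted in the question. This is not what ent does (ent works on bytes); it is a plain goodness-of-fit statistic against a uniform distribution. The sample size of 1,000,000 is taken from the question's "a million samples", and the two cut-off values are the usual table values for 9 degrees of freedom at the 10% and 90% points.

```python
# Proportions (in percent) for faces 1..10, copied from the question.
observed_percent = [
    10.03878996121, 10.04358995641, 9.99489000511, 10.02528997471,
    9.94809005190995, 9.96559003440997, 9.99019000980999,
    9.98549001450999, 9.98039001960998, 10.02758997241,
]
n_rolls = 1_000_000  # assumption: "a million samples"

def chi_squared_uniform(percentages, n):
    """Chi-squared statistic against equal expected counts per face."""
    expected = n / len(percentages)
    return sum((n * p / 100 - expected) ** 2 / expected for p in percentages)

chi2 = chi_squared_uniform(observed_percent, n_rolls)

# Table values for 9 degrees of freedom: 10% of the distribution lies
# below LOWER_10 and 90% lies below UPPER_90.
LOWER_10, UPPER_90 = 4.168, 14.684

if __name__ == "__main__":
    print(f"chi-squared = {chi2:.2f}")
    print("within the 10%-90% band:", LOWER_10 < chi2 < UPPER_90)
```

With these figures the statistic comes out at roughly 9.4, which sits comfortably inside the band.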
If you want pure randomness, and if you are using Linux, you can read from /dev/random. This only produces random data from the kernel's entropy pool (which is gathered from the unpredictable times that interrupts arrive), so it will block if you exhaust it (on older kernels; since Linux 5.6 it only blocks until the pool has been initialized). This entropy is used to initialize (seed) the PRNG used by /dev/urandom. On FreeBSD, the PRNG that supplies data for /dev/random uses the Yarrow algorithm, which is generally regarded as being cryptographically secure.
Edit: I ran some tests on bytes from random.randint. First creating a million random bytes:
Then I ran the ent program from Fourmilab on it:
Now for the χ² test, the further you get from 50%, the more suspect the data is. If one is very fussy, values <10% or >90% are deemed unacceptable. John Walker, author of ent, calls this value "almost suspect".
As a contrast, here is the same analysis of 10 MiB from FreeBSD's Yarrow PRNG that I ran earlier:
While there seems to be not much difference in the other data, the χ² percentage is much closer to 50%.
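For anyone wanting to repeat the byte test, a file of random bytes from random.randint could be produced along these lines. This is only a sketch and not necessarily how the file above was made; the helper names and the file name sample.bin are hypothetical.

```python
import random

def random_bytes(n, seed=None):
    """Return n bytes, each drawn with randint(0, 255)."""
    rng = random.Random(seed)  # seed=None seeds from the OS or the clock
    return bytes(rng.randint(0, 255) for _ in range(n))

def write_sample(path, n=1_000_000):
    """Write n random bytes to path, ready to be fed to ent."""
    with open(path, "wb") as f:
        f.write(random_bytes(n))

if __name__ == "__main__":
    write_sample("sample.bin")  # hypothetical file name
```

The resulting file can then be analysed with `ent sample.bin`.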
These results are very close to what you'd expect, and there's a simple calculation you can do to check that. If you roll 1,000,000 D10s and count the number of 1s (say) the mean of that random variable is 100,000 (number of trials * probability of success) and the variance is 90,000 (number of trials * probability of success * probability of failure), so the standard deviation is sqrt(90,000)=300. So you should expect to get something about 300 away from 100,000, i.e. 10% +/- 0.03%.
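The calculation above can be sketched in a few lines, together with a small simulation to compare against. The function names and the fixed seed are illustrative choices only.

```python
import math
import random

def expected_sd_percent(n_rolls, sides=10):
    """Binomial standard deviation of one face's share, in percent."""
    p = 1 / sides
    sd_count = math.sqrt(n_rolls * p * (1 - p))  # sqrt(n * p * (1 - p))
    return 100 * sd_count / n_rolls

def simulate(n_rolls, sides=10, seed=12345):
    """Roll the die n_rolls times; return each face's share in percent."""
    rng = random.Random(seed)  # seeded only so that runs are repeatable
    counts = [0] * sides
    for _ in range(n_rolls):
        counts[rng.randint(1, sides) - 1] += 1
    return [100 * c / n_rolls for c in counts]

if __name__ == "__main__":
    for n in (10_000, 1_000_000):
        print(n, "rolls: expected sd about", round(expected_sd_percent(n), 4), "%")
    print([round(p, 3) for p in simulate(100_000)])
```

For 1,000,000 rolls the expected spread is 0.03%. Because it shrinks with the square root of the number of rolls, a billion rolls should bring it down to about 0.001%.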
It is indeed normal for random numbers to come up imperfectly distributed with a good PRNG. However, the more numbers you generate, the less you should see that.
BTW, I'm getting a standard deviation of 0.03066, which is slightly lower than what you gave.
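One plausible explanation for the gap between 0.03066 and the 0.0323207 in the question is the choice between the population and the sample standard deviation. A quick check on the posted figures, using the statistics module:

```python
import statistics

# Proportions (in percent) for faces 1..10, copied from the question.
percentages = [
    10.03878996121, 10.04358995641, 9.99489000511, 10.02528997471,
    9.94809005190995, 9.96559003440997, 9.99019000980999,
    9.98549001450999, 9.98039001960998, 10.02758997241,
]

pop_sd = statistics.pstdev(percentages)    # divides by N
sample_sd = statistics.stdev(percentages)  # divides by N - 1

if __name__ == "__main__":
    print(f"population sd: {pop_sd:.5f}")
    print(f"sample sd:     {sample_sd:.7f}")
```

The population figure comes out near 0.03066 and the sample figure near 0.03232, so both numbers are consistent with the same data.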
Yes, it is statistically random for all practical purposes. The random variation you saw is perfectly normal. In fact it would be a poor RNG if it didn't have variation like that.
Since the period of the PRNG is 2**19937-1, you would need to generate more numbers than there are atoms in the universe before you see a nonrandom distribution. Note that if you generate 623-dimensional vectors, it becomes nonrandom much sooner.
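To put the size of that period in perspective, its number of decimal digits can be worked out with a logarithm. The figure of roughly 10**80 atoms in the observable universe is a commonly quoted estimate and is used here purely as an assumption.

```python
import math

MT_EXPONENT = 19937  # the period is 2**19937 - 1

# Number of decimal digits of 2**19937 - 1, computed via log10 so that
# the huge integer never has to be converted to a string.
period_digits = math.floor(MT_EXPONENT * math.log10(2)) + 1

ATOMS_EXPONENT = 80  # assumption: about 10**80 atoms

if __name__ == "__main__":
    print("digits in the period:", period_digits)
    print("digits in the atom estimate:", ATOMS_EXPONENT + 1)
```

The period has 6002 digits, against 81 for the atom estimate.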