可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

Consider an algorithm to test the probability that a certain number is picked from a set of N unique numbers after a specific number of tries (for example, with N=2, what's the probability in Roulette (without 0) that it takes X tries for Black to win?).

The correct distribution for this is pow(1-1/N,X-1)*(1/N).

However, when I test this using the following code, there is always a deep ditch at X=31, independently from N, and independently from the seed.

Is this an intrinsic flaw that cannot be prevented due to the implementation specifics of the PRNG in use, is this a real bug, or am I overlooking something obvious?

// C

#include <sys/times.h>
#include <math.h>
#include <stdio.h>

int array[101];
void main(){

    int nsamples=10000000;
    double breakVal,diffVal;
    int i,cnt;

    // seed, but doesn't change anything
    struct tms time;
    srandom(times(&time));

    // sample
    for(i=0;i<nsamples;i++){
        cnt=1;
        do{
            if((random()%36)==0) // break if 0 is chosen
                break;
            cnt++;
        }while(cnt<100);
        array[cnt]++;
    }

    // show distribution
    for(i=1;i<100;i++){
        breakVal=array[i]/(double)nsamples; // normalize
        diffVal=breakVal-pow(1-1/36.,i-1)*1/36.; // difference to expected value
        printf("%d %.12g %.12g\n",i,breakVal,diffVal);
    }
}

Tested on an up-to-date Xubuntu 12.10 with libc6 package 2.15-0ubuntu20 and Intel Core i5-2500 SandyBridge, but I discovered this already a few years ago on an older Ubuntu machine.

I also tested this on Windows 7 using Unity3D/Mono (not sure which Mono version, though), and here the ditch happens at X=55 when using System.Random, while Unity's builtin Unity.Random has no visible ditch (at least not for X<100).

The distribution:

The differences:

回答1:

This is due to glibc's random() function not being random enough. According to this page, for the random numbers returned by random(), we have:

o_i = (o_i-3 + o_i-31) % 2^31

or:

o_i = (o_i-3 + o_i-31 + 1) % 2^31.

Now take x_i = o_i % 36, and suppose the first equation above is the one used (this happens with a 50% chance for each number). Now if x_i-31=0 and x_i-3!=0, then the chance that x_i=0 is less than 1/36. This is because 50% of the time o_i-31 + o_i-3 will be less than 2^31, and when that happens,

x_i = o_i % 36 = (o_i-3 + o_i-31) % 36 = o_i-3 % 36 = x_i-3,

which is nonzero. This causes the ditch you see 31 samples after a 0 sample.

回答2:

What's being measured in this experiment is the interval between successful trials of a Bernoulli experiment, where success is defined as random() mod k == 0 for some k (36 in the OP). Unfortunately, it is marred by the fact that the implementation of random() means that the Bernoulli trials are not statistically independent.

We'll write rnd_i for the i^th output of `random()' and we note that:

rnd_i = rnd_i-31 + rnd_i-3 with probability 0.75

rnd_i = rnd_i-31 + rnd_i-3 + 1 with probability 0.25

(See below for a proof outline.)

Let's suppose rnd_i-31 mod k == 0 and we're currently looking at rnd_i. Then it must be the case that rnd_i-3 mod k ≠ 0, because otherwise we would have counted the cycle as being length k-3.

But (most of the time) (mod k): rnd_i = rnd_i-31 + rnd_i-3 = rnd_i-3 ≠ 0.

So the current trial is not statistically independent of the previous trials, and the 31^st trial after a success is much less likely to succeed than it would in an unbiased series of Bernoulli trials.

The usual advice in using linear-congruential generators, which doesn't actually apply to the random() algorithm, is to use the high-order bits instead of the low-order bits, because high-order bits are "more random" (that is, less correlated with successive values). But that won't work in this case either, because the above identities hold equally well for the function high log k bits as for the function mod k == low log k bits.

In fact, we might expect a linear-congruential generator to work better, particularly if we use the high-order bits of the output, because although the LCG is not particularly good at Monte Carlo simulations, it does not suffer from the linear feedback of random().

random algorithm, for the default case:

Let state be a vector of unsigned longs. Initialize state₀...state₃₀ using a seed, some fixed values, and a mixing algorithm. For simplicity, we can consider the state vector to be infinite, although only the last 31 values are used so it's actually implemented as a ring buffer.

To generate rnd_i: (Note: ⊕ is addition mod 2³².)



state_i = state_i-31 ⊕ state_i-3

rnd_i = (state_i - (state_i mod 2)) / 2

Now, note that:

(i + j) mod 2 = i mod 2 + j mod 2    if i mod 2 == 0 or j mod 2 == 0

(i + j) mod 2 = i mod 2 + j mod 2 - 2 if i mod 2 == 1 and j mod 2 == 1

If i and j are uniformly distributed, the first case will occur 75% of the time, and the second case 25%.

So, by substitution in the generation formula:

rnd_i = (state_i-31 ⊕ state_i-3 - ((state_i-31 + state_i-3) mod 2)) / 2

     = ((state_i-31 - (state_i-31 mod 2)) ⊕ (state_i-3 - (state_i-3 mod 2))) / 2 or

     =  ((state_i-31 - (state_i-31 mod 2)) ⊕ (state_i-3 - (state_i-3 mod 2)) + 2) / 2

The two cases can be further reduced to:

rnd_i = rnd_i-31 ⊕ rnd_i-3

rnd_i = rnd_i-31 ⊕ rnd_i-3 + 1

As above, the first case occurs 75% of the time, assuming that rnd_i-31 and rnd_i-3 are independently drawn from a uniform distribution (which they're not, but it's a reasonable first approximation).

回答3:

As others pointed out, random() is not random enough.

Using the higher bits instead of the lower ones does not help in this case. According to the manual (man 3 rand), old implementations of rand() had a problem in the lower bits. That's why random() is recommended instead. Though, the current implementation of rand() uses the same generator as random().

I tried the recommended correct use of the old rand():

if ((int)(rand()/(RAND_MAX+1.0)*36)==0)

...and got the same deep ditch at X=31

Interstingly, if I mix rand()'s numbers with another sequence, I get rid of the ditch:

unsigned x=0;
//...

        x = (179*x + 79) % 997;
        if(((rand()+x)%36)==0)

I am using an old Linear Congruential Generator. I chose 79, 179 and 997 at random from a primes table. This should generate a repeating sequence of length 997.

That said, this trick probably introduced some non-randomness, some footprint... The resulting mixed sequence will surely fail other statistical tests. x never takes the same value in consecutive iterations. Indeed, it takes exactly 997 iterations to repeat every value.

''[..] random numbers should not be generated with a method chosen at random. Some theory should be used." (D.E.Knuth, "The Art of Computer Programming", vol.2)

For simulations, if you want to be sure, use the Mersenne Twister