random over a range: is number bias present for newer methods?

Posted 2019-01-24 23:12

Question:

Reading various other SO questions, I learned that using rand() % N can introduce a bias into the pseudo-random numbers you get, so you usually have to add some range handling.

However, in all cases rand() was mentioned, and not the newer random() or arc4random() functions or the native C++11 facilities. What happens when you run these routines over a set? Do you get a bias like rand()?

Thanks.

Answer 1:

What happens when you run these routines over a set? Do you get a bias like rand()?

The answer: it depends on the relation between the size of the range returned by the generator and the divisor in the modulo operation. If the divisor does not evenly divide the generator's range, the distribution will be skewed. The bias ratio lies in the interval (1, 2], where 1 means no bias (as for a uniform distribution), and the bias increases with the divisor. For arc4random(), which returns a uniform 32-bit value, this means the distribution is skewed whenever the modulo divisor is not an even divisor of 2^32. The rationale is explained below.
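For instance (a minimal sketch; arc4random() and arc4random_uniform() are provided by <stdlib.h> on BSD and macOS):

#include <stdint.h>
#include <stdlib.h>   // arc4random, arc4random_uniform (BSD/macOS)

int main()
{
    // Skewed: 15 is not an even divisor of 2^32, so small residues
    // occur slightly more often than large ones.
    uint32_t biased = arc4random() % 15;

    // Unbiased: arc4random_uniform() internally rejects the generator
    // outputs that would cause modulo bias.
    uint32_t unbiased = arc4random_uniform(15);

    (void)biased;
    (void)unbiased;
}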


Introduction. The bias

Imagine that we are trying to simulate a uniform integer distribution over the interval [0, 99] with

int x = rand() % 100;

The % operator skews the probability distribution of x because RAND_MAX, the largest value rand() can return, need not be equal to k * 100 + 99. Picture the range 0 to RAND_MAX cut into consecutive blocks of length 100: the last block will generally be incomplete and will not produce the full range 0-99. As a result, more generator outputs map to 0, 1, 2, ..., p than to p + 1, ..., 98, 99 (one extra occurrence for each number in 0, 1, 2, ..., p). The inaccuracy grows with divisors that do not evenly divide the range, and the maximum bias relative to the uniform distribution is 2.
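To make this concrete, here is a small sketch that exhaustively counts how many generator outputs map to each residue of % 100, assuming a generator with RAND_MAX = 32767 (the minimum the C standard allows):

#include <iostream>

int main()
{
    const int N = 32768;        // generator outputs 0 .. 32767, so N = RAND_MAX + 1
    int counts[100] = {};
    for (int x = 0; x < N; ++x)
        ++counts[x % 100];      // tally which residue each output maps to

    // 32768 = 327 * 100 + 68, so residues 0..67 occur 328 times
    // and residues 68..99 occur only 327 times.
    std::cout << "count for 0:  " << counts[0] << "\n";    // 328
    std::cout << "count for 99: " << counts[99] << "\n";   // 327
}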

In the sections below we show that the bias, measured as the ratio of the probability of getting a number from [0, p] to the probability of getting a number from [p + 1, n], equals (k + 1) / k, and we confirm this with two examples.


Formula

We will show exactly what bias the modulo operation introduces (the operation applied to a uniform generator in order to trim its output range). We will work in terms of the formula

x = rand() % (n + 1)

where rand() is some generator producing N equally likely values 0, 1, ..., N - 1, and (n + 1) is the divisor in the modulo operation.

The numbers in the range [0, n] divide into those that occur k + 1 times in a single trial (the numbers [0, p]) and those that occur k times (the numbers [p + 1, n]), where a trial means "take one number from the distribution given by x = rand() % (n + 1)". Here p is the remainder when the maximum generator output (i.e. RAND_MAX = N - 1) is divided by (n + 1), the size of the desired range:

p = (N - 1) % (n + 1)

N - 1 = k * (n + 1) + p

and k is the quotient

k = (N - 1 - p) / (n + 1)

In a single trial there are

(p + 1) * (k + 1) + (n - p) * k = p + 1 + k * (n + 1) = N

possible outcomes. Thus the probability of getting a particular element that repeats k times is k / N. Let us denote

f_0 = (k + 1) / N, the probability of each element from [0, p]

f_1 = k / N, the probability of each element from [p + 1, n]

We express the bias of sampling from this transformed distribution, relative to the uniform distribution, as the ratio of the probability of an element from [0, p] to the probability of an element from [p + 1, n]:

bias = f_0 / f_1 = (k + 1) / k
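These quantities can be computed directly. A minimal sketch using the values of Example 1 below (N = 20, m = 15; the variable names are ours, chosen for illustration):

#include <iostream>

int main()
{
    const long long N = 20;   // the generator produces N equally likely values 0 .. N - 1
    const long long m = 15;   // the divisor n + 1

    long long p = (N - 1) % m;         // largest value that repeats k + 1 times
    long long k = (N - 1 - p) / m;     // repeat count for the values p + 1 .. n
    double bias = double(k + 1) / k;   // f_0 / f_1

    std::cout << "p = " << p << ", k = " << k << ", bias = " << bias << "\n";
    // prints: p = 4, k = 1, bias = 2
}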

So, do numbers appear twice as often?

No. The fact that some numbers repeat does not by itself imply a ratio of 2. That ratio is just the special case in which the generator's range is divided into exactly two subranges. In general the bias ratio is (k + 1) / k, and it decreases asymptotically towards 1 as the divisor n + 1 shrinks towards 1 (and k grows towards N): for k = 1 the bias is 2, for k = 3 it is 4/3, and for k = 100 it is only 1.01.


Examples

We now consider two simple examples (as suggested by @dyp). In both we generate 1000 * 1000 samples from the distribution given by

x = rand() % m

with the generator being std::uniform_int_distribution<> dist(0, 19) (so N = 20) and the divisor m = n + 1 equal first to 15 and then to 6.

Example 1

int x = rand() % 15; // n + 1 = 15, rand() is uniform over [0, 19]

The test program is:

#include <iostream>
#include <random>
#include <vector>

int main()
{
    std::random_device rd;
    std::mt19937 mt(rd());
    std::uniform_int_distribution<> dist(0, 19);   // uniform over [0, 19]: N = 20 outcomes
    std::vector<int> v(15);                        // histogram of x = dist(mt) % 15
    const int runs = 1000 * 1000;
    for (int i = 0; i < runs; ++i)
    {
        ++v[dist(mt) % v.size()];                  // tally each residue
    }

    for (std::size_t i = 0; i < v.size(); ++i)     // size_t index avoids a signed/unsigned mismatch
    {
        std::cout << i << ": " << v[i] << "\n";
    }
}


result:

0: 100500
1: 100016
2: 99724
3: 99871
4: 99936
5: 50008
6: 49762
7: 50023
8: 50123
9: 49963
10: 50117
11: 50049
12: 49885
13: 49760
14: 50263

We can see that the numbers in the range [0, p] = [0, 4] appear about twice as often as the rest (about 100000 = 1000000 * 2/20 times each versus about 50000 = 1000000 * 1/20): each of 0..4 has two preimages under % 15 in the generator range [0, 19] (for example, both 0 and 15 map to 0), while each of 5..14 has only one. This is in accordance with our bias formula

bias = f_0 / f_1 = (k + 1) / k = 2 / 1

Example 2

int x = rand() % 6; // n + 1 = 6, rand() is uniform over [0, 19]

The test program is:

#include <iostream>
#include <random>
#include <vector>

int main()
{
    std::random_device rd;
    std::mt19937 mt(rd());
    std::uniform_int_distribution<> dist(0, 19);   // uniform over [0, 19]: N = 20 outcomes
    std::vector<int> v(6);                         // histogram of x = dist(mt) % 6
    const int runs = 1000 * 1000;
    for (int i = 0; i < runs; ++i)
    {
        ++v[dist(mt) % v.size()];                  // tally each residue
    }

    for (std::size_t i = 0; i < v.size(); ++i)     // size_t index avoids a signed/unsigned mismatch
    {
        std::cout << i << ": " << v[i] << "\n";
    }
}


result:

0: 199875
1: 199642
2: 149852
3: 149789
4: 150237
5: 150605

In this case we observe that the numbers in the range [0, p] = [0, 1] appear not twice as often as the rest, but in a ratio of about 200000/150000, i.e. 4/3: 0 and 1 each have four preimages under % 6 in the generator range [0, 19] (0, 6, 12, 18 and 1, 7, 13, 19), while 2..5 have three each. This matches our bias formula, since here k = 3:

bias = f_0 / f_1 = (k + 1) / k = 4 / 3




Answer 2:

The following answer does not go into as much detail as Eric Lippert's blog post on this topic; this question and its answers also cover it.

Most of the bias that comes from doing rand() % N isn't from the rand() part - it's from the % N part.

Let's consider a 'good' implementation of rand() that generates all numbers from 0 to 100 (picked for simplicity) with equal probability - a uniform distribution. Next, let's say that we want to use this implementation of rand() to generate random numbers between 0 and 79, so we do rand() % 80. Let's break down the possibilities of what can happen next:

  1. rand() generates a number from 0 to 79. Taking any such number % 80 leaves it unchanged.
  2. rand() generates a number from 80 to 100. Taking any such number % 80 maps it into the range 0 to 20.

This means that there are two ways to end up with a number from 0 to 20 but only one way to end up with a number from 21 to 79. Getting a number from 0 to 20 is therefore more likely than getting a number from 21 to 79. This is usually not a desirable property.

Any value of N that evenly divides the number of values rand() can produce won't have this problem, because every result can then be generated in the same number of ways. Furthermore, the bias is much smaller for small values of N than for values of N close to the maximum value of rand().

So, what about functions other than rand()? If they return values from some fixed range and you apply a modulo operation, they suffer from the same bias. If you call a random function that takes the desired range as an argument, you don't need the modulo operation, and the function will typically handle any bias internally (for example, arc4random_uniform() and std::uniform_int_distribution are specified to produce uniform results).
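For instance, with <random> you can ask for the target range directly instead of taking a modulus (a minimal sketch):

#include <iostream>
#include <random>

int main()
{
    std::random_device rd;
    std::mt19937 mt(rd());

    // The distribution maps the engine's output to [0, 79] without modulo bias.
    std::uniform_int_distribution<> dist(0, 79);
    std::cout << dist(mt) << "\n";
}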



Answer 3:

C++11 has solved this problem by adding alternative random generator engines and distributions.

The reason why using % (modulo) to constrain your random number to a range is bad has less to do with bias and more to do with the typical implementation of rand(): a linear congruential generator (LCG). Most language runtimes use LCGs for their random function; only recently designed languages tend to differ.

An LCG is just a multiply and an add (the modulo usually being implied by the integer's maximum size). It should be obvious that the low bits of such a sequence follow a regular pattern: the multiply doesn't mix higher bits into the lower bits, and the add mutates the low bits in the same constant way on every iteration.
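A quick sketch illustrates this with a toy power-of-two-modulus LCG (the constants are the well-known Numerical Recipes values; the example is ours):

#include <cstdint>
#include <iostream>

int main()
{
    // Toy LCG: x = a * x + c (mod 2^32). With an odd multiplier and an odd
    // increment, the lowest bit simply alternates 0, 1, 0, 1, ...
    std::uint32_t x = 12345;
    for (int i = 0; i < 8; ++i)
    {
        x = 1664525u * x + 1013904223u;
        std::cout << (x & 1u);   // prints the alternating pattern 01010101
    }
    std::cout << "\n";
}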

By understanding the different random generator engines (linear_congruential_engine, mersenne_twister_engine, subtract_with_carry_engine), you can find the best one for your application.

There is a very good reference to the new C++ implementations in Random Engines in C++11.

As said by @dyp, std::uniform_int_distribution is the option C++ gives you for random distributions. It takes care of the bias problem even if the underlying engine has one. But if you take a range of 0-19 and fold it into a 15-element array using the % operation, the bias problem is reintroduced, as discussed in many posts here.
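In other words, ask the distribution for the range you actually need; a minimal sketch contrasting the two approaches:

#include <iostream>
#include <random>

int main()
{
    std::random_device rd;
    std::mt19937 mt(rd());

    std::uniform_int_distribution<> dist19(0, 19);
    int biased = dist19(mt) % 15;   // reintroduces the bias: 0..4 appear twice as often

    std::uniform_int_distribution<> dist14(0, 14);
    int unbiased = dist14(mt);      // the library maps the engine output without bias

    std::cout << biased << " " << unbiased << "\n";
}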