I have seen this question asked a lot but never seen a true concrete answer to it. So I am going to post one here which will hopefully help people understand why exactly there is "modulo bias" when using a random number generator, like `rand()` in C++.
There are two usual complaints with the use of modulo.
The first is valid for all generators. It is easiest to see in a limit case. If your generator has a `RAND_MAX` of 2 (which isn't compliant with the C standard) and you want only 0 or 1 as values, using modulo will generate 0 twice as often (when the generator produces 0 or 2) as it will generate 1 (when the generator produces 1). Note that this is true as soon as you don't drop values: whatever mapping you use from the generator's values to the wanted ones, one value will occur twice as often as the other.
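To make the counting concrete, here is a tiny sketch (the helper name is mine) that enumerates every output of such a `RAND_MAX == 2` generator:

```cpp
#include <array>

// Enumerate every output of the toy generator (0, 1, 2) and count
// how often each value of g % 2 appears.
std::array<int, 2> parity_counts()
{
    std::array<int, 2> counts{0, 0};
    for (int g = 0; g <= 2; ++g)
        ++counts[g % 2];
    return counts;  // {2, 1}: 0 appears twice as often as 1
}
```

No matter how you map three equally likely inputs onto two outputs without rejecting any, one output must absorb two of them.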
The second: some kinds of generators have less significant bits that are less random than the others, at least for some of their parameters, but sadly those parameters have other interesting characteristics (such as allowing `RAND_MAX` to be one less than a power of 2). The problem is well known, and library implementations have long avoided it (for instance, the sample `rand()` implementation in the C standard uses this kind of generator but drops the 16 least significant bits), but some like to complain about that and you may have bad luck.
Using a rejection-based reduction to generate a random number between 0 and n will avoid both problems (and it can avoid overflow even when `RAND_MAX == INT_MAX`).
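One way to write such a reduction (a sketch; the function name `alea` and the inclusive upper bound `n` are assumptions of mine, not necessarily the author's exact code):

```cpp
#include <cassert>
#include <cstdlib>

// Sketch: uniform value in [0, n] (inclusive) built on rand().
// Rejection plus division by a bucket size keeps the high-order bits,
// and nothing here ever forms RAND_MAX + 1, so there is no overflow
// even when RAND_MAX == INT_MAX.
int alea(int n)
{
    assert(0 < n && n <= RAND_MAX);
    // Split [0, RAND_MAX] into n+1 buckets of equal size partSize.
    int partSize = n == RAND_MAX ? 1 : 1 + (RAND_MAX - n) / (n + 1);
    int maxUseful = partSize * n + (partSize - 1);
    int draw;
    do {
        draw = rand();
    } while (draw > maxUseful);  // reject draws past the last full bucket
    return draw / partSize;
}
```

Dividing by `partSize` uses the high-order bits, and since `partSize * (n + 1) - 1 <= RAND_MAX` by construction, the arithmetic stays in range.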
BTW, C++11 introduced standard ways to do the reduction, as well as generators other than `rand()`.
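For example, a sketch of the C++11 approach (the helper name is mine):

```cpp
#include <random>

// C++11: a real engine plus a distribution object that performs the
// range reduction correctly, replacing rand() % n.
int roll_0_to_2(std::mt19937 &gen)
{
    std::uniform_int_distribution<int> dist(0, 2);  // bounds are inclusive
    return dist(gen);
}
```

`std::uniform_int_distribution` takes inclusive bounds and handles the reduction internally, so no modulo bias is introduced.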
@user1413793 is correct about the problem. I'm not going to discuss that further, except to make one point: yes, for small values of `n` and large values of `RAND_MAX`, the modulo bias can be very small. But using a bias-inducing pattern means that you must consider the bias every time you calculate a random number and choose different patterns for different cases. And if you make the wrong choice, the bugs it introduces are subtle and almost impossible to unit test. Compared to just using the proper tool (such as `arc4random_uniform`), that's extra work, not less work. Doing more work and getting a worse solution is terrible engineering, especially when doing it right every time is easy on most platforms.

Unfortunately, the implementations of the solution are all incorrect or less efficient than they should be. (Each solution has various comments explaining the problems, but none of the solutions have been fixed to address them.) This is likely to confuse the casual answer-seeker, so I'm providing a known-good implementation here.
Again, the best solution is just to use `arc4random_uniform` on platforms that provide it, or a similar ranged solution for your platform (such as `Random.nextInt` on Java). It will do the right thing at no code cost to you. This is almost always the correct call to make.

If you don't have `arc4random_uniform`, then you can use the power of open source to see exactly how it is implemented on top of a wider-range RNG (`arc4random` in this case, but a similar approach could also work on top of other RNGs).

Here is the OpenBSD implementation:
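A self-contained paraphrase of that technique (a sketch, not the verbatim OpenBSD source; the `std::mt19937` stand-in for `arc4random()` is my assumption so the code runs anywhere):

```cpp
#include <cstdint>
#include <random>

// Stand-in for arc4random(): any uniform 32-bit generator suffices to
// illustrate the technique (OpenBSD's real code calls arc4random()).
static std::mt19937 rng{12345};
static uint32_t arc4random_stub() { return rng(); }

// Reject the first (2^32 % upper_bound) values so the remaining range
// is an exact multiple of upper_bound, then take the modulus.
uint32_t uniform_below(uint32_t upper_bound)
{
    if (upper_bound < 2)
        return 0;
    // 2^32 % upper_bound == (2^32 - upper_bound) % upper_bound,
    // computable in 32 bits via unsigned wraparound.
    uint32_t min = -upper_bound % upper_bound;
    uint32_t r;
    do {
        r = arc4random_stub();
    } while (r < min);
    return r % upper_bound;
}
```

The rejection discards only `2^32 % upper_bound` values, so each draw is accepted with probability greater than 1/2 even in the worst case.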
It is worth noting the latest commit comment on this code for those who need to implement similar things:
The Java implementation is also easily findable (see previous link):
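I have not reproduced the Java source here; as a sketch, the rejection logic of `java.util.Random.nextInt(bound)` transcribed into C++ (the names and the 31-bit stand-in generator are mine) looks roughly like:

```cpp
#include <cstdint>
#include <random>

// Sketch of the rejection loop used by java.util.Random.nextInt(bound):
// draw 31 random bits, reduce with %, and retry whenever the draw fell
// into the truncated final bucket of size (2^31 % bound).
int next_int(std::mt19937 &gen, int bound)
{
    for (;;) {
        int bits = static_cast<int>(gen() >> 1);  // nonnegative 31-bit value
        int val = bits % bound;
        // Java detects the bad bucket via signed overflow of
        // bits - val + (bound - 1); the same test written without overflow:
        if (bits - val <= INT32_MAX - (bound - 1))
            return val;
    }
}
```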
So `rand()` is a pseudo-random number generator which chooses a natural number between 0 and `RAND_MAX`, which is a constant defined in `cstdlib` (see this article for a general overview on `rand()`).

Now what happens if you want to generate a random number between, say, 0 and 2? For the sake of explanation, let's say `RAND_MAX` is 10 and I decide to generate a random number between 0 and 2 by calling `rand()%3`. However, `rand()%3` does not produce the numbers between 0 and 2 with equal probability!

When `rand()` returns 0, 3, 6, or 9, `rand()%3 == 0`. Therefore, P(0) = 4/11.

When `rand()` returns 1, 4, 7, or 10, `rand()%3 == 1`. Therefore, P(1) = 4/11.

When `rand()` returns 2, 5, or 8, `rand()%3 == 2`. Therefore, P(2) = 3/11.

This does not generate the numbers between 0 and 2 with equal probability. Of course for small ranges this might not be the biggest issue, but for a larger range this could skew the distribution, biasing the smaller numbers.
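The arithmetic above can be checked mechanically; this sketch (the helper name is mine) enumerates all eleven outputs of the toy generator:

```cpp
#include <array>

// Toy model from the text: RAND_MAX == 10, so rand() can return 0..10.
// Count how many generator outputs map to each residue mod 3.
std::array<int, 3> residue_counts()
{
    std::array<int, 3> counts{0, 0, 0};
    for (int r = 0; r <= 10; ++r)
        ++counts[r % 3];
    return counts;  // {4, 4, 3}: P(0) = 4/11, P(1) = 4/11, P(2) = 3/11
}
```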
So when does `rand()%n` return a range of numbers from 0 to n-1 with equal probability? When `RAND_MAX%n == n - 1`. In this case, along with our earlier assumption that `rand()` does return a number between 0 and `RAND_MAX` with equal probability, the modulo classes of n would also be equally distributed.

So how do we solve this problem? A crude way is to keep generating random numbers until you get a number in your desired range:
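A sketch of that crude approach (the function name is mine; `n` is the exclusive upper bound):

```cpp
#include <cstdlib>

// Keep drawing until the value already lies in [0, n); unbiased but
// wasteful when n is much smaller than RAND_MAX.
int crude_uniform(int n)
{
    int x;
    do {
        x = rand();
    } while (x >= n);
    return x;
}
```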
but that's inefficient for low values of `n`, since you only have an `n/RAND_MAX` chance of getting a value in your range, and so you'll need to perform `RAND_MAX/n` calls to `rand()` on average.

A more efficient approach would be to take some large range with a length divisible by `n`, like `RAND_MAX - RAND_MAX % n`
, keep generating random numbers until you get one that lies in the range, and then take the modulus. For small values of `n`, this will rarely require more than one call to `rand()`.
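A sketch of that more efficient approach (the function name is mine):

```cpp
#include <cstdlib>

// Accept only draws below RAND_MAX - RAND_MAX % n, a range whose length
// is an exact multiple of n, then fold with the modulus; every residue
// class is now equally likely.
int uniform_mod(int n)
{
    int x;
    do {
        x = rand();
    } while (x >= RAND_MAX - RAND_MAX % n);
    return x % n;
}
```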
Works cited and further reading:
CPlusPlus Reference
Eternally Confuzzled