Modulo bias is a problem that arises when naively using the modulo operation to get pseudorandom numbers smaller than a given "upper bound".
Therefore as a C programmer I am using a modified version of the arc4random_uniform()
function to generate evenly distributed pseudorandom numbers.
The problem is I do not understand how the function works, mathematically.
This is the function's explanatory comment, followed by a link to the full source code:
/*
* Calculate a uniformly distributed random number less than upper_bound
* avoiding "modulo bias".
*
* Uniformity is achieved by generating new random numbers until the one
* returned is outside the range [0, 2**32 % upper_bound). This
* guarantees the selected random number will be inside
* [2**32 % upper_bound, 2**32) which maps back to [0, upper_bound)
* after reduction modulo upper_bound.
*/
http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/lib/libc/crypt/arc4random_uniform.c?rev=1.1&content-type=text/x-cvsweb-markup
From the comment above we can define:
[2^32 % upper_bound, 2^32)
- interval A
[0, upper_bound)
- interval B
In order to work, the function relies on the fact that interval A maps to interval B.
My question is: mathematically, how come the numbers in interval A map uniformly to the ones in interval B? And is there a proof for this?
Sometimes it helps to start with an easily understood example, and then generalize from there. To keep things simple, let's imagine that arc4random
returns a uint8_t
instead of a uint32_t
, so the output from arc4random
is a number in the interval [0,256)
. And let's choose an upper_bound
of 7.
Note that 7 does not divide evenly into 256
256 = 7 * 36 + 4
That means that naively using the modulo operation to get pseudorandom numbers smaller than 7 would result in the following probability distribution
37/256 for outcomes 0,1,2,3
36/256 for outcomes 4,5,6
That's what's known as modulo bias, outcomes 0,1,2,3 are more likely than outcomes 4,5,6.
To avoid modulo bias we could simply reject the values 252,253,254,255, and generate a new number until the result is in the interval [0,252)
. All the numbers in the interval [0,252)
have equal probability (rejecting higher numbers doesn't effect the distribution of the lower numbers). And since 7 divides evenly into 252, the resulting probability distribution is uniform
36/252 for outcomes 0,1,2,3,4,5,6,7
That's essentially what arc4random_uniform
does, except that arc4random_uniform
rejects numbers at the bottom of the range. Specifically, interval A would be
[2^8 % 7, 2^8) which is [4, 256)
After generating a number (call it N
) in the interval [4,256) the final calculation is
outcome = N % 7
There are 252 numbers in the interval [4,256), and since 252 is a multiple of 7, every outcome on the interval [0,7) has equal probability.
That's how arc4random_uniform works, it rejects/retries on a small range of numbers, and the count of numbers in the remaining range is a multiple of the upper_bound. (Since the upper_bound is typically a small number compared to 2^32, the odds of having multiple retries for a single outcome is quite small.)
But do you really care about modulo bias? In most cases, the answer is, "No". Consider our example with an upper bound of 7. The probability distribution for the naive modulo implementation is
613566757 / 4294967296 for outcomes 0,1,2,3
613566756 / 4294967296 for outcomes 4,5,6
which is a modulo bias of less than 0.0000002%.
So there's your choice: either spend a minuscule amount of time on retries to get a perfect distribution, or accept a minuscule error in the probability distribution to avoid the retries.