Modulo bias is a problem that arises when naively using the modulo operation to get pseudorandom numbers smaller than a given "upper bound".
Therefore as a C programmer I am using a modified version of the arc4random_uniform()
function to generate evenly distributed pseudorandom numbers.
The problem is I do not understand how the function works, mathematically.
This is the function's explanatory comment, followed by a link to the full source code:
/*
* Calculate a uniformly distributed random number less than upper_bound
* avoiding "modulo bias".
*
* Uniformity is achieved by generating new random numbers until the one
* returned is outside the range [0, 2**32 % upper_bound). This
* guarantees the selected random number will be inside
* [2**32 % upper_bound, 2**32) which maps back to [0, upper_bound)
* after reduction modulo upper_bound.
*/
From the comment above we can define:
[2^32 % upper_bound, 2^32)
- interval A[0, upper_bound)
- interval B
In order to work, the function relies on the fact that interval A maps to interval B.
My question is: mathematically, how come the numbers in interval A map uniformly to the ones in interval B? And is there a proof for this?
Sometimes it helps to start with an easily understood example, and then generalize from there. To keep things simple, let's imagine that
arc4random
returns auint8_t
instead of auint32_t
, so the output fromarc4random
is a number in the interval[0,256)
. And let's choose anupper_bound
of 7.Note that 7 does not divide evenly into 256
That means that naively using the modulo operation to get pseudorandom numbers smaller than 7 would result in the following probability distribution
That's what's known as modulo bias, outcomes 0,1,2,3 are more likely than outcomes 4,5,6.
To avoid modulo bias we could simply reject the values 252,253,254,255, and generate a new number until the result is in the interval
[0,252)
. All the numbers in the interval[0,252)
have equal probability (rejecting higher numbers doesn't effect the distribution of the lower numbers). And since 7 divides evenly into 252, the resulting probability distribution is uniformThat's essentially what
arc4random_uniform
does, except thatarc4random_uniform
rejects numbers at the bottom of the range. Specifically, interval A would beAfter generating a number (call it
N
) in the interval [4,256) the final calculation isThere are 252 numbers in the interval [4,256), and since 252 is a multiple of 7, every outcome on the interval [0,7) has equal probability.
That's how arc4random_uniform works, it rejects/retries on a small range of numbers, and the count of numbers in the remaining range is a multiple of the upper_bound. (Since the upper_bound is typically a small number compared to 2^32, the odds of having multiple retries for a single outcome is quite small.)
But do you really care about modulo bias? In most cases, the answer is, "No". Consider our example with an upper bound of 7. The probability distribution for the naive modulo implementation is
which is a modulo bias of less than 0.0000002%.
So there's your choice: either spend a minuscule amount of time on retries to get a perfect distribution, or accept a minuscule error in the probability distribution to avoid the retries.