Uniform distribution of integers using floating po

2020-03-04 03:24发布

问题:

The standard way to get a random integer in the range [0, n) in JavaScript - or any other language that only offers a random() function that returns a float in the range [0,1) - is to use Math.floor(Math.random() * n).

Now the math behind this is trivial assuming we're operating on the set of rational numbers. The question is: With all the complications of IEEE-754 floating point numbers is the resulting distribution actually really uniform?

Considering that the gap between one floating point number and the next higher one increases as they grow larger I would think that this should introduce some kind of bias towards smaller numbers.

回答1:

No, the resulting distribution is not going to be perfectly uniform, for most values of n. For small values, it'll be so close to uniform that you'd have a hard time detecting any difference from a uniform distribution, but as n gets larger the bias can become noticeable.

To illustrate, here's some Python code (not JavaScript, sorry, but the principle is the same):

from collections import Counter
from random import random

def badrand(n):
    return int(random() * n)

print(Counter(badrand(6755399441055744) % 3 for _ in range(10000000)))

This is producing 10 million random integers in the range [0, 6755399441055744), reducing each of those integers modulo 3, and counting the number of times the remainder is 0, 1, or 2. If we're generating those integers uniformly, we'd expect the remainders modulo 3 to be roughly evenly distributed, so we'd expect the counts to be similar.

Here's an example result from running this on my machine:

Counter({1: 3751915, 0: 3334643, 2: 2913442})

That is, a remainder of 1 is significantly more likely to occur than 0, which in turn is significantly more likely to occur than a remainder of 2. The differences here are way too big to be explained by random variation.

So what went wrong? Python's random() function is relatively high quality, based on the Mersenne Twister, so we're unlikely to be seeing statistical problems resulting from the base random number generator. What's happening is that random() generates one of 2^53 (roughly) equally likely outcomes - each outcome is a number of the form x / 2^53 for some integer x in the range [0, 2^53). Now in the badrand call, we're effectively mapping those outcomes to 6755399441055744 possible outputs. Now that value wasn't chosen at random (ha!); it's exactly 3/4 of 2^53. That means that under the most uniform distribution possible, 2/3 of the possible badrand output values are being hit by exactly one of the 2^53 possible random() output values, while the other 1/3 are being hit by two of the 2^53 possible random() output values. That is, some of the potential outputs are twice as likely to occur as others. So we're a long way from uniform.

You're going to see the same effect in JavaScript. In the case of Chrome, it appears that there are only 2^32 distinct results from Math.random(), so you should be able to find effects like the above with n smaller than (but close to) 2^32.

Of course, the same effect holds for small n, too: if n = 5, then because 5 is not a divisor of 2^32 there's no way we can perfectly evenly distribute all 2^32 possible Math.random() results between the 5 desired outcomes: the best we can hope for is that 4 of the 5 outcomes appear for 858993459 of the possible random() results each, while the fifth occurs for 858993460 of the random() results. But that distribution is going to be so close to uniform that it would be well-nigh impossible to find any statistical test to tell you differently. So for practical purposes, you should be safe with small n.

There's a related Python bug that might be interesting at http://bugs.python.org/issue9025. That bug was solved for Python 3 by moving away from the int(random() * n) method of computing these numbers. The bug still remains in Python 2, though.



回答2:

If Math.random (or equivalent) generated a uniformly-distributed bit pattern out of those bit patterns corresponding to floating point numbers in the range [0, 1), then it would produce an extremely biased sample. There are as many representable floating point number in [0.25, 0.5) as there are in [0.5, 1.0), which is also the same number of representable values in [0.125, 0.25). And so on. In short, the uniformly-distributed bit patterns would result in only one out of a thousand values being between 0.5 and 1.0. (assuming double-precision floating point numbers.)

Fortunately, that's not what Math.random does. One simple way of getting a uniformly distributed number (rather than bit pattern) is to generate a uniformly distributed bit pattern in [1.0, 2.0), and then subtract 1.0; that's a fairly common strategy.

Regardless, the end result of Math.floor(Math.random() * n) is not quite uniformly distributed unless n is a power of 2, because of quantification bias. The number of possible floating point values which could be returned by Math.random is a power of 2, and if n is not a power of 2, then it is impossible to distribute the possible floating point values precisely evenly over all values of integers in [0, n). If Math.random returns a double-precision floating pointer number and n is not huge, this bias is small, but it certainly exists.



回答3:

According to http://es5.github.io/x15.8.html#x15.8.2.14

the functionality of Math.random

Returns a Number value with positive sign, greater than or equal to 0 but less than 1, chosen randomly or pseudo randomly with approximately uniform distribution over that range, using an implementation-dependent algorithm or strategy. This function takes no arguments.

check out this post: https://stats.stackexchange.com/questions/40384/fake-uniform-random-numbers-more-evenly-distributed-than-true-uniform-data

this has become above my head, sorry i have nothing left to contribute



回答4:

Assuming random() is returning a number between 0..1.

If the result is a single precision float than thats only 23bits of entropy based on the mantissa.

If the result is a double precision float than thats only 52bits of entropy based on the mantissa.

So floor(random() * N) would only be uniform where N is less than 2^24 or 2^53.

EDIT Here's some info on largest consecutive integer for floating point http://www.mathworks.com/help/matlab/ref/flintmax.html



回答5:

I assume that your remark that "the gap between one floating point number and the next higher one increases as they grow larger" is based on the following:

in IEEE-754 you have a fixed-size mantissa, which allows for uniform "random" values in the range [1,2) say, and there's an equal number of possible values in [2,4) which is twice as large a range, so we get a 2-fold spacing between the possible values, again twice as large for [4,8), etcetera.

Now, I haven't checked out the technical details behind ".., using an implementation-dependent algorithm or strategy", when they talk about the properties of the random numbers generated for [0,1), but since the above consideration is so trivial, I do assume that the random-generator programmers have been aware of this and took care of it with an "implementation-dependent algo...".

Therefore, being a naive guy, I do believe that there's nothing to worry about with respect to (my assumption of) your reason for suspicion. In fact, I might take it that, if you can generate uniform and random values for the mantissa, then setting always the same exponent, such that the values belong to [1,2), you subtract 1 from everything and have an appropriate distribution for [0,1).