Is 1.0 a valid output from std::generate_canonical

2020-01-27 12:27发布

问题:

I always thought random numbers would lie between zero and one, without 1, i.e. they are numbers from the half-open interval [0,1). The documention on cppreference.com of std::generate_canonical confirms this.

However, when I run the following program:

#include <iostream>
#include <limits>
#include <random>

int main()
{
    std::mt19937 rng;

    std::seed_seq sequence{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    rng.seed(sequence);
    rng.discard(12 * 629143 + 6);

    float random = std::generate_canonical<float,
                   std::numeric_limits<float>::digits>(rng);

    if (random == 1.0f)
    {
        std::cout << "Bug!\n";
    }

    return 0;
}

It gives me the following output:

Bug!

i.e. it generates me a perfect 1, which causes problems in my MC integration. Is that valid behavior or is there an error on my side? This gives the same output with G++ 4.7.3

g++ -std=c++11 test.c && ./a.out

and clang 3.3

clang++ -stdlib=libc++ -std=c++11 test.c && ./a.out

If this is correct behavior, how can I avoid 1?

Edit 1: G++ from git seems to suffer from the same problem. I am on

commit baf369d7a57fb4d0d5897b02549c3517bb8800fd
Date:   Mon Sep 1 08:26:51 2014 +0000

and compiling with ~/temp/prefix/bin/c++ -std=c++11 -Wl,-rpath,/home/cschwan/temp/prefix/lib64 test.c && ./a.out gives the same output, ldd yields

linux-vdso.so.1 (0x00007fff39d0d000)
libstdc++.so.6 => /home/cschwan/temp/prefix/lib64/libstdc++.so.6 (0x00007f123d785000)
libm.so.6 => /lib64/libm.so.6 (0x000000317ea00000)
libgcc_s.so.1 => /home/cschwan/temp/prefix/lib64/libgcc_s.so.1 (0x00007f123d54e000)
libc.so.6 => /lib64/libc.so.6 (0x000000317e600000)
/lib64/ld-linux-x86-64.so.2 (0x000000317e200000)

Edit 2: I reported the behavior here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63176

Edit 3: The clang team seems to be aware of the problem: http://llvm.org/bugs/show_bug.cgi?id=18767

回答1:

The problem is in mapping from the codomain of std::mt19937 (std::uint_fast32_t) to float; the algorithm described by the standard gives incorrect results (inconsistent with its description of the output of the algorithm) when loss of precision occurs if the current IEEE754 rounding mode is anything other than round-to-negative-infinity (note that the default is round-to-nearest).

The 7549723rd output of mt19937 with your seed is 4294967257 (0xffffffd9u), which when rounded to 32-bit float gives 0x1p+32, which is equal to the max value of mt19937, 4294967295 (0xffffffffu) when that is also rounded to 32-bit float.

The standard could ensure correct behavior if it were to specify that when converting from the output of the URNG to the RealType of generate_canonical, rounding is to be performed towards negative infinity; this would give a correct result in this case. As QOI, it would be good for libstdc++ to make this change.

With this change, 1.0 will no longer be generated; instead the boundary values 0x1.fffffep-N for 0 < N <= 8 will be generated more often (approximately 2^(8 - N - 32) per N, depending on the actual distribution of MT19937).

I would recommend to not use float with std::generate_canonical directly; rather generate the number in double and then round towards negative infinity:

    double rd = std::generate_canonical<double,
        std::numeric_limits<float>::digits>(rng);
    float rf = rd;
    if (rf > rd) {
      rf = std::nextafter(rf, -std::numeric_limits<float>::infinity());
    }

This problem can also occur with std::uniform_real_distribution<float>; the solution is the same, to specialize the distribution on double and round the result towards negative infinity in float.



回答2:

According to the standard, 1.0 is not valid.

C++11 §26.5.7.2 Function template generate_canonical

Each function instantiated from the template described in this section 26.5.7.2 maps the result of one or more invocations of a supplied uniform random number generator g to one member of the specified RealType such that, if the values gi produced by g are uniformly distributed, the instantiation’s results tj , 0 ≤ tj < 1, are distributed as uniformly as possible as specified below.



回答3:

I just ran into a similar question with uniform_real_distribution, and here's how I interpret the Standard's parsimonious wording on the subject:

The Standard always defines math functions in terms of math, never in terms of IEEE floating-point (because the Standard still pretends that floating-point might not mean IEEE floating point). So, any time you see mathematical wording in the Standard, it's talking about real math, not IEEE.

The Standard says that both uniform_real_distribution<T>(0,1)(g) and generate_canonical<T,1000>(g) should return values in the half-open range [0,1). But these are mathematical values. When you take a real number in the half-open range [0,1) and represent it as IEEE floating-point, well, a significant fraction of the time it will round up to T(1.0).

When T is float (24 mantissa bits), we expect to see uniform_real_distribution<float>(0,1)(g) == 1.0f about 1 in 2^25 times. My brute-force experimentation with libc++ confirms this expectation.

template<class F>
void test(long long N, const F& get_a_float) {
    int count = 0;
    for (long long i = 0; i < N; ++i) {
        float f = get_a_float();
        if (f == 1.0f) {
            ++count;
        }
    }
    printf("Expected %d '1.0' results; got %d in practice\n", (int)(N >> 25), count);
}

int main() {
    std::mt19937 g(std::random_device{}());
    auto N = (1uLL << 29);
    test(N, [&g]() { return std::uniform_real_distribution<float>(0,1)(g); });
    test(N, [&g]() { return std::generate_canonical<float, 32>(g); });
}

Example output:

Expected 16 '1.0' results; got 19 in practice
Expected 16 '1.0' results; got 11 in practice

When T is double (53 mantissa bits), we expect to see uniform_real_distribution<double>(0,1)(g) == 1.0 about 1 in 2^54 times. I don't have the patience to test this expectation. :)

My understanding is that this behavior is fine. It may offend our sense of "half-open-rangeness" that a distribution claiming to return numbers "less than 1.0" can in fact return numbers that are equal to 1.0; but those are two different meanings of "1.0", see? The first is the mathematical 1.0; the second is the IEEE single-precision floating-point number 1.0. And we've been taught for decades not to compare floating-point numbers for exact equality.

Whatever algorithm you feed the random numbers into isn't going to care if it sometimes gets exactly 1.0. There's nothing you can do with a floating-point number except mathematical operations, and as soon as you do some mathematical operation, your code will have to deal with rounding. Even if you could legitimately assume that generate_canonical<float,1000>(g) != 1.0f, you still wouldn't be able to assume that generate_canonical<float,1000>(g) + 1.0f != 2.0f — because of rounding. You just can't get away from it; so why would we pretend in this single instance that you can?



标签: c++ c++11 random