64-bit unsigned integers which cannot map onto a d

Posted 2020-04-18 06:11

Question:

Are there any 64-bit unsigned integer values which cannot be represented with a double-precision floating-point type? (As a double is also 64-bit wide there must be some.) If so, how can I calculate all of them? (In a not brute force way, maybe?)

Answer 1:

Every integer from 0 to 2^53 inclusive is exactly representable. From 2^53 to 2^54 only every even integer is (least significant bit 0), from 2^54 to 2^55 only every fourth integer, and so on up to 2^64-2^11, the largest 64-bit unsigned value a double can hold exactly.

We can generalise this with a bit of code, taking m=52 (the number of explicitly stored significand bits):

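    /* Fragment: needs <stdio.h> and int m = 52. In range i, the i
       low-order bits of every representable value are zero. */
    for (int i = 0; i < 64 - m; i++) {
        unsigned long long start = i ? 1ULL << (i + m) : 0;    /* first value in the range */
        unsigned long long end = ((1ULL << (m + 1)) - 1) << i; /* last representable value */
        unsigned long long step = 1ULL << i;                   /* gap between neighbours */
        printf("%016llx to %016llx step %llu\n", start, end, step);
    }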

produces :

0000000000000000 to 001fffffffffffff step 1
0020000000000000 to 003ffffffffffffe step 2
0040000000000000 to 007ffffffffffffc step 4
0080000000000000 to 00fffffffffffff8 step 8
0100000000000000 to 01fffffffffffff0 step 16
0200000000000000 to 03ffffffffffffe0 step 32
0400000000000000 to 07ffffffffffffc0 step 64
0800000000000000 to 0fffffffffffff80 step 128
1000000000000000 to 1fffffffffffff00 step 256
2000000000000000 to 3ffffffffffffe00 step 512
4000000000000000 to 7ffffffffffffc00 step 1024
8000000000000000 to fffffffffffff800 step 2048
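
The ranges above also answer the "how many" part of the question without brute force: summing the number of values in each range gives the count of representable integers. A minimal sketch, assuming m=52 and the 12 ranges listed above; the function name `count_representable` is made up for illustration:

```c
#include <stdint.h>

/* Hypothetical helper: counts the 64-bit unsigned integers that are
   exactly representable when m significand bits are explicitly stored,
   by summing the sizes of the ranges listed above. */
uint64_t count_representable(int m)
{
    uint64_t total = 0;
    for (int i = 0; i < 64 - m; i++) {
        uint64_t start = i ? 1ULL << (i + m) : 0;
        uint64_t end = ((1ULL << (m + 1)) - 1) << i;
        /* number of multiples of 2^i from start to end, both included */
        total += ((end - start) >> i) + 1;
    }
    return total;
}
```

For m=52 this comes to 2^53 + 11 * 2^52 = 13 * 2^52 representable values, which leaves 2^64 - 13 * 2^52 = 4083 * 2^52 values that a double cannot hold exactly.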

Example :

Assigning 0x0020000000000000 to a double gives 9007199254740992.0 (0x4340000000000000 in IEEE 754)

Assigning 0x0020000000000001 to a double gives 9007199254740992.0 (same value)

Assigning 0x0020000000000002 to a double gives 9007199254740994.0 (0x4340000000000001, which is the next representable value)
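
The three assignments above can be checked with a round trip through double. This is only a sketch: it assumes double is IEEE 754 binary64 and that the compiler does not keep the intermediate value in a wider format (typical on x86-64). The name `survives_round_trip` is hypothetical.

```c
#include <stdint.h>

/* Returns 1 if x converts to double and back without changing. */
int survives_round_trip(uint64_t x)
{
    double d = (double)x;
    /* Values close to 2^64 round up to 2^64 itself, which does not fit
       back into a uint64_t, so treat that case as a failure up front. */
    if (d >= 18446744073709551616.0)
        return 0;
    return (uint64_t)d == x;
}
```

With this helper, 0x0020000000000000 and 0x0020000000000002 pass, while 0x0020000000000001 fails because it is rounded to its even neighbour.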



Answer 2:

An IEEE 754 double-precision value has a 53-bit significand (52 stored bits plus an implicit leading 1), so any 64-bit unsigned integer whose significant bits span more than 53 positions (i.e. the distance from the first 1 bit to the last 1 bit, counting both, is more than 53 bits) cannot be losslessly converted to double.
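
This criterion can be tested with integer operations alone, without any floating-point conversion. A minimal sketch using plain loops to locate the highest and lowest set bits; the function name `fits_in_double` is an assumption made for this example:

```c
#include <stdint.h>

/* Returns 1 if the span from the highest set bit to the lowest set bit
   of x (both included) is at most 53 bits, i.e. x fits a double exactly. */
int fits_in_double(uint64_t x)
{
    if (x == 0)
        return 1;               /* zero is always representable */
    int high = 63;
    while (!((x >> high) & 1))  /* position of the first 1 bit */
        high--;
    int low = 0;
    while (!((x >> low) & 1))   /* position of the last 1 bit */
        low++;
    return high - low + 1 <= 53;
}
```

Compiler intrinsics for counting leading and trailing zeros could replace the loops, but they are not portable, so the sketch avoids them.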



Answer 3:

If the bits of a 64-bit number, read from the most significant set bit downwards, have the following form:

a single "1" bit, followed by 52 A bits, followed by at least 1 B bit,

where A is any bit and at least one of the B bits is nonzero, then it cannot be represented as a double. (I am relying on the way bits are used for double, as shown in http://en.wikipedia.org/wiki/Double-precision_floating-point_format)
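
One way to read this pattern is that the B bits are whatever lies below the 53 bits that start at the leading 1 bit. Under that interpretation (an assumption, not something the answer states), a mask can extract the B bits directly. The name `dropped_bits` is hypothetical.

```c
#include <stdint.h>

/* Returns the B bits of x: the bits that lie below the 53-bit window
   starting at the leading 1 bit. A nonzero result means x cannot be
   represented exactly as a double. */
uint64_t dropped_bits(uint64_t x)
{
    if (x == 0)
        return 0;
    int high = 63;
    while (!((x >> high) & 1))  /* find the leading 1 bit */
        high--;
    if (high <= 52)
        return 0;               /* at most 53 bits in total: no B bits */
    uint64_t b_mask = (1ULL << (high - 52)) - 1;
    return x & b_mask;
}
```

For example, 0x0020000000000001 has a single B bit and it is set, so the function returns 1.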