In some of the audio libraries I've been looking at, a sample of audio is often represented as a double or a float in the range -1.0 to 1.0. In some cases, this easily allows analysis and synthesis code to abstract away what the underlying datatype actually is (signed long int, unsigned char, etc.).
Assuming IEEE 754, we have non-uniform density: the representable values get denser as they approach zero. This means that we have less absolute precision for numbers approaching -1 and 1.
This non-uniform number density doesn't matter if we can represent a sufficient number of values for the underlying datatype that we're converting to/from.
For instance, if the underlying data type were an unsigned char, we only need 256 values (8 bits) between -1 and 1, so using a double is clearly not a problem.
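A quick way to see this non-uniform spacing is nextafter() from math.h, which returns the adjacent representable value. For instance:

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Gap between adjacent representable doubles at various magnitudes */
    printf("gap just above 0.0: %g\n", nextafter(0.0, 1.0));       /* ~4.9e-324 (denormal) */
    printf("gap just above 0.5: %g\n", nextafter(0.5, 1.0) - 0.5); /* 2^-53, ~1.1e-16 */
    printf("gap just below 1.0: %g\n", 1.0 - nextafter(1.0, 0.0)); /* 2^-53, ~1.1e-16 */
    printf("gap just above 1.0: %g\n", nextafter(1.0, 2.0) - 1.0); /* 2^-52, ~2.2e-16 */
    return 0;
}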
My question is: how many bits of precision do I have? Can I safely convert to/from a 32-bit integer without loss? To extend the question: what would the range of values have to be to safely convert to/from a 32-bit integer without loss?
Thanks!
For IEEE doubles, you have a 53-bit mantissa, which is enough to represent 32-bit integers considered as fixed-point numbers between -1 (0x80000000) and 1 - 2^-31 (0x7FFFFFFF).
Floats have 24-bit mantissas, which is not enough.
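To sanity-check both claims, here is a small round-trip test (my sketch, scaling by 2^31 so that 0x80000000 maps to -1 as above). The sample values are chosen to stress the 24-bit float mantissa and the 53-bit double mantissa:

#include <stdio.h>
#include <stdint.h>
#include <math.h>

int main(void) {
    /* (1<<24)+1 is the first integer a float cannot hold exactly */
    int32_t samples[] = { INT32_MIN, INT32_MIN + 1, -1, 0, 1,
                          (1 << 24), (1 << 24) + 1, INT32_MAX };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        int32_t s = samples[i];
        double d = (double)s / 2147483648.0;  /* scale to [-1, 1) by 2^31 */
        float  f = (float)s  / 2147483648.0f;
        long long back_d = llround(d * 2147483648.0);
        long long back_f = llround((double)f * 2147483648.0);
        printf("%11ld -> double %-4s float %s\n", (long)s,
               back_d == s ? "ok" : "LOST", back_f == s ? "ok" : "LOST");
    }
    return 0;
}

Every value survives the double round trip; the float path loses values whose magnitude needs more than 24 significant bits.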
As Alexandre C. explains, IEEE doubles have a 53-bit mantissa (52 bits stored, top bit implied), and floats have 24 bits (23 bits stored, top bit implied).
Edit: (Thanks for feedback, I hope this is clearer)
When an integer is converted to a double, for example

double f = (double)1024;

the number is held with an appropriate exponent (1023+10), and effectively the same bit pattern as the original integer is stored. (Strictly, IEEE binary floating point does not store the top bit: IEEE floating-point numbers are 'normalised' to have the top bit = 1 by adjusting the exponent, and then that leading 1 is trimmed off because it is 'implied', which saves a bit of storage.)
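You can see that stored exponent directly (a minimal sketch using memcpy to view the bits; the fuller program below does the same job with bit-fields):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    double f = (double)1024;                 /* 1024 = 1.0 * 2^10 */
    uint64_t bits;
    memcpy(&bits, &f, sizeof bits);          /* portable way to view the bit pattern */
    printf("bits: %016llX\n", (unsigned long long)bits);  /* 4090000000000000 */
    printf("stored exponent: %llu\n",
           (unsigned long long)((bits >> 52) & 0x7FF));   /* 1033 = 1023 + 10 */
    return 0;
}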
A 32-bit integer requires a double to hold its value perfectly, and an 8-bit integer will be held perfectly in a float. There is no loss of information there: it can be converted back to an integer without loss. Loss happens with arithmetic, and with fractional values.
The integer is not mapped to +/-1 unless code does it. When code divides that 32-bit integer, stored as a double, to map it to the range +/-1, error may well be introduced: dividing by 2^31 only changes the exponent and is exact, but dividing by a non-power-of-two such as 2^31 - 1 generally is not.
That mapping to +/-1 may lose some of the 53-bit precision, but the error will only be in the lowest bits, well below the 32 bits needed for the original integer. Subsequent operations might also lose precision. For example, multiplying two numbers whose result needs more than 53 bits of precision will lose some bits (i.e. multiplying two numbers with mantissas of more than 27 significant bits each).
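Here is a sketch of that mapping and the round trip (my example value): dividing by 2^31 is exact, dividing by 2^31 - 1 leaves a tiny residual in the low bits, and rounding recovers the integer in both cases:

#include <stdio.h>
#include <math.h>

int main(void) {
    long i = 1234567891;                         /* arbitrary 32-bit sample value */
    double exact  = (double)i / 2147483648.0;    /* 2^31: exponent change only, exact */
    double approx = (double)i / 2147483647.0;    /* 2^31 - 1: generally inexact */
    /* The error is far below the 32-bit level, so rounding recovers the integer */
    printf("back via 2^31:     %lld\n", llround(exact  * 2147483648.0));
    printf("back via 2^31 - 1: %lld\n", llround(approx * 2147483647.0));
    printf("residual error:    %g\n", approx * 2147483647.0 - (double)i);
    return 0;
}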
An explanation of floating point which might be helpful is "What Every Computer Scientist Should Know About Floating-Point Arithmetic". It explains some of the counter-intuitive (to me) behaviour of floating-point numbers.
For example, the number 0.1 cannot be held exactly in an IEEE binary floating-point double.
This program might help you see what is happening:
/* Demonstrate IEEE 'double' encoding on x86
 * Show bit patterns and 'printf' output for double values
 * Show error representing 0.1, and accumulated error of adding 0.1 many times
 * G Bulmer 2012
 */

#include <stdio.h>
#include <math.h>   /* for lround */

typedef struct {
    unsigned long long mantissa :52;
    unsigned exponent :11;
    unsigned sign :1;
} double_bits;

const unsigned exponent_offset = 1023;

typedef union { double d; unsigned long long l; double_bits b; } Xlate;

void print_xlate(Xlate val) {
    const long long IMPLIED = (1LL<<52);
    if (val.b.exponent == 0) { /* zero or denormal? */
        printf("val: d: %19lf bits: %016llX [sign: %u exponent: zero=%u mantissa: %llX]\n",
               val.d, val.l, val.b.sign, val.b.exponent, val.b.mantissa);
    } else {
        printf("val: d: %19lf bits: %016llX [sign: %u exponent: 2^%d mantissa: %llX]\n",
               val.d, val.l, val.b.sign, ((int)val.b.exponent)-(int)exponent_offset,
               (IMPLIED|val.b.mantissa));
    }
}

double add_many(double d, int many) {
    double accum = 0.0;
    while (many-- > 0) { /* only works for +d */
        accum += d;
    }
    return accum;
}

int main(int argc, const char * argv[]) {
    Xlate val;
    val.b.sign = 0;                     /* hand-build +2.0: exponent 2^1, mantissa 0 */
    val.b.exponent = exponent_offset+1;
    val.b.mantissa = 0;
    print_xlate(val);
    val.d = 1.0; print_xlate(val);
    val.d = 0.0; print_xlate(val);
    val.d = -1.0; print_xlate(val);
    val.d = 3.0; print_xlate(val);
    val.d = 7.0; print_xlate(val);
    val.d = (double)((1LL<<31)-1LL); print_xlate(val);
    val.d = 2147483647.0; print_xlate(val);
    val.d = 10000.0; print_xlate(val);
    val.d = 100000.0; print_xlate(val);
    val.d = 1000000.0; print_xlate(val);
    val.d = 0.1; print_xlate(val);
    val.d = add_many(0.1, 100000); print_xlate(val);
    val.d = add_many(0.1, 1000000); print_xlate(val);
    val.d = add_many(0.1, 10000000); print_xlate(val);
    val.d = add_many(0.1, 10); print_xlate(val);
    val.d *= 2147483647.0; print_xlate(val);
    int i = val.d; printf("int i=truncate(d)=%d\n", i);
    int j = (int)lround(val.d); printf("int j=lround(d)=%d\n", j);
    val.d = add_many(0.0001, 1000) - 0.1; print_xlate(val);
    return 0;
}
The output is:
val: d: 2.000000 bits: 4000000000000000 [sign: 0 exponent: 2^1 mantissa: 10000000000000]
val: d: 1.000000 bits: 3FF0000000000000 [sign: 0 exponent: 2^0 mantissa: 10000000000000]
val: d: 0.000000 bits: 0000000000000000 [sign: 0 exponent: zero=0 mantissa: 0]
val: d: -1.000000 bits: BFF0000000000000 [sign: 1 exponent: 2^0 mantissa: 10000000000000]
val: d: 3.000000 bits: 4008000000000000 [sign: 0 exponent: 2^1 mantissa: 18000000000000]
val: d: 7.000000 bits: 401C000000000000 [sign: 0 exponent: 2^2 mantissa: 1C000000000000]
val: d: 2147483647.000000 bits: 41DFFFFFFFC00000 [sign: 0 exponent: 2^30 mantissa: 1FFFFFFFC00000]
val: d: 2147483647.000000 bits: 41DFFFFFFFC00000 [sign: 0 exponent: 2^30 mantissa: 1FFFFFFFC00000]
val: d: 10000.000000 bits: 40C3880000000000 [sign: 0 exponent: 2^13 mantissa: 13880000000000]
val: d: 100000.000000 bits: 40F86A0000000000 [sign: 0 exponent: 2^16 mantissa: 186A0000000000]
val: d: 1000000.000000 bits: 412E848000000000 [sign: 0 exponent: 2^19 mantissa: 1E848000000000]
val: d: 0.100000 bits: 3FB999999999999A [sign: 0 exponent: 2^-4 mantissa: 1999999999999A]
val: d: 10000.000000 bits: 40C388000000287A [sign: 0 exponent: 2^13 mantissa: 1388000000287A]
val: d: 100000.000001 bits: 40F86A00000165CB [sign: 0 exponent: 2^16 mantissa: 186A00000165CB]
val: d: 999999.999839 bits: 412E847FFFEAE4E9 [sign: 0 exponent: 2^19 mantissa: 1E847FFFEAE4E9]
val: d: 1.000000 bits: 3FEFFFFFFFFFFFFF [sign: 0 exponent: 2^-1 mantissa: 1FFFFFFFFFFFFF]
val: d: 2147483647.000000 bits: 41DFFFFFFFBFFFFF [sign: 0 exponent: 2^30 mantissa: 1FFFFFFFBFFFFF]
int i=truncate(d)=2147483646
int j=lround(d)=2147483647
val: d: 0.000000 bits: 3CE0800000000000 [sign: 0 exponent: 2^-49 mantissa: 10800000000000]
That shows a full 32-bit int is represented exactly, and 0.1 is not. It shows that printf does not print the floating-point number exactly but rounds or truncates (a thing to be wary of). It also illustrates that the error in that representation of 0.1 doesn't accumulate to a value large enough for printf to show over 100,000 add operations, though it does after 1,000,000. It shows that the original integer can be recovered by rounding, but not by assignment, because assignment truncates. It shows that subtraction can 'amplify' error (all that is left after that subtraction is error), and hence arithmetic should be carefully analysed.
To put this into the context of music, where the sample rate might be 96kHz: it would take more than 10 seconds of additions before the error had built up enough to make the top 32 bits contain more than 1 bit of error.
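As a rough check of that claim (my arithmetic: each addition contributes at most about 2^-53 of relative error, so 10 seconds at 96kHz, i.e. 960,000 additions, can accumulate at worst roughly 960000 x 2^-53, about 1.1e-10, which is still under 2^-32, about 2.3e-10):

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Accumulate 0.1 for 10 simulated seconds at 96kHz */
    const long N = 96000L * 10;
    double accum = 0.0;
    for (long n = 0; n < N; n++) accum += 0.1;
    /* N is a multiple of 10, so the intended sum, 96000.0, is exactly representable */
    double rel_err = fabs(accum - 96000.0) / 96000.0;
    printf("relative error after %ld adds: %g (2^-32 is about %g)\n",
           N, rel_err, ldexp(1.0, -32));
    return 0;
}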
Further, Christopher "Monty" Montgomery, who created Ogg and Vorbis, argues that 24 bits should be more than enough for audio, in an article on music, sampling rate and sample resolution: 24/192 Music Downloads ...and why they make no sense.
Summary
double holds 32-bit integers perfectly. There are rational numbers of the form N/M (where N and M can be represented by a 32-bit integer) which cannot be represented by a finite sequence of binary-fraction bits. So, when an integer is mapped to the range +/-1, and hence converted to a rational number N/M, some values cannot be represented by the finite number of bits in a double's fractional part, and errors will creep in.
Those errors are typically very small, in the lowest bits, and hence well below the upper 32 bits. So values can be converted back and forth between integer and double using rounding, and the error of the double representation will not cause the integer to be wrong. But arithmetic can change the error: badly constructed arithmetic can cause the errors to grow rapidly, and they could grow to a magnitude where the original integer value has been corrupted.
Other thoughts:
If precision is critical, there are other ways you might use doubles. None of them is as convenient as mapping to +/-1. Everything I can think of would require the arithmetic operations to be tracked, which would best be done using C++ wrapper classes. This could dramatically slow calculation, so it may be pointless.
'Automatic Differentiation' is a very sneaky technique along those lines: it wraps arithmetic in classes which keep track of extra information. I think the ideas behind it might inspire an approach. It might even help identify where precision is lost.
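In the same spirit of carrying extra information through each operation, Kahan (compensated) summation is a small, concrete example of tracking error (my sketch, in C rather than the C++ wrappers mentioned above; kahan_add_many is my name, echoing add_many from the program):

#include <stdio.h>

/* Kahan (compensated) summation: 'comp' carries the low-order bits
 * that a plain '+=' would discard on each addition. */
double kahan_add_many(double d, long many) {
    double sum = 0.0, comp = 0.0;
    while (many-- > 0) {
        double y = d - comp;     /* re-inject the previously lost bits */
        double t = sum + y;      /* y's low bits may be lost here... */
        comp = (t - sum) - y;    /* ...and are recovered here */
        sum = t;
    }
    return sum;
}

int main(void) {
    double naive = 0.0;
    for (long i = 0; i < 1000000; i++) naive += 0.1;
    printf("naive: %.9f\n", naive);                        /* drifts away from 100000 */
    printf("kahan: %.9f\n", kahan_add_many(0.1, 1000000)); /* stays at 100000.000000000 */
    return 0;
}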