How do I truncate the significand of a floating po

This question already has an answer here:

Efficient way to round double precision numbers to a lower precision given in number of bits 2 answers

I would like to introduce some artificial precision loss into two numbers being compared to smooth out minor rounding errors so that I don't have to use the Math.abs(x - y) < eps idiom in every comparison involving x and y.

Essentially, I want something that behaves similarly to down-casting a double to a float and then up-casting it back to a double, except I want to also preserve very large and very small exponents and I want some control over the number of significand bits preserved.
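To make the limitation concrete, here is a small sketch of what the float round trip preserves and what it loses. The class name FloatRoundTrip, the helper viaFloat, and the sample values are illustrative only.

```java
public class FloatRoundTrip {
    // Narrow to float and widen back: keeps float's 24-bit precision
    // (23 stored bits), but also restricts the value to float's exponent range.
    static double viaFloat(double x) {
        return (double) (float) x;
    }

    public static void main(String[] args) {
        System.out.println(viaFloat(0.1));    // low-order significand bits are rounded away
        System.out.println(viaFloat(1e300));  // too large for a float: Infinity
        System.out.println(viaFloat(1e-300)); // too small for a float: 0.0
    }
}
```

The last two lines show the exponent problem: the cast destroys values that a double represents without trouble.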

Given the following function that produces the binary representation of the significand of a 64-bit IEEE 754 number:

public static String significand(double d) {
    int SIGN_WIDTH = 1;
    int EXP_WIDTH = 11;
    // Left-pad the raw bit pattern to 64 characters.
    String s = String.format("%64s", Long.toBinaryString(Double.doubleToRawLongBits(d))).replace(' ', '0');
    // Skip the sign and exponent fields; the remaining 52 characters are the stored significand bits.
    return s.substring(SIGN_WIDTH + EXP_WIDTH);
}

I want a function reducePrecision(double x, int bits) that reduces the precision of the significand of a double such that:

significand(reducePrecision(x, bits)).substring(bits).equals(String.format("%0" + (52 - bits) + "d", 0))

In other words, every significand bit of reducePrecision(x, bits) after the leading bits positions should be 0, while the leading bits positions should reasonably approximate the corresponding positions of the significand of x.
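The same zero-bits condition can also be checked without going through strings. A minimal sketch, assuming 0 <= bits <= 52 and using a hypothetical helper name lowBitsClear:

```java
public class TrailingBits {
    // True if every stored significand bit after the leading `bits` positions is zero.
    // Assumes 0 <= bits <= 52 (a double stores 52 significand bits).
    static boolean lowBitsClear(double d, int bits) {
        long mask = (1L << (52 - bits)) - 1;  // ones in the low (52 - bits) positions
        return (Double.doubleToRawLongBits(d) & mask) == 0;
    }

    public static void main(String[] args) {
        System.out.println(lowBitsClear(0.1, 23));                  // false
        System.out.println(lowBitsClear((double) (float) 0.1, 23)); // true
    }
}
```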

Tags: java floating-point precision ieee-754

1 answer

劳资没心，怎么记你

#2 · 2019-09-23 00:04

Suppose x is the number you wish to reduce the precision of and bits is the number of significant bits you wish to retain.

When bits is sufficiently large and the order of magnitude of x is sufficiently close to 0, then x * (1L << (bits - Math.getExponent(x))) will scale x so that the bits that need to be removed will appear in the fractional component (after the radix point) while the bits that will be retained will appear in the integer component (before the radix point). You can then round this to remove the fractional component and then divide the rounded number by (1L << (bits - Math.getExponent(x))) to restore the order of magnitude of x, i.e.:

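public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    // Cast the divisor to double; long / long would be integer division and discard the fraction.
    return Math.round(x * (1L << exponent)) / (double) (1L << exponent);
}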
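The shift in this version deserves a closer look, because Java uses only the low six bits of the count when shifting a long. A minimal sketch (the class and method names are made up for illustration):

```java
public class ShiftWrap {
    // Computes 2^e by shifting, which is only meaningful for 0 <= e <= 62.
    static long pow2ByShift(int e) {
        return 1L << e;
    }

    public static void main(String[] args) {
        System.out.println(pow2ByShift(62)); // 4611686018427387904, largest positive power of two in a long
        System.out.println(pow2ByShift(63)); // -9223372036854775808, the sign bit
        System.out.println(pow2ByShift(64)); // 1, because the count is taken modulo 64
        System.out.println(pow2ByShift(-1)); // -9223372036854775808, same as a shift by 63
    }
}
```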

However, (1L << exponent) will break down when Math.getExponent(x) > bits || Math.getExponent(x) < bits - 62. The solution is to use Math.pow(2, exponent) (or the fast pow2(exponent) implementation from this answer) to calculate a fractional, or a very large, power of 2, i.e.:

public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    return Math.round(x * Math.pow(2, exponent)) * Math.pow(2, -exponent);
}

However, Math.pow(2, exponent) will break down as exponent approaches -1074 or +1023. The solution is to use Math.scalb(x, exponent) so that the power of 2 doesn't have to be explicitly calculated, i.e.:

public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    // Cast the long to double; otherwise overload resolution picks Math.scalb(float, int).
    return Math.scalb((double) Math.round(Math.scalb(x, exponent)), -exponent);
}
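As a rough check on why scalb is preferable to an explicit power of two, the sketch below runs a pow-based and a scalb-based version on a small input. The class name, the method names viaPow and viaScalb, and the test value 0x1.8p-1010 are chosen for illustration only.

```java
public class PowVsScalb {
    // Scales by an explicitly computed power of two.
    static double viaPow(double x, int bits) {
        int exponent = bits - Math.getExponent(x);
        return Math.round(x * Math.pow(2, exponent)) * Math.pow(2, -exponent);
    }

    // Scales with Math.scalb, so the power of two is never materialised.
    // The (double) cast keeps overload resolution on scalb(double, int).
    static double viaScalb(double x, int bits) {
        int exponent = bits - Math.getExponent(x);
        return Math.scalb((double) Math.round(Math.scalb(x, exponent)), -exponent);
    }

    public static void main(String[] args) {
        double x = 0x1.8p-1010;  // only two significant bits, so 23 bits should keep it unchanged
        // Here exponent = 23 + 1010 = 1033, and 2^1033 overflows a double.
        System.out.println(Math.pow(2, 1033));   // Infinity
        System.out.println(viaPow(x, 23) == x);   // false
        System.out.println(viaScalb(x, 23) == x); // true
    }
}
```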

However, Math.round(y) returns a long, so it does not preserve Infinity, NaN, or values where Math.abs(x) > Long.MAX_VALUE / Math.pow(2, exponent): NaN becomes 0 and out-of-range values are clamped to Long.MIN_VALUE or Long.MAX_VALUE. Furthermore, Math.round(y) always rounds ties toward positive infinity (e.g. Math.round(0.5) == 1 && Math.round(1.5) == 2). The solution is to use Math.rint(y), which returns a double and follows the unbiased IEEE 754 round-to-nearest, ties-to-even rule (e.g. Math.rint(0.5) == 0.0 && Math.rint(1.5) == 2.0), i.e.:

public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    return Math.scalb(Math.rint(Math.scalb(x, exponent)), -exponent);
}
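The differences between the two rounding calls are easy to see side by side. A small sketch (the wrapper names viaRound and viaRint are hypothetical and exist only to make the return types visible):

```java
public class RoundVsRint {
    // Math.round: result is a long, ties go toward positive infinity.
    static long viaRound(double y) {
        return Math.round(y);
    }

    // Math.rint: result stays a double, ties go to the even neighbour.
    static double viaRint(double y) {
        return Math.rint(y);
    }

    public static void main(String[] args) {
        System.out.println(viaRound(2.5));                      // 3
        System.out.println(viaRint(2.5));                       // 2.0
        System.out.println(viaRound(Double.NaN));               // 0
        System.out.println(viaRint(Double.NaN));                // NaN
        System.out.println(viaRound(Double.POSITIVE_INFINITY)); // 9223372036854775807
        System.out.println(viaRint(Double.POSITIVE_INFINITY));  // Infinity
    }
}
```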

Finally, here is a unit test confirming our expectations:

public static String decompose(double d) {
    int SIGN_WIDTH = 1;
    int EXP_WIDTH = 11;
    int SIGNIFICAND_WIDTH = 53;
    String s = String.format("%64s", Long.toBinaryString(Double.doubleToRawLongBits(d))).replace(' ', '0');
    return s.substring(0, 0 + SIGN_WIDTH) + " "
            + s.substring(0 + SIGN_WIDTH, 0 + SIGN_WIDTH + EXP_WIDTH) + " "
            + s.substring(0 + SIGN_WIDTH + EXP_WIDTH, 0 + SIGN_WIDTH + EXP_WIDTH + SIGNIFICAND_WIDTH - 1);
}

public static void test() {
    // Use a fixed seed so the generated numbers are reproducible.
    java.util.Random r = new java.util.Random(0);

    // Generate a floating point number that makes use of its full 52 bits of significand precision.
    double a = r.nextDouble() * 100;
    System.out.println(decompose(a) + " " + a);
    Assert.assertFalse(decompose(a).split(" ")[2].substring(23).equals(String.format("%0" + (52 - 23) + "d", 0)));

    // Cast the double to a float to produce a "ground truth" of precision loss to compare against.
    double b = (float) a;
    System.out.println(decompose(b) + " " + b);
    Assert.assertTrue(decompose(b).split(" ")[2].substring(23).equals(String.format("%0" + (52 - 23) + "d", 0)));
    // 32-bit float has a 23 bit significand, so c's bit pattern should be identical to b's bit pattern.
    double c = reducePrecision(a, 23);
    System.out.println(decompose(c) + " " + c);
    Assert.assertTrue(b == c);

    // 23rd-most significant bit in c is 1, so rounding it to the 22nd-most significant bit requires breaking a tie.
    // Since 22nd-most significant bit in c is 0, d will be rounded down so that its 22nd-most significant bit remains 0.
    double d = reducePrecision(c, 22);
    System.out.println(decompose(d) + " " + d);
    Assert.assertTrue(decompose(d).split(" ")[2].substring(22).equals(String.format("%0" + (52 - 22) + "d", 0)));
    Assert.assertTrue(decompose(c).split(" ")[2].charAt(22) == '1' && decompose(c).split(" ")[2].charAt(21) == '0');
    Assert.assertTrue(decompose(d).split(" ")[2].charAt(21) == '0');
    // 21st-most significant bit in d is 1, so rounding it to the 20th-most significant bit requires breaking a tie.
    // Since 20th-most significant bit in d is 1, e will be rounded up so that its 20th-most significant bit becomes 0.
    double e = reducePrecision(d, 20);
    System.out.println(decompose(e) + " " + e);
    Assert.assertTrue(decompose(e).split(" ")[2].substring(20).equals(String.format("%0" + (52 - 20) + "d", 0)));
    Assert.assertTrue(decompose(d).split(" ")[2].charAt(20) == '1' && decompose(d).split(" ")[2].charAt(19) == '1');
    Assert.assertTrue(decompose(e).split(" ")[2].charAt(19) == '0');

    // Reduce the precision of a number close to the largest normal number.
    double f = reducePrecision(a * 0x1p+1017, 23);
    System.out.println(decompose(f) + " " + f);
    // Reduce the precision of a number close to the smallest normal number.
    double g = reducePrecision(a * 0x1p-1028, 23);
    System.out.println(decompose(g) + " " + g);
    // Reduce the precision of a number close to the smallest subnormal number.
    double h = reducePrecision(a * 0x1p-1051, 23);
    System.out.println(decompose(h) + " " + h);
}

And its output:

0 10000000101 0010010001100011000110011111011100100100111000111011 73.0967787376657
0 10000000101 0010010001100011000110100000000000000000000000000000 73.0967788696289
0 10000000101 0010010001100011000110100000000000000000000000000000 73.0967788696289
0 10000000101 0010010001100011000110000000000000000000000000000000 73.09677124023438
0 10000000101 0010010001100011001000000000000000000000000000000000 73.0968017578125
0 11111111110 0010010001100011000110100000000000000000000000000000 1.0266060746443803E308
0 00000000001 0010010001100011000110100000000000000000000000000000 2.541339559435826E-308
0 00000000000 0000000000000000000000100000000000000000000000000000 2.652494739E-315
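For the use case in the question, a possible usage sketch follows. The class name and the choice of 40 bits are assumptions made for illustration; the method body is the final reducePrecision from above.

```java
public class ReducePrecisionUsage {
    // Same body as the final reducePrecision in this answer.
    static double reducePrecision(double x, int bits) {
        int exponent = bits - Math.getExponent(x);
        return Math.scalb(Math.rint(Math.scalb(x, exponent)), -exponent);
    }

    public static void main(String[] args) {
        double x = 0.1 + 0.2;  // differs from 0.3 in the last significand bit
        double y = 0.3;
        System.out.println(x == y);                                           // false
        System.out.println(reducePrecision(x, 40) == reducePrecision(y, 40)); // true

        // Special values pass through unchanged.
        System.out.println(reducePrecision(Double.NaN, 23));               // NaN
        System.out.println(reducePrecision(Double.POSITIVE_INFINITY, 23)); // Infinity
    }
}
```

Two inputs that fall on opposite sides of a rounding boundary will still compare unequal after reduction, so this narrows the cases that need an explicit tolerance but does not remove them.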

