sum of small double numbers c++

Published 2019-01-22 22:20

Question:

Assume that we have an array of small (about 10^(-15)) double numbers in C++. If we calculate the sum of the numbers in this array sequentially, for example

double sum = 0;
for (int i = 0; i < n; i++) sum += array[i];

we get some value x.

But if we divide the array into some parts, calculate the sum within each part, and then add all the partial sums together, we get some value x2, which is close to x but not exactly x. So I have lost accuracy in calculating the sum.

Does someone know how to calculate the sum of small double numbers by partitioning them into parts without losing accuracy?
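
For illustration, here is a minimal, self-contained sketch of the effect (the values, the array length, and the partition count are arbitrary choices; on a typical IEEE-754 platform the two printed sums differ in their last digits):

#include <iomanip>
#include <iostream>
#include <vector>

int main()
{
    // One large value among many small ones makes the rounding visible.
    std::vector<double> array(1000000, 1e-15);
    array[0] = 1.0;
    int n = static_cast<int>(array.size());

    // Sequential sum, as in the question.
    double x = 0;
    for (int i = 0; i < n; i++) x += array[i];

    // Partitioned sum: 10 partial sums, added together at the end.
    double x2 = 0;
    int part = n / 10;
    for (int p = 0; p < 10; p++)
    {
        double partial = 0;
        for (int i = p * part; i < (p + 1) * part; i++) partial += array[i];
        x2 += partial;
    }

    std::cout << std::setprecision(17) << x << "\n" << x2 << "\n";
    return 0;
}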

Answer 1:

Using Kahan Summation:

#include <numeric>
#include <iostream>
#include <vector>

struct KahanAccumulation
{
    double sum;        // running sum
    double correction; // compensation for lost low-order bits
};

KahanAccumulation KahanSum(KahanAccumulation accumulation, double value)
{
    KahanAccumulation result;
    double y = value - accumulation.correction;     // remove the error carried over from the previous step
    double t = accumulation.sum + y;                // add; low-order bits of y may be lost here
    result.correction = (t - accumulation.sum) - y; // recover the lost bits as the next correction
    result.sum = t;
    return result;
}

int main()
{
    std::vector<double> numbers = {0.01, 0.001, 0.0001, 0.000001, 0.00000000001};
    KahanAccumulation init = {0.0, 0.0};
    KahanAccumulation result =
        std::accumulate(numbers.begin(), numbers.end(), init, KahanSum);

    std::cout << "Kahan Sum: " << result.sum << std::endl;
    return 0;
}

Output:

Kahan Sum: 0.011101




Answer 2:

The absolute size of the numbers is not the issue.

If you want a more accurate summation, have you considered a compensated sum? http://en.wikipedia.org/wiki/Kahan_summation_algorithm

However, if you really mean without losing any accuracy, your result will not necessarily fit in a double. If this is really what you want, you could look at Algorithm 908 at http://dl.acm.org/citation.cfm?id=1824815 or similar.
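
For reference, here is a minimal sketch of a compensated sum in the Neumaier variant of Kahan's algorithm (NeumaierSum is an illustrative name, not something from the linked paper); unlike plain Kahan summation it also compensates when an incoming term is larger in magnitude than the running sum:

#include <cmath>
#include <iostream>
#include <vector>

double NeumaierSum(const std::vector<double>& values)
{
    double sum = 0.0;
    double correction = 0.0;
    for (double v : values)
    {
        double t = sum + v;
        if (std::fabs(sum) >= std::fabs(v))
            correction += (sum - t) + v; // low-order bits of v were lost
        else
            correction += (v - t) + sum; // low-order bits of sum were lost
        sum = t;
    }
    return sum + correction;
}

int main()
{
    // Plain left-to-right summation of this sequence yields 0; the
    // compensated sum recovers the exact answer, 2.
    std::vector<double> numbers = {1.0, 1e100, 1.0, -1e100};
    std::cout << NeumaierSum(numbers) << std::endl;
    return 0;
}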



Answer 3:

The trick in these cases is to first sort the array from smallest to largest and then sum it in the loop you have written. That way, the accuracy is best.

You can also check the Kahan summation algorithm.
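
A minimal sketch of that idea (SortedSum is an illustrative name; sorting by absolute value is the usual form of this advice in case the signs are mixed):

#include <algorithm>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <vector>

// Sort by increasing magnitude, then sum sequentially, so that small
// terms accumulate before they meet much larger ones.
double SortedSum(std::vector<double> values) // take a copy so we may reorder it
{
    std::sort(values.begin(), values.end(),
              [](double a, double b) { return std::fabs(a) < std::fabs(b); });
    double sum = 0.0;
    for (double v : values) sum += v;
    return sum;
}

int main()
{
    std::cout << std::setprecision(17)
              << SortedSum({1e-15, 1.0, 3e-15, 2e-15}) << std::endl;
    return 0;
}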



Answer 4:

Consider applying the Kahan summation algorithm to your entire set or to each of your subsets.

There are other questions referencing this algorithm that can help you.
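
A sketch of that idea, reusing the KahanAccumulation struct and KahanSum function from the code in Answer 1 (PartitionedKahanSum is an illustrative name): run the compensated sum over each part, then fold the partial sums through the same compensated accumulator.

#include <numeric>
#include <vector>

// Assumes KahanAccumulation and KahanSum from Answer 1 are in scope.
double PartitionedKahanSum(const std::vector<std::vector<double>>& parts)
{
    KahanAccumulation total = {0.0, 0.0};
    for (const auto& part : parts)
    {
        // Compensated sum within each part...
        KahanAccumulation partial =
            std::accumulate(part.begin(), part.end(),
                            KahanAccumulation{0.0, 0.0}, KahanSum);
        // ...and compensated accumulation of the partial sums themselves.
        total = KahanSum(total, partial.sum);
    }
    return total.sum;
}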



Answer 5:

The double numbers in a computer are stored in the binary numeric system. That is why, when you see a double value in decimal notation, you are in fact seeing the value with some rounding (for instance, 0.1 is an infinite fraction in binary). You can do the same experiment with double values that are powers of 2 (for instance 2^(-30)), and then you will see that the values match.

The reason you observe a difference when you sum the double values in a different sequence is that after each calculation the result is rounded in the binary numeric system, so a small difference from the exact value appears.
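
A small demonstration of both points (the exact digits printed may vary with platform and rounding mode, but the power-of-two sums should always agree):

#include <cmath>
#include <iomanip>
#include <iostream>

int main()
{
    std::cout << std::setprecision(17);

    // 0.1, 0.2 and 0.3 are infinite fractions in binary, so each
    // addition rounds, and the grouping changes the result:
    double a = (0.1 + 0.2) + 0.3;
    double b = 0.1 + (0.2 + 0.3);
    std::cout << a << " vs " << b << "\n"; // typically 0.60000000000000009 vs 0.59999999999999998

    // Powers of two are exact in binary, so the grouping does not matter:
    double p = std::ldexp(1.0, -30); // 2^(-30)
    double c = (p + 2 * p) + 4 * p;
    double d = p + (2 * p + 4 * p);
    std::cout << (c == d ? "equal" : "different") << "\n"; // prints "equal"
    return 0;
}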



Answer 6:

Binary floating-point numbers used to represent decimal numbers have more precision than accuracy. You have found one way of surfacing the difference.



Answer 7:

It could be that your individual summations are being optimised and performed in a register at 80 bits but then transferred back to 64-bit doubles (read about compiler switches). Naturally this would lose precision. If this is the case, then breaking up the array and adding the individual 64-bit sums would give a different answer from adding them all at 80 bits and converting the grand total back.

This may not be the reason, but it might be worth researching further. Look at the chosen answer to this question.
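
Whether this happens depends on the target and the compiler flags (for example x87 versus SSE2 code generation), so it cannot be demonstrated portably. As a rough illustration of what extra accumulator precision does, the sketch below compares a plain double accumulator with a long double one (80-bit extended on many x86 toolchains; on some compilers, such as MSVC targeting x64, long double is the same as double and the two results will simply match):

#include <iomanip>
#include <iostream>
#include <vector>

int main()
{
    std::vector<double> array(1000000, 1e-15);
    array[0] = 1.0;

    double sum64 = 0.0;       // 64-bit accumulator
    long double sum80 = 0.0L; // extended precision, where available
    for (double v : array)
    {
        sum64 += v;
        sum80 += v;
    }

    std::cout << std::setprecision(17)
              << sum64 << "\n"
              << static_cast<double>(sum80) << "\n"; // converted back to 64 bits at the end
    return 0;
}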



Answer 8:

Loss of precision when adding very small numbers is no different from when adding normal-size numbers. What may be relevant is: a) are the RELATIVE differences in size between the numbers large? b) do the numbers have different SIGNS?

The latter issue is usually what is at stake with addition precision. What you should do (maybe not completely optimal, but a fair shot, and easy to implement; a code sketch follows these steps) is:

a) split them into subsets of positives and negatives respectively

b) sort each subset

Then

c) take the largest (in absolute size) number from the two sets combined, initialize your sum with that number, and remove it from its list

d) iteratively: whenever the current sum is positive, take the largest remaining negative, add it to the sum, and remove it from its list; whenever the current sum is negative, do likewise with the largest remaining positive.

In this way you have a fair chance that you have (almost) minimized the loss of precision to what is inherently unavoidable (given the representation of the numbers).
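
A sketch of steps a) through d) (BalancedSum is an illustrative name; the empty-input guard and the tie-breaking choices are my own assumptions):

#include <algorithm>
#include <functional>
#include <iostream>
#include <vector>

// Split by sign, sort each subset by magnitude, start from the largest
// term, then always pull from the side that drives the running sum back
// toward zero, so the intermediate sums stay as small as possible.
double BalancedSum(const std::vector<double>& values)
{
    if (values.empty()) return 0.0;

    std::vector<double> pos, neg; // a) split into positives and negatives
    for (double v : values)
        (v >= 0.0 ? pos : neg).push_back(v);

    // b) sort so the largest magnitude of each subset is at the back
    std::sort(pos.begin(), pos.end());
    std::sort(neg.begin(), neg.end(), std::greater<double>());

    auto take = [](std::vector<double>& v) { double x = v.back(); v.pop_back(); return x; };

    // c) initialize the sum with the largest absolute value overall
    double sum = (neg.empty() || (!pos.empty() && pos.back() >= -neg.back()))
                     ? take(pos) : take(neg);

    // d) repeatedly add the largest remaining value of the opposite sign
    while (!pos.empty() || !neg.empty())
    {
        if (sum >= 0.0)
            sum += !neg.empty() ? take(neg) : take(pos);
        else
            sum += !pos.empty() ? take(pos) : take(neg);
    }
    return sum;
}

int main()
{
    // Left-to-right summation gives 1 here; the balanced order gives the exact 2.
    std::cout << BalancedSum({1e16, 1.0, -1e16, 3.0, -2.0}) << std::endl;
    return 0;
}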