Assume that we have an array of small (about 10^(-15)) double numbers in C++. If we calculate the sum of the numbers in this array sequentially, for example

    double sum = 0;
    for (int i = 0; i < n; i++) sum += array[i];

we get some value x.
But if we divide the array into several parts, calculate the sum of each part, and then add all the partial sums together, we get some value x2, which is close to x but not exactly x. So I have lost accuracy in calculating the sum.

Does anyone know how to calculate the sum of small double numbers by partitioning them into parts without losing accuracy?
Consider applying the Kahan summation algorithm, either to your entire set or to each of your subsets. There are other questions referencing this algorithm that can help you.
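A minimal sketch of what that could look like in C++ (the function names and the partitioning scheme are mine, purely for illustration):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Kahan (compensated) summation: a second variable carries the low-order
    // bits that each addition to the running sum would otherwise discard.
    double kahan_sum(const double* a, std::size_t n) {
        double sum = 0.0;
        double c = 0.0;                  // running compensation for lost bits
        for (std::size_t i = 0; i < n; ++i) {
            double y = a[i] - c;         // apply the correction to the new term
            double t = sum + y;          // low-order bits of y are lost here...
            c = (t - sum) - y;           // ...and recovered here
            sum = t;
        }
        return sum;
    }

    // Partitioned variant: Kahan-sum each part, then Kahan-sum the partial
    // sums themselves. Assumes parts >= 1.
    double partitioned_kahan_sum(const std::vector<double>& a, std::size_t parts) {
        std::size_t chunk = (a.size() + parts - 1) / parts;   // ceiling division
        std::vector<double> partial;
        for (std::size_t begin = 0; begin < a.size(); begin += chunk) {
            std::size_t len = std::min(chunk, a.size() - begin);
            partial.push_back(kahan_sum(a.data() + begin, len));
        }
        return kahan_sum(partial.data(), partial.size());
    }

One caveat: aggressive floating-point optimisation (e.g. -ffast-math on GCC/Clang) lets the compiler reassociate (t - sum) - y away to zero, silently turning this back into naive summation, so compile with standard floating-point semantics.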
It could be that your individual summations are being optimised and performed in registers at 80 bits, but then transferred back to 64-bit doubles (read about compiler switches). Naturally this would lose precision. If this is the case, then breaking up the array and adding the individual 64-bit sums would give a different answer from adding them all as 80-bit values and converting the grand total back.

This may not be the reason, but it might be worth researching further. Look at the chosen answer to this question.
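If that is what is happening, one way to get consistent results is to keep the accumulator wide yourself and round only once at the end. A sketch, assuming long double is actually wider than double on your platform (it is 80-bit extended precision on most x86 GCC/Clang targets, but just a 64-bit double under MSVC):

    // Accumulate at extended precision and round to 64-bit double only once.
    double sum_extended(const double* a, int n) {
        long double sum = 0.0L;          // 80-bit extended on many x86 toolchains
        for (int i = 0; i < n; ++i)
            sum += a[i];
        return static_cast<double>(sum); // a single final rounding
    }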
Loss of precision when adding numbers is no different for very small numbers than for normal-size numbers. What may be relevant is: a) are the RELATIVE differences in size between the numbers large? b) do the numbers have different SIGNS?

The second issue is usually what is at stake with addition precision. What you should do - maybe not completely optimal, but a fair shot, and easy to implement - is:
a) split them into subsets of positives and negatives respectively
b) sort each subset
Then
c) take the largest (in absolute size) number from the two sets combined, initialize your sum with that number, and remove it from its list

d) iteratively: whenever the current sum is positive, take the largest remaining negative number, add it to the sum, and remove it from its list; whenever the current sum is negative, do likewise with the largest remaining positive number.
In this way you have a fair chance that you have (almost) minimized the loss of precision to what is inherently unavoidable (given the representation of the numbers).
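A sketch of steps a)-d) in C++ (the function name is mine; with the sum initialized to zero, the zero-sum tie-break on the first iteration reproduces step c by picking the larger of the two candidates):

    #include <algorithm>
    #include <cmath>
    #include <functional>
    #include <vector>

    // Sign-balanced summation: keep the running sum near zero so that terms
    // cancel each other instead of one swamping the other.
    double balanced_sum(const std::vector<double>& xs) {
        std::vector<double> pos, neg;                 // a) split by sign
        for (double x : xs) (x >= 0.0 ? pos : neg).push_back(x);
        std::sort(pos.begin(), pos.end());            // b) largest positive at the back
        std::sort(neg.begin(), neg.end(), std::greater<double>()); // most negative at the back

        double sum = 0.0;
        while (!pos.empty() || !neg.empty()) {
            bool take_negative;
            if (pos.empty())       take_negative = true;
            else if (neg.empty())  take_negative = false;
            else if (sum > 0.0)    take_negative = true;   // d) steer the sum toward zero
            else if (sum < 0.0)    take_negative = false;
            else                   take_negative = std::fabs(neg.back()) > pos.back(); // c)
            if (take_negative) { sum += neg.back(); neg.pop_back(); }
            else               { sum += pos.back(); pos.pop_back(); }
        }
        return sum;
    }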
Binary floating point numbers used to represent decimal numbers have more precision than accuracy. You have found one way of surfacing the difference.
Using Kahan summation: a small program along the following lines (the data here is made up to mimic the question) shows the naive sum and the compensated sum disagreeing in the trailing digits:
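    #include <cstdio>
    #include <vector>

    int main() {
        // One value of 1.0 followed by ten million values of about 1e-15:
        // each small addend lands near the last bits of the running sum,
        // so the naive loop loses a little information on every addition.
        std::vector<double> a(10000001, 1e-15);
        a[0] = 1.0;

        double naive = 0.0;
        for (double x : a) naive += x;

        double sum = 0.0, c = 0.0;        // Kahan running sum + compensation
        for (double x : a) {
            double y = x - c;
            double t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }

        std::printf("naive: %.17g\nkahan: %.17g\n", naive, sum);
        return 0;
    }

Printed to 17 significant digits, the two results typically differ in their trailing digits, with the compensated sum staying much closer to the true total.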
The trick in these cases is to first sort the array from smallest to largest, and then sum it in the loop you already have. That way, the accuracy is best.
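A minimal sketch, assuming the values are all non-negative as in the question (with mixed signs you would sort by absolute value instead):

    #include <algorithm>
    #include <vector>

    // Summing in increasing order keeps each new term as close as possible
    // in magnitude to the running sum, which reduces the rounding error of
    // each individual addition.
    double sorted_sum(std::vector<double> a) {  // take a copy so sorting is local
        std::sort(a.begin(), a.end());          // smallest first
        double sum = 0.0;
        for (double x : a) sum += x;
        return sum;
    }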
You can also check the Kahan summation algorithm.