Precise sum of floating point numbers

2019-01-14 22:03发布

问题:

I am aware of a similar question, but I want to ask for people opinion on my algorithm to sum floating point numbers as accurately as possible with practical costs.

Here is my first solution:

put all numbers into a min-absolute-heap. // EDIT as told by comments below
pop the 2 smallest ones.
add them.
put the result back into the heap.
continue until there is only 1 number in the heap.

This one would take O(n*logn) instead of normal O(n). Is that really worth it?

The second solution comes from the characteristic of the data I'm working on. It is a huge list of positive numbers with similar order of magnitude.

a[size]; // contains numbers, start at index 0
for(step = 1; step < size; step<<=1)
    for(i = step-1; i+step<size; i+=2*step)
        a[i+step] += a[i];
    if(i < size-1)
        a[size-1] += a[i];

The basic idea is to do sum in a 'binary tree' fashion.

Note: it's a pseudo C code. step<<=1 means multiply step by 2. This one would take O(n). I feel like there might be a better approach. Can you recommend/criticize?

回答1:

Kahan's summation algorithm is significantly more precise than straightforward summation, and it runs in O(n) (somewhere between 1-4 times slower than straightforward summation depending how fast floating-point is compared to data access. Definitely less than 4 times slower on desktop hardware, and without any shuffling around of data).


Alternately, if you are using the usual x86 hardware, and if your compiler allows access to the 80-bit long double type, simply use the straightforward summation algorithm with the accumulator of type long double. Only convert the result to double at the very end.


If you really need a lot of precision, you can combine the above two solutions by using long double for variables c, y, t, sum in Kahan's summation algorithm.



回答2:

If you are concerned about reducing the numerical error in your summation then you may be interested in Kahan's algorithm.



回答3:

My guess is that your binary decomposition will work almost as well as Kahan summation.

Here is an example to illustrate it:

#include <stdio.h>
#include <stdlib.h>
#include <algorithm>

void sumpair( float *a, float *b)
{
    volatile float sum = *a + *b;
    volatile float small = sum - std::max(*a,*b);
    volatile float residue = std::min(*a,*b) - small;
    *a = sum;
    *b = residue;
}

void sumpairs( float *a,size_t size, size_t stride)
{
    if (size <= stride*2 ) {
        if( stride<size )
            sumpair(a+i,a+i+stride);
    } else {
        size_t half = 1;
        while(half*2 < size) half*=2;;
        sumpairs( a , half , stride );
        sumpairs( a+half , size-half , stride );
    }
}

void sumpairwise( float *a,size_t size )
{
    for(size_t stride=1;stride<size;stride*=2)
        sumpairs(a,size,stride);
}

int main()
{
    float data[10000000];
    size_t size= sizeof data/sizeof data[0];
    for(size_t i=0;i<size;i++) data[i]=((1<<30)*-1.0+random())/(1.0+random());

    float naive=0;
    for(size_t i=0;i<size;i++) naive+=data[i];
    printf("naive      sum=%.8g\n",naive);

    double dprec=0;
    for(size_t i=0;i<size;i++) dprec+=data[i];
    printf("dble prec  sum=%.8g\n",(float)dprec);

    sumpairwise( data , size );
    printf("1st approx sum=%.8g\n",data[0]);
    sumpairwise( data+1 , size-1);
    sumpairwise( data , 2 );
    printf("2nd approx sum=%.8g\n",data[0]);
    sumpairwise( data+2 , size-2);
    sumpairwise( data+1 , 2 );
    sumpairwise( data , 2 );
    printf("3rd approx sum=%.8g\n",data[0]);
    return 0;
}

I declared my operands volatile and compiled with -ffloat-store to avoid extra precision on x86 architecture

g++  -ffloat-store  -Wl,-stack_size,0x20000000 test_sum.c

and get: (0.03125 is 1ULP)

naive      sum=-373226.25
dble prec  sum=-373223.03
1st approx sum=-373223
2nd approx sum=-373223.06
3rd approx sum=-373223.06

This deserve a little explanation.

  • I first display naive summation
  • Then double precision summation (Kahan is roughly equivalent to that)
  • The 1st approximation is the same as your binary decomposition. Except that I store the sum in data[0] and that I care of storing residues. This way, the exact sum of data before and after summation is unchanged
  • This enables me to approximate the error by summing the residues at 2nd iteration in order to correct the 1st iteration (equivalent to applying Kahan on binary summation)
  • By iterating further I can further refine the result and we see a convergence


回答4:

The elements will be put into the heap in increasing order, so you can use two queues instead. This produces O(n) if the numbers are pre-sorted.

This pseudocode produces the same results as your algorithm and runs in O(n) if the input is pre-sorted and the sorting algorithm detects that:

Queue<float> leaves = sort(arguments[0]).toQueue();
Queue<float> nodes = new Queue();

popAny = #(){
       if(leaves.length == 0) return nodes.pop();
  else if(nodes.length == 0) return leaves.pop();
  else if(leaves.top() > nodes.top()) return nodes.pop();
  else return leaves.pop();
}

while(leaves.length>0 || nodes.length>1) nodes.push(popAny()+popAny());

return nodes.pop();