Reduction with OpenMP: linear merging or log(number of threads) merging

Posted 2019-03-02 02:53

I have a general question about reductions with OpenMP that has bothered me for a while. My question is about merging the partial sums in a reduction: it can be done either linearly, in a number of steps proportional to the number of threads, or in a number of steps proportional to the log of the number of threads.

Let's assume I want to do a reduction of some function double foo(int i). With OpenMP I could do it like this.

double sum = 0.0;    
#pragma omp parallel for reduction (+:sum)
for(int i=0; i<n; i++) {
    sum += foo(i);
}

However, I claim that the following code will be just as efficient.

double sum = 0.0;
#pragma omp parallel
{
    double sum_private = 0.0;
    #pragma omp for nowait
    for(int i=0; i<n; i++) {
        sum_private += foo(i);
    }
    #pragma omp critical
    {
        sum += sum_private;
    }
}

Not only will this second code case have effectively the same performance, it's also more general: it can handle any operator I define, whereas the reduction clause only works with a few basic operators on plain old data types.
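
For example, here is a minimal sketch of the same pattern applied to an operation the reduction clause cannot express, a max-with-index over foo(i). The names maxloc_t and reduce_maxloc are just illustrative, not from any library.

typedef struct { double val; int idx; } maxloc_t;

maxloc_t reduce_maxloc(int n) {
    maxloc_t best = { -1.0e300, -1 };   /* acts as the identity: below any foo(i) */
    #pragma omp parallel
    {
        maxloc_t best_private = { -1.0e300, -1 };
        #pragma omp for nowait
        for(int i=0; i<n; i++) {
            double v = foo(i);
            if(v > best_private.val) { best_private.val = v; best_private.idx = i; }
        }
        #pragma omp critical            /* linear merge of the per-thread results */
        {
            if(best_private.val > best.val) best = best_private;
        }
    }
    return best;
}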

Let's assume there are t threads. The reason I claim this second method is just as fast is that the time to merge the partial sums is negligible compared to the time spent in the parallel loop. The time to compute the partial sums is proportional to n/t, while the time to merge them grows as t. So as long as n >> t, or foo is expensive enough compared to a single addition that the parallel loop dominates, the merging is negligible.
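
To put rough numbers on it: with the n = 2^28 iterations used in the test below and t = 64 threads, each thread does about 2^22 (roughly 4 million) iterations of the loop but contributes only a single addition (and one uncontended critical section) to the linear merge, so the merge is smaller by several orders of magnitude.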

I have heard it's possible to merge the partial sums in O(log(t)) steps. However, for all practical purposes I don't see how this helps. The maximum number of physical cores an OpenMP program typically runs on is on the order of 50; let's assume it's 64. Merging 64 values in 64 linear steps versus six binary steps makes little difference compared to the cost of the parallel loop. Additionally, merging the values in some kind of binary tree has its own overhead, which could exceed the cost of a simple linear merge, so it's not even necessarily faster.
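
For reference, a log2(t) merge of the partial sums could look something like the sketch below. This is only my own illustration of the idea, not what any OpenMP runtime necessarily does; reduce_tree is a hypothetical name, it reuses foo from above, and it assumes the actual team size matches omp_get_max_threads().

#include <stdlib.h>
#include <omp.h>

double reduce_tree(int n) {
    int t = omp_get_max_threads();
    double *sums = (double*)malloc(t*sizeof(double));   /* one slot per thread */
    #pragma omp parallel num_threads(t)
    {
        int tid = omp_get_thread_num();
        double sum_private = 0.0;
        #pragma omp for nowait
        for(int i=0; i<n; i++) sum_private += foo(i);
        sums[tid] = sum_private;
        /* pairwise merge in log2(t) rounds; every thread runs the same number
           of rounds, so each barrier is reached by the whole team */
        for(int s=1; s<t; s*=2) {
            #pragma omp barrier
            if(tid % (2*s) == 0 && tid + s < t) sums[tid] += sums[tid+s];
        }
    }
    double total = sums[0];
    free(sums);
    return total;
}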

When would merging the partial sums in O(log(t)) ever help? When would the first code case ever have a performance advantage over the second code case?

I know some coworkers who merge in O(log(t)) on the GPU with OpenCL (by launching the reduction kernel once per binary merge step), but I have not yet seen any evidence that it's better than just merging linearly.

Edit: Jim Cownie wanted to see an actual test rather than a claim. Below are the results and the code. This was done with MSVC2012 (64-bit, release mode) on a Xeon E5-1620 (Sandy Bridge) with four physical cores. Both the first and the second case are about 4.45x faster than the version without OpenMP.

The results:

without OpenMP time 1.787158 s
first case     time 0.400462 s
second case    time 0.400456 s

The code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

double foo(int i) {
    double fi = i;
    return 1.0*fi/(1+fi*fi);
}

double reduce(int n) {
    double sum = 0.0f;
    for(int i=0; i<n; i++) {
        sum += foo(i);
    }
    return sum;
}

double reduce_omp(int n) {
    double sum = 0.0f;
    #pragma omp parallel for reduction(+:sum)
    for(int i=0; i<n; i++) {
        sum += foo(i);
    }
    return sum;
}

double reduce_omp2(int n) {
    double sum = 0.0f;
    #pragma omp parallel 
    {
        double sum_private = 0.0f;
        #pragma omp for nowait
        for(int i=0; i<n; i++) {
            sum_private += foo(i);
        }
        #pragma omp critical 
        {
            sum += sum_private;
        }
    }
    return sum;
}

int main() {
    int n,r;
    double sum, dtime;
    n = 1<<28;
    r = 1;

    dtime = omp_get_wtime();
    for(int i=0; i<r; i++) sum = reduce(n);
    dtime = omp_get_wtime() - dtime;
    printf("time %f, sum %f\n", dtime, sum);

    reduce_omp(n);  //warm omp up

    dtime = omp_get_wtime();
    for(int i=0; i<r; i++) sum = reduce_omp(n);
    dtime = omp_get_wtime() - dtime;
    printf("time %f, sum %f\n", dtime, sum);

    dtime = omp_get_wtime();
    for(int i=0; i<r; i++) sum = reduce_omp2(n);
    dtime = omp_get_wtime() - dtime;
    printf("time %f, sum %f\n", dtime, sum);


}

1 Answer

甜甜的少女心 · 2019-03-02 03:25

The OpenMP implementation will decide how best to do the reduction based on the implementor's knowledge of the specific characteristics of the hardware it's running on. On a system with a small number of CPUs, it will probably do a linear reduction. On a system with hundreds or thousands of cores (e.g. a GPU or Intel Xeon Phi), it will likely do a logarithmic (tree) reduction.

The time spent in the reduction might not matter for very large problems, but for smaller problems it could add a few percent to the total runtime. Your implementation might be just as fast in many cases, but I doubt it would ever be faster, so why not let OpenMP decide on the optimal reduction strategy?
