Parallel for loop in openmp

Published 2019-01-22 03:29

I'm trying to parallelize a very simple for-loop, but this is my first attempt at using openMP in a long time. I'm getting baffled by the run times. Here is my code:

#include <cmath>      // cos, sin, sqrt
#include <iostream>   // cout, endl
#include <vector>
#include <algorithm>

using namespace std;

int main () 
{
    int n=400000,  m=1000;  
    double x=0,y=0;
    double s=0;
    vector< double > shifts(n,0);


    #pragma omp parallel for 
    for (int j=0; j<n; j++) {

        double r=0.0;
        for (int i=0; i < m; i++){

            double rand_g1 = cos(i/double(m));
            double rand_g2 = sin(i/double(m));     

            x += rand_g1;
            y += rand_g2;
            r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
        }
        shifts[j] = r / m;
    }

    cout << *std::max_element( shifts.begin(), shifts.end() ) << endl;
}

I compile it with

g++ -O3 testMP.cc -o testMP  -I /opt/boost_1_48_0/include

that is, no "-fopenmp", and I get these timings:

real    0m18.417s
user    0m18.357s
sys     0m0.004s

when I do use "-fopenmp",

g++ -O3 -fopenmp testMP.cc -o testMP  -I /opt/boost_1_48_0/include

I get these numbers for the times:

real    0m6.853s
user    0m52.007s
sys     0m0.008s

which doesn't make sense to me. How can using eight cores result in only a 3-fold performance increase? Am I coding the loop correctly?

3 Answers
混吃等死
#2 · 2019-01-22 03:33

You should make use of the OpenMP reduction clause for x and y:

#pragma omp parallel for reduction(+:x,y)
for (int j=0; j<n; j++) {

    double r=0.0;
    for (int i=0; i < m; i++){

        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));     

        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}

With reduction, each thread accumulates into its own private copies of x and y, and at the end of the loop all the partial values are summed together to obtain the final values.
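As a minimal sketch of what the reduction clause does (the function name `reduced_sum` is my own, just for illustration): each thread gets a private, zero-initialized copy of `s`, and the copies are combined with `+` when the loop ends, so there is no race on the accumulator.

```cpp
#include <cmath>

// Sum cos(i/n) over i in [0, n) with an OpenMP reduction.
// Each thread accumulates a private partial sum of s; the
// partials are added together at the end of the parallel loop.
// Without -fopenmp the pragma is ignored and the loop runs serially,
// producing the same result.
double reduced_sum(int n) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++)
        s += std::cos(i / double(n));
    return s;
}
```

Because the per-thread partials are combined only once at the end, the result matches the serial loop (up to floating-point reassociation).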

Serial version:
25.05s user 0.01s system 99% cpu 25.059 total
OpenMP version w/ OMP_NUM_THREADS=16:
24.76s user 0.02s system 1590% cpu 1.559 total

See - superlinear speed-up :)

时光不老,我们不散
#3 · 2019-01-22 03:41

The most you can achieve(!) is a linear speedup. I don't remember offhand which of the Linux `time` fields is which, so I'd suggest you use time.h or (in C++11) <chrono> and measure the runtime directly from the program. Best pack the entire computation into a loop, run it 10 times, and average to get an approximate runtime.
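A minimal sketch of such a direct measurement with <chrono> (the helper name `time_ms` is my own, not a standard API):

```cpp
#include <chrono>

// Time a callable with a steady (monotonic) clock and return
// the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Wrapping the whole parallel loop in `time_ms`, running it several times, and averaging gives a more reliable figure than a single `time` invocation, and `steady_clock` measures wall-clock time, which is what matters for parallel speedup (the `user` field sums CPU time across all threads, which is why it grows with the thread count).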

Furthermore, you've got, in my opinion, a problem with x and y: all threads write to these same shared variables, which does not adhere to the data-locality paradigm of parallel programming (and is a data race besides).

等我变得足够好
#4 · 2019-01-22 03:52

Let's try to understand how to parallelize a simple for loop using OpenMP:

#pragma omp parallel
#pragma omp for
    for (int i = 1; i < 13; i++)
    {
       c[i] = a[i] + b[i];
    }

assume that we have 3 available threads, this is what will happen


firstly

  • Threads are assigned an independent set of iterations

and finally

  • Threads must wait at the end of work-sharing construct
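The chunking described above can be sketched in plain C++ (this is not an OpenMP API; `static_chunks` is a hypothetical helper illustrating how a default static schedule divides a contiguous iteration range among threads):

```cpp
#include <utility>
#include <vector>

// Illustration of a static schedule: split the iteration range
// [begin, end) into one contiguous chunk per thread, with any
// remainder iterations given to the earlier threads.
std::vector<std::pair<int, int>> static_chunks(int begin, int end, int nthreads) {
    std::vector<std::pair<int, int>> chunks;
    int total = end - begin;
    int base = total / nthreads, rem = total % nthreads;
    int lo = begin;
    for (int t = 0; t < nthreads; t++) {
        int len = base + (t < rem ? 1 : 0);  // earlier threads absorb the remainder
        chunks.push_back({lo, lo + len});
        lo += len;
    }
    return chunks;
}
```

For the loop above (iterations 1..12, three threads) this yields chunks [1,5), [5,9), and [9,13): each thread works on its own chunk independently, then all threads wait at the implicit barrier at the end of the work-sharing construct.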