I'm trying to parallelize a very simple for-loop, but this is my first attempt at using openMP in a long time. I'm getting baffled by the run times. Here is my code:
#include <vector>
#include <algorithm>
using namespace std;
int main ()
{
int n=400000, m=1000;
double x=0,y=0;
double s=0;
vector< double > shifts(n,0);
#pragma omp parallel for
for (int j=0; j<n; j++) {
double r=0.0;
for (int i=0; i < m; i++){
double rand_g1 = cos(i/double(m));
double rand_g2 = sin(i/double(m));
x += rand_g1;
y += rand_g2;
r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
}
shifts[j] = r / m;
}
cout << *std::max_element( shifts.begin(), shifts.end() ) << endl;
}
I compile it with
g++ -O3 testMP.cc -o testMP -I /opt/boost_1_48_0/include
that is, no "-fopenmp", and I get these timings:
real 0m18.417s
user 0m18.357s
sys 0m0.004s
when I do use "-fopenmp",
g++ -O3 -fopenmp testMP.cc -o testMP -I /opt/boost_1_48_0/include
I get these numbers for the times:
real 0m6.853s
user 0m52.007s
sys 0m0.008s
which doesn't make sense to me. How using eight cores can only result in just 3-fold increase of performance? Am I coding the loop correctly?
You should make use of the OpenMP
reduction
clause forx
andy
:With
reduction
each thread accumulates its own partial sum inx
andy
and in the end all partial values are summed together in order to obtain the final values.See - superlinear speed-up :)
What you can achieve at most(!) is a linear speedup. Now I don't remember which is which with the times from linux, but I'd suggest you to use time.h or (in c++ 11) "chrono" and measure the runtime directly from the programm. Best pack the entire code into a loop, run it 10 times and average to get approx runtime by the prog.
Furthermore you've got imo a problem with x,y - which do not adhere to the paradigm of data locality in parallel programming.
let's try to understand how parallelize simple for loop using OpenMP
assume that we have
3
available threads, this is what will happenfirstly
and finally