I'm trying to parallelize a very simple for-loop, but this is my first attempt at using OpenMP in a long time, and I'm baffled by the run times. Here is my code:
#include <iostream>
#include <cmath>
#include <vector>
#include <algorithm>
using namespace std;
int main ()
{
    int n=400000, m=1000;
    double x=0,y=0;
    double s=0;
    vector< double > shifts(n,0);
    #pragma omp parallel for
    for (int j=0; j<n; j++) {
        double r=0.0;
        for (int i=0; i < m; i++){
            double rand_g1 = cos(i/double(m));
            double rand_g2 = sin(i/double(m));
            x += rand_g1;
            y += rand_g2;
            r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
        }
        shifts[j] = r / m;
    }
    cout << *std::max_element( shifts.begin(), shifts.end() ) << endl;
}
I compile it with
g++ -O3 testMP.cc -o testMP -I /opt/boost_1_48_0/include
that is, no "-fopenmp", and I get these timings:
real 0m18.417s
user 0m18.357s
sys 0m0.004s
When I do use "-fopenmp",
g++ -O3 -fopenmp testMP.cc -o testMP -I /opt/boost_1_48_0/include
I get these numbers for the times:
real 0m6.853s
user 0m52.007s
sys 0m0.008s
which doesn't make sense to me. How can using eight cores result in only a 3-fold increase in performance? Am I coding the loop correctly?
You should make use of the OpenMP reduction clause for x and y. With reduction, each thread accumulates its own partial sum in x and y, and in the end all partial values are summed together in order to obtain the final values.
See - superlinear speed-up :)
What you can achieve at most(!) is a linear speedup. Now I don't remember which is which with the times from Linux, but I'd suggest you use time.h or (in C++11) "chrono" and measure the runtime directly from the program. Best pack the entire code into a loop, run it 10 times and average to get the approximate runtime of the program.
Furthermore, you've got, IMO, a problem with x and y: they are shared and updated by every thread, which does not adhere to the paradigm of data locality in parallel programming.
Let's try to understand how to parallelize a simple for loop using OpenMP.
Assume that we have
available threads. This is what will happen, firstly
and finally