So, I'm doing a matrix-vector product using OpenMP, but I've noticed it's working really slowly. After some time trying to figure out what's wrong, I just deleted all the code in the parallel section and it's still slow. What can the problem be here? (n = 1000)
Here are the time results for 1, 2 and 4 cores.
seq_method time = 0.001047194215062
parrallel_method (1) time = 0.001050273191140 seq - par = -0.000003078976079 seq/par = 0.997068404578433
parrallel_method (2) time = 0.001961992426004 seq - par = -0.000914798210943 seq/par = 0.533740192460558
parrallel_method (4) time = 0.004448095121916 seq - par = -0.003400900906854 seq/par = 0.235425319459132
Even when I delete the code from the parallel section, it doesn't change much.
void parallel_method(float A[n][n], float B[n], float C[n], int thr_num)
{
    double t1, t2;
    float tmp = 0;
    int i, j;

    t1 = omp_get_wtime();
    omp_set_dynamic(0);
    omp_set_num_threads(thr_num);

#pragma omp parallel for private(tmp, j, i)
    for (i = 0; i < n; i++) {
        tmp = 0;
        for (j = 0; j < n; j++) {
            tmp += A[i][j] * B[j];
        }
#pragma omp atomic
        C[i] += tmp;
    }
    //////
    t2 = omp_get_wtime();

    if (show_c) print_vector(C);

    par = t2 - t1;
    printf("\nparrallel_method (%d) time = %.15f", thr_num, par);
    printf("\nseq - par = %.15f", seq - par);
    printf("\nseq/par = %.15f\n", seq / par);
}
I tried to reproduce your problem and was not able to: I see completely coherent behavior. Except for small sizes, where thread overhead is significant, the results are more or less what is expected.
What I did:
- all measurements are done in the same run
- I run every function once, without timing, to warm up the caches
In real code measurements, I would also do the following (a rough sketch of such a harness follows this list):
- time several successive executions of the same function, especially if the time is short, in order to reduce small variations
- run several experiments and keep the smallest one to suppress outliers (I prefer the minimum, but you can also compute the average)
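In code, that methodology might look roughly like this (the helper name time_matvec, the REPEATS constant and the inlined kernel are my own assumptions, not code from your question):

#include <float.h>
#include <omp.h>

#define N 1000      /* same problem size as in the question */
#define REPEATS 10  /* assumption: number of timed repetitions per experiment */

/* Hypothetical harness: one untimed warm-up pass, then REPEATS timed passes,
   keeping the minimum to suppress outliers. */
double time_matvec(float A[N][N], const float B[N], float C[N], int thr_num)
{
    double best = DBL_MAX;

    omp_set_dynamic(0);
    omp_set_num_threads(thr_num);

    for (int r = -1; r < REPEATS; r++) {   /* r == -1 is the warm-up pass */
        double t1 = omp_get_wtime();

        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            float tmp = 0;
            for (int j = 0; j < N; j++)
                tmp += A[i][j] * B[j];
            C[i] = tmp;
        }

        double t2 = omp_get_wtime();
        if (r >= 0 && t2 - t1 < best)
            best = t2 - t1;
    }
    return best;
}

With this scheme the first pass brings A and B into the caches, and the reported time is the best of the remaining passes, all measured within the same run.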
You should have posted all your code, and I do not know what your methodology is, but I think that your measurements were done in different runs and without warming up the caches. For this code, the cache impact is very important, and the cores all have to store the same information (B). The problem is also not large enough to benefit from the larger aggregate L1/L2 caches. These multiple loads may explain the worse performance of the parallel code.
One last remark on your code: every thread will have its own values of i, hence C[i] can be accessed by only one thread and the atomic pragma is useless.
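For illustration, the same loop with the atomic removed (everything else unchanged; i does not need to appear in the private clause, since the loop index of a parallel for is private automatically):

#pragma omp parallel for private(tmp, j)
for (i = 0; i < n; i++) {
    tmp = 0;
    for (j = 0; j < n; j++) {
        tmp += A[i][j] * B[j];
    }
    C[i] += tmp;   /* plain store: each i is handled by exactly one thread */
}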