I ran an OpenMP program to perform the Jacobi method, and it was working very well: 2 threads performed slightly better than 2x one thread, and 4 threads 2x faster than one thread. I felt everything was working perfectly... until I reached exactly 20, 22, and 24 threads. I kept stripping the program down until I was left with this simple test:
#include <stdio.h>
#include <stdlib.h>   /* for atoi */
#include <omp.h>

int main(int argc, char *argv[]) {
    int i, n, maxiter, threads, nsquared, execs = 0;
    double begin, end;

    if (argc != 4) {
        printf("4 args\n");
        return 1;
    }
    n = atoi(argv[1]);
    threads = atoi(argv[2]);
    maxiter = atoi(argv[3]);

    omp_set_num_threads(threads);
    nsquared = n * n;

    begin = omp_get_wtime();
    while (execs < maxiter) {
        /* empty parallel loop: only the parallel-region overhead should matter */
        #pragma omp parallel for
        for (i = 0; i < nsquared; i++) {
            /* do nothing */
        }
        execs++;
    }
    end = omp_get_wtime();

    printf("%f seconds\n", end - begin);
    return 0;
}
And here is some output for different thread numbers:
./a.out 500 1 1000
0.6765799 seconds
./a.out 500 8 1000
0.0851808 seconds
./a.out 500 20 1000
19.5467 seconds
./a.out 500 22 1000
21.2296 seconds
./a.out 500 24 1000
20.1268 seconds
./a.out 500 26 1000
0.1363 seconds
I would understand a big slowdown if it continued for every thread count above 20, because I would put that down to thread overhead (though it still seems extreme). But even changing n leaves the times for 20, 22, and 24 threads unchanged, while changing maxiter to 100 scales them down to about 1.9 seconds, 2.2 seconds, ..., which suggests the thread creation alone is causing the slowdown, not the work inside the loop.
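To convince myself of that, I also tried an even more stripped-down version that times nothing but the fork/join of an empty parallel region (this is just a sketch of the idea, with the thread count and iteration count hard-coded, no command-line arguments):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int execs;
    double begin, end;

    omp_set_num_threads(20);          /* one of the "slow" thread counts */
    begin = omp_get_wtime();
    for (execs = 0; execs < 1000; execs++) {
        #pragma omp parallel
        {
            /* empty region: only team startup/shutdown is measured */
        }
    }
    end = omp_get_wtime();

    printf("%f seconds\n", end - begin);
    return 0;
}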
Is this something to do with the OS trying to schedule more threads than it has cores for? If it means anything, omp_get_num_procs() returns 24, and the machine has Intel Xeon processors (so does the 24 include hyper-threading?).
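For completeness, this is the kind of check I used to get that number, nothing beyond the standard OpenMP query calls:

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* On my machine this prints 24 for procs; whether that is
       12 cores + hyper-threading is exactly what I'm unsure about. */
    printf("procs: %d, max threads: %d\n",
           omp_get_num_procs(), omp_get_max_threads());
    return 0;
}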
Thanks for the help.