I ran an OpenMP program to perform the Jacobi method, and it was working very well: 2 threads performed slightly better than 2x one thread, and 4 threads 2x faster than one thread. I felt everything was working perfectly... until I reached exactly 20, 22, and 24 threads. I kept stripping the program down until I was left with this simple test:
#include <stdio.h>
#include <stdlib.h>   /* for atoi */
#include <omp.h>

int main(int argc, char *argv[]) {
    int i, n, maxiter, threads, nsquared, execs = 0;
    double begin, end;

    if (argc != 4) {
        printf("4 args\n");
        return 1;
    }
    n = atoi(argv[1]);
    threads = atoi(argv[2]);
    maxiter = atoi(argv[3]);

    omp_set_num_threads(threads);
    nsquared = n * n;

    begin = omp_get_wtime();
    while (execs < maxiter) {
        /* empty parallel loop: only the parallel-region overhead should matter */
        #pragma omp parallel for
        for (i = 0; i < nsquared; i++) {
            /* do nothing */
        }
        execs++;
    }
    end = omp_get_wtime();

    printf("%f seconds\n", end - begin);
    return 0;
}
And here is some output for different thread numbers:
./a.out 500 1 1000
0.6765799 seconds
./a.out 500 8 1000
0.0851808 seconds
./a.out 500 20 1000
19.5467 seconds
./a.out 500 22 1000
21.2296 seconds
./a.out 500 24 1000
20.1268 seconds
./a.out 500 26 1000
0.1363 seconds
I would understand a big slowdown if it continued for every thread count above 20, because I would put that down to thread overhead (though it still seems extreme). But even changing n leaves the times for 20, 22, and 24 threads unchanged, while changing maxiter to 100 scales them down to about 1.9 seconds, 2.2 seconds, ..., which suggests the thread creation alone is causing the slowdown, not the work inside the loop.
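To convince myself of that, I also tried an even more stripped-down version that times nothing but the fork/join of an empty parallel region (this is just a sketch of the idea, with the thread count and iteration count hard-coded, no command-line arguments):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int execs;
    double begin, end;

    omp_set_num_threads(20);          /* one of the "slow" thread counts */
    begin = omp_get_wtime();
    for (execs = 0; execs < 1000; execs++) {
        #pragma omp parallel
        {
            /* empty region: only team startup/shutdown is measured */
        }
    }
    end = omp_get_wtime();

    printf("%f seconds\n", end - begin);
    return 0;
}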
Is this something to do with the OS trying to schedule more threads than it has cores for? If it means anything, omp_get_num_procs() returns 24, and the machine has Intel Xeon processors (so does the 24 include hyper-threading?).
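For completeness, this is the kind of check I used to get that number, nothing beyond the standard OpenMP query calls:

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* On my machine this prints 24 for procs; whether that is
       12 cores + hyper-threading is exactly what I'm unsure about. */
    printf("procs: %d, max threads: %d\n",
           omp_get_num_procs(), omp_get_max_threads());
    return 0;
}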
Thanks for the help.