Thread safety while looping with OpenMP

2020-04-30 16:54发布

问题:

I'm working on a small Collatz conjecture calculator using C++ and GMP, and I'm trying to implement parallelism on it using OpenMP, but I'm coming across issues regarding thread safety. As it stands, attempting to run the code will yield this:

*** Error in `./collatz': double free or corruption (fasttop): 0x0000000001140c40 ***
*** Error in `./collatz': double free or corruption (fasttop): 0x00007f4d200008c0 ***
[1]    28163 abort (core dumped)  ./collatz

This is the code to reproduce the behaviour.

 #include <iostream>
 #include <gmpxx.h>

 mpz_class collatz(mpz_class n) {
     if (mpz_odd_p(n.get_mpz_t())) {
         n *= 3;
         n += 1;
     } else {
         n /= 2;
     }
     return n;
 }

 int main() {
     mpz_class x = 1;
 #pragma  omp parallel
     while (true) {
         //std::cout << x.get_str(10);
         while (true) {
             if (mpz_cmp_ui(x.get_mpz_t(), 1)) break;
             x = collatz(x);
         }
         x++;
         //std::cout << " OK" << std::endl;
     }
 }

Given that I did not get this error when I uncomment the outputs to screen, which are slow, I assume the issue at hand has to do with thread safety, and in particular with concurrent threads trying to increment x at the same time.

Am I correct in my assumptions? How can I fix this and make it safe to run?

回答1:

I assume what you want to do is to check if the collatz conjecture holds for all numbers. The program you posted is wrong on many levels both serially and in parallel.

if (mpz_cmp_ui(x.get_mpz_t(), 1)) break;

Means that it will break when x != 1. If you replace it with the correct 0 == mpz_cmp_ui, the code will just continue to test 2 over and over again. You have to have two variables anyway, one for the outer loop that represents what you want to check, and one for the inner loop performing the check. It's easier to get this right if you make a function for that:

void check_collatz(mpz_class n) {
    while (n != 1) {
        n = collatz(n);
    }
}

int main() {
    mpz_class x = 1;
    while (true) {
        std::cout << x.get_str(10);
        check_collatz(x);
        x++;
    }
}

The while (true) loop is bad to reason about and parallelize, so let's just make an equivalent for loop:

for (mpz_class x = 1;; x++) {
    check_collatz(x);
}

Now, we can talk about parallelizing the code. The basis for OpenMP parallelizing is a worksharing construct. You cannot just slap #pragma omp parallel on a while loop. Fortunately you can easily mark certain canonical for loops with #pragma omp parallel for. For that, however, you cannot use mpz_class as a loop variable, and you must specify an end for the loop:

#pragma omp parallel for
for (long check = 1; check <= std::numeric_limits<long>::max(); check++)
{
    check_collatz(check);
}

Note that check is implicitly private, there is a copy for each thread working on it. Also OpenMP will take care of distributing the work [1 ... 2^63] among threads. When a thread calls check_collatz a new, private, mpz_class object will be created for it.

Now, you might notice, that repeatedly creating a new mpz_class object in each loop iteration is costly (memory allocation). You can reuse that (by breaking check_collatz again) and creating a thread-private mpz_class working object. For this, you split the compound parallel for into separate parallel and for pragmas:

#include <gmpxx.h>
#include <iostream>
#include <limits>

// Avoid copying objects by taking and modifying a reference
void collatz(mpz_class& n)
{
    if (mpz_odd_p(n.get_mpz_t()))
    {
        n *= 3;
        n += 1;
    }
    else
    {
        n /= 2;
    }
}

int main()
{
#pragma omp parallel
    {
        mpz_class x;
#pragma omp for
        for (long check = 1; check <= std::numeric_limits<long>::max(); check++)
        {
            // Note: The structure of this fits perfectly in a for loop.
            for (x = check; x != 1; collatz(x));
        }
    }
}

Note that declaring x in the parallel region will make sure it is implicitly private and properly initialized. You should prefer that to declaring it outside and marking it private. This will often lead to confusion because explicitly private variables from outside scope are unitialized.

You might complain that this only checks the first 2^63 numbers. Just let it run. This gives you enough time to master OpenMP to expert level and write your own custom worksharing for GMP objects.

You were concerned about having extra objects for each thread. This is essential for good performance. You cannot solve this efficiently with locks/critical sections/atomics. You would have to protect each and every read and write to your only relevant variable. There would be no parallelism left.

Note: The huge for loop will likely have a load imbalance. So some threads will probably finish a few centuries earlier than the others. You could fix that with dynamic scheduling, or smaller static chunks.

Edit: For academic sake, here is one idea how to implement the worksharing directly on GMP objects:

#pragma omp parallel
    {
        // Note this is not a "parallel" loop
        // these are just separate loops on distinct strided 
        int nthreads = omp_num_threads();
        mpz_class check = 1;
        // we already checked those in the other program
        check += std::numeric_limits<long>::max(); 
        check += omp_get_thread_num();
        mpz_class x;
        for (; ; check += nthreads)
        {
            // Note: The structure of this fits perfectly in a for loop.
            for (x = check; x != 1; collatz(x));
        }
    }


回答2:

You could well be right about collisions with x. You can mark x as private by:

#pragma omp parallel private(x)

This way each thread gets their own "version" of the variable x, which should make this thread-safe. By default, variables declared before a #pragma omp parallel are public, so there is one shared instance between all of the threads.



回答3:

You might want to touch x only with atomic instructions.

#pragma omp atomic
x++;

This ensures that all threads see the same value of x without requires mutexes or other synchronization techniques.