Manual synchronization in OpenMP while loop

2019-07-03 18:25发布

问题:

I recently started working with OpenMP to do some 'research' for an project in university. I have a rectangular and evenly spaced grid on which I'm solving a partial differential equation with an iterative scheme. So I basically have two for-loops (one in x- and y-direction of the grid each) wrapped by a while-loop for the iterations.

Now I want to investigate different parallelization schemes for this. The first (obvious) approach was to do a spatial a parallelization on the for loops. Works fine too.

The approach I have problems with is a more tricky idea. Each thread calculates all grid points. The first thread starts solving the equation at the first grid row (y=0). When it's finished the thread goes on with the next row (y=1) and so on. At the same time thread #2 can already start at y=0, because all the necessary information are already available. I just need to do a kind of a manual synchronization between the threads so they can't overtake each other.

Therefore I used an array called check. It contains the thread-id that is currently allowed to work on each grid row. When the upcoming row is not 'ready' (value in check[j] is not correct), the thread goes into an empty while-loop, until it is.

Things will get clearer with a MWE:

#include <stdio.h>
#include <math.h>
#include <omp.h>

int main()
{
    // initialize variables
    int iter = 0;                       // iteration step counter
    int check[100] = { 0 };             // initialize all rows for thread #0

    #pragma omp parallel num_threads(2)
    {
        int ID, num_threads, nextID;
        double u[100 * 300] = { 0 };

        // get parallelization info
        ID = omp_get_thread_num();
        num_threads = omp_get_num_threads();

        // determine next valid id
        if (ID == num_threads - 1) nextID = 0;
        else nextID = ID + 1;

        // iteration loop until abort criteria (HERE: SIMPLIFIED) are valid 
        while (iter<1000)
        {
            // rows (j=0 and j=99 are boundary conditions and don't have to be calculated)
            for (int j = 1; j < (100 - 1); j++)
            {
                // manual sychronization: wait until previous thread completed enough rows
                while (check[j + 1] != ID)
                {
                    //printf("Thread #%d is waiting!\n", ID);
                }

                // gridpoints in row j
                for (int i = 1; i < (300 - 1); i++)
                {
                    // solve PDE on gridpoint
                    // replaced by random operation to consume time
                    double ignore = pow(8.39804,10.02938) - pow(12.72036,5.00983);
                }

                // update of check array in atomic to avoid race condition
                #pragma omp atomic write
                {
                    check[j] = nextID;
                }
            }// for j

            #pragma omp atomic write
            check[100 - 1] = nextID;

            #pragma omp atomic
            iter++;

            #pragma omp single
            {
                printf("Iteration step: %d\n\n", iter);
            }
        }//while
    }// omp parallel
}//main

The thing is, this MWE actually works on my machine. But if I copy it into my project, it doesn't. Additionally the outcome is always different: It stops either after the first iteration or after the third.

Another weird thing: when I remove the slashes of the comment in the inner while-loop it works! The output contains some

"Thread #1 is waiting!"

but that's reasonable. To me it looks like I created somehow a race condition, but I don't know where.

Does somebody has an idea what the problem could be? Or a hint how to realize this kind of synchronization?

回答1:

I think you are mixing up atomicity and memory consitency. The OpenMP standard actually describes it very nicely in

1.4 Memory Model (emphasis mine):

The OpenMP API provides a relaxed-consistency, shared-memory model. All OpenMP threads have access to a place to store and to retrieve variables, called the memory. In addition, each thread is allowed to have its own temporary view of the memory. The temporary view of memory for each thread is not a required part of the OpenMP memory model, but can represent any kind of intervening structure, such as machine registers, cache, or other local storage, between the thread and the memory. The temporary view of memory allows the thread to cache variables and thereby to avoid going to memory for every reference to a variable.

1.4.3 The Flush Operation

The memory model has relaxed-consistency because a thread’s temporary view of memory is not required to be consistent with memory at all times. A value written to a variable can remain in the thread’s temporary view until it is forced to memory at a later time. Likewise, a read from a variable may retrieve the value from the thread’s temporary view, unless it is forced to read from memory. The OpenMP flush operation enforces consistency between the temporary view and memory.

To avoid that, you should also make the read of check[] atomic and specify the seq_cst clause to your atomic constructs. This clause forces an implicit flush to the operation. (It is called a sequentially consistent atomic construct)

int c;
// manual sychronization: wait until previous thread completed enough rows
do
{
    #pragma omp atomic read
    c = check[j + 1];
} while (c != ID);

Disclaimer: I can't really try the code right now.

Furhter Notes:

I think the iter stop criteria is bogus, the way you use it, but I guess that's irrelevant given that it is not your actual criteria.

I assume this variant will perform worse than the spatial decomposition. You loose a lot of data locality, especially on NUMA systems. But of course it is fine to try and measure.

There seems to be a discrepancy between your code (using check[j + 1]) and your description "At the same time thread #2 can already start at y=0"