Unbalanced nested for loops in OpenMP

I've been trying to parallelize an algorithm with unbalanced nested for loops using OpenMP. I can't post the original code as it's a secret project of an unheard government but here's a toy example:

for (i = 0; i < 100; i++) {
    #pragma omp parallel for private(j, k)
    for (j = 0; j < 1000000; j++) {
        for (k = 0; k < 2; k++) {
            temp = i * j * k;      /* dummy operation (don't mind the race) */
        }
        if (i % 2 == 0) temp = 0;  /* so I can't use openmp collapse */
    }
}

Currently this example runs slower with multiple threads (~1 sec on a single thread, ~2.4 sec on 2 threads, etc.).

Things to note:

The outer for loop needs to run in order (each step depends on the previous one). As far as I know, OpenMP handles inner loops well, so threads don't get created/destroyed at each step, right?
Typical index counts are the ones given in the example (100, 1000000, 2).
The dummy operation consists of just a few operations.
There are some conditional operations outside the innermost loop, so collapse is not an option (it doesn't seem like it would increase performance anyway).

It looks like an embarrassingly parallel algorithm, but I haven't been able to get any speedup for the last two days. What would be the best strategy here?

Tags: c multithreading parallel-processing multiprocessing openmp

2 Answers

#2 · 2019-08-12 15:18

Unfortunately this embarrassingly parallel algorithm is an embarrassingly bad example of how performant parallelism should be implemented. And since my crystal ball tells me that, besides i, temp is also a shared automatic variable, I will assume so for the rest of this text. It also tells me that you have a pre-Nehalem CPU...

There are two sources of slowdown here - code transformation and cache coherency.

The way parallel regions are implemented is that their code is extracted into separate functions. Shared local variables are extracted into structures that are then shared between the threads in the team that executes the parallel region. Under the OpenMP transformations your code sample would become something similar to this:

typedef struct {
    int i;
    int temp;
} main_omp_fn_0_shared_vars;

void main_omp_fn_0 (void *data) {
    main_omp_fn_0_shared_vars *vars = data;
    int j, k;          // private to each thread
    int j_min, j_max;  // this thread's share of the j iterations

    // compute values of j_min and j_max for this thread

    for (j = j_min; j < j_max; j++) {
        for (k = 0; k < 2; k++) {
            vars->temp = vars->i * j * k;
        }
        if (vars->i % 2 == 0) vars->temp = 0;
    }
}

int main (void) {
    int i, temp = 0;
    main_omp_fn_0_shared_vars vars;

    for (i = 0; i < 100; i++)
    {
        vars.i = i;
        vars.temp = temp;

        // This is how GCC implements parallel regions with libgomp

        // Start main_omp_fn_0 in the other threads
        GOMP_parallel_start(main_omp_fn_0, &vars, 0);

        // Start main_omp_fn_0 in the main thread
        main_omp_fn_0(&vars);

        // Wait for other threads to finish (implicit barrier)
        GOMP_parallel_end();

        i = vars.i;
        temp = vars.temp;
    }
}

You pay a small penalty for accessing temp and i this way, as their intermediate values cannot be kept in registers but are loaded from and stored to memory each time.
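One way to sidestep that penalty is to give each thread its own copies. The following is only a sketch under my own assumptions, not the asker's real code: inner_step and the out array are hypothetical names, and the result is stored per j just so there is something to observe. temp is declared inside the loop body, so it is private and can live in a register, and i is copied into each thread with firstprivate.

```c
/* Hypothetical helper (not from the question): one step of the outer loop.
 * Without -fopenmp the pragma is ignored and the loop simply runs serially. */
void inner_step(int i, int n, long long *out)
{
    #pragma omp parallel for firstprivate(i)
    for (int j = 0; j < n; j++) {
        long long temp = 0;                  /* private: declared in the loop body */
        for (int k = 0; k < 2; k++)
            temp = (long long)i * j * k;     /* same dummy operation as the question */
        if (i % 2 == 0)
            temp = 0;
        out[j] = temp;                       /* each j writes its own element */
    }
}
```

With the default static schedule each thread writes one contiguous chunk of out, so the threads mostly stay out of each other's cache lines.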

The other source of degradation is the cache coherency protocol. Accessing the same memory location from multiple threads executing on multiple CPU cores leads to lots of cache invalidation events. Worse, vars.i and vars.temp are likely to end up in the same cache line, and although vars.i is only read from and vars.temp is only written to, the whole cache line is likely to be invalidated at each iteration of the inner loop.
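To illustrate the cache line point, here is a sketch of the same shared structure with temp pushed onto its own line. The 64-byte line size is an assumption (typical for x86, not guaranteed), and shared_vars_padded, CACHE_LINE and temp_offset are names I made up. Note that this only separates the read-only i from the written temp; all threads writing the same temp would still fight over that line.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Hypothetical variant of the shared-variables struct */
typedef struct {
    int i;                           /* only read inside the parallel region */
    alignas(CACHE_LINE) int temp;    /* written: starts on the next cache line */
} shared_vars_padded;

/* Byte offset of temp from the start of the struct */
size_t temp_offset(void)
{
    return offsetof(shared_vars_padded, temp);
}
```

The price is memory: the struct grows from 8 bytes to two full cache lines.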

Normally access to shared variables is protected by explicit synchronisation constructs like atomic statements and critical sections, and performance degradation is to be expected in that case.
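For comparison, this is roughly what explicit synchronisation looks like, next to the reduction clause that often replaces it. Both functions are hypothetical examples of mine (they just count the odd values of j) and return the same result; the difference is that the first one makes every update go through the shared counter.

```c
/* Every increment is an atomic read-modify-write on the shared counter. */
long count_odd_atomic(long n)
{
    long count = 0;
    #pragma omp parallel for
    for (long j = 0; j < n; j++) {
        if (j % 2 != 0) {
            #pragma omp atomic
            count++;
        }
    }
    return count;
}

/* Each thread counts in a private copy; the copies are added up once at the end. */
long count_odd_reduction(long n)
{
    long count = 0;
    #pragma omp parallel for reduction(+:count)
    for (long j = 0; j < n; j++) {
        if (j % 2 != 0)
            count++;
    }
    return count;
}
```

Both compile without -fopenmp as well, in which case the pragmas are ignored and the loops run serially.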

欢心

#3 · 2019-08-12 15:24

Think of the overheads:

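Since your outer loop needs to be in order, you're starting a team of x threads to perform the work in the inner loop, joining them, then starting them again... and so on 100 times. (Many OpenMP runtimes keep the threads in a pool instead of literally destroying them, but every parallel region still pays a fork/join cost.)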

You have to wait until the longest task within the inner loop completes its work before performing the next step in the outer loop, so essentially this is a synchronization overhead. The tasks don't look irregular, but if the work to perform is small there's only so much speedup you can get out of this.

You have the cost of starting the thread team here, and of allocating the private variables.

If the work inside the inner loop is small the benefits of parallelising this loop might not necessarily outweigh the cost of the parallelisation overheads above, hence you end up with a slowdown.
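A common way to cut the repeated start-up cost is to open the parallel region once, around the outer loop, and only share out the inner loop. Below is a sketch of the toy example rewritten that way. run_steps is a hypothetical name, and summing temp into total is my addition so the function returns something checkable; it is not in the question.

```c
/* Hypothetical rewrite of the toy example: one parallel region around
 * both loops, so the thread team is set up once instead of once per step. */
long long run_steps(int steps, int n)
{
    long long total = 0;

    #pragma omp parallel
    {
        /* Every thread runs the outer loop; i is private because it is
         * declared inside the region. */
        for (int i = 0; i < steps; i++) {
            /* Work-sharing loop: the j iterations are split among the team.
             * The implicit barrier at its end keeps the steps in order. */
            #pragma omp for reduction(+:total)
            for (int j = 0; j < n; j++) {
                long long temp = 0;
                for (int k = 0; k < 2; k++)
                    temp = (long long)i * j * k;
                if (i % 2 == 0)
                    temp = 0;
                total += temp;
            }
        }
    }
    return total;
}
```

The barrier after each step is still there, so this removes the fork/join cost but not the synchronisation overhead. Whether it helps in practice depends on the runtime and on how much work each step does, so measure before and after.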
