Question:
I have here what I understand to be a relatively simple OpenMP construct. The issue is that the program runs about 100-300x faster with 1 thread compared to 2 threads. 87% of the program's time is spent in gomp_send_wait() and another 9.5% in gomp_send_post.

The program gives correct results, but I wonder if there is a flaw in the code that is causing some resource conflict, or if it is simply that the thread-creation overhead is drastically not worth it for a loop of chunk size 4. p ranges from 17 to 1000, depending on the size of the molecule we're simulating.

My numbers are for the worst case, when p is 17 and the chunk size is 4. The performance is the same whether I'm using static, dynamic, or guided scheduling. With p=150 and chunk size 75, the program is still 75x-100x slower than serial.
...
double e_t_sum = 0.0;
double e_in_sum = 0.0;
int nthreads, tid;

#pragma omp parallel for schedule(static, 4) reduction(+ : e_t_sum, e_in_sum) shared(ee_t) private(tid, i, d_x, d_y, d_z, rr) firstprivate(V_in, t_x, t_y, t_z) lastprivate(nthreads)
for (i = 0; i < p; i++){
    if (i != c){
        nthreads = omp_get_num_threads();
        tid = omp_get_thread_num();
        d_x = V_in[i].x - t_x;
        d_y = V_in[i].y - t_y;
        d_z = V_in[i].z - t_z;
        rr = d_x * d_x + d_y * d_y + d_z * d_z;
        if (i < c){
            ee_t[i][c] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
            e_t_sum += ee_t[i][c];
            e_in_sum += ee_in[i][c];
        }
        else{
            ee_t[c][i] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
            e_t_sum += ee_t[c][i];
            e_in_sum += ee_in[c][i];
        }
        // if(pid==0){printf("e_t_sum[%d]: %f\n", tid, e_t_sum[tid]);}
    }
}//end parallel for
e_t += e_t_sum;
e_t -= e_in_sum;
...
Answer 1:
First, I don't think optimizing your serial code in this case will help answer your OpenMP dilemma. Don't worry about it.
IMO there are three possible explanations for the slowdown:

1. The elements of the array ee_t are leading to false sharing within the cache line. False sharing is when cores end up writing to the same cache line, not because they are actually sharing data, but because what the cores are writing just happens to lie in the same cache line (which is why it's called false sharing). I can explain more if you don't find anything useful about false sharing on Google. Making the ee_t elements cache-line aligned may help a lot.

2. The overhead of spawning work is higher than the parallelism benefit. Have you tried fewer than 8 cores? How is performance at 2 cores?

3. The total number of iterations is small; say we take 17 as an example. If you split it across 8 cores, it will suffer load-imbalance problems (especially since some of your iterations do practically no work, namely when i == c). At least one core will have to do 3 iterations while all the others do 2. This does not explain a slowdown, but it is surely one reason why the speedup is not as high as you might expect. Since your iterations are of varying lengths, I would use a dynamic schedule with a chunk size of 1, or use OpenMP's guided schedule (a sketch follows right after this list). Experiment with the chunk size; a chunk that is too small will also lead to slowdown.
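For reference, a minimal sketch of the scheduling change from point 3. Only the pragma differs; the loop body, p, and c are assumed to be exactly as in the question:

#pragma omp parallel for schedule(dynamic, 1) reduction(+ : e_t_sum, e_in_sum)
for (i = 0; i < p; i++){
    /* same body as in the question; dynamic,1 lets idle threads grab
       single iterations, smoothing out the cheap i == c case */
}

schedule(guided) is the other variant worth timing: its chunks start large and shrink, so the dispatch overhead stays lower than with dynamic,1 while the load still evens out.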
Let me know how it goes.
Answer 2:
You seem to think that it is a given that if you run serial code in multithreaded mode it has to perform better. That is not a given. And it's often not true. Parallelizing a loop to run on multiple threads or multiple processors does not always result in better performance. In most cases, some restructuring is needed. In your case the code isn't even good serial code.
Any book on serial-code optimization has as rule number 1 for loops: remove all conditional operations. Tests cost. Some compilers (by the way, you never say what OS/compiler/processor you're using; it does matter) can try to optimize over conditional code. Some compilers (like Sun's C compiler) even let you run the program in "collect" mode, where it generates runtime profile information about how often each branch of a conditional is taken, and then let you re-compile in a mode that uses that collected data to optimize the generated code. (See the -xprofile option.)
The first rule for optimizing parallel code is first do the best serial optimization you can. Then parallelize the loops.
By moving the conditionals outside the loop and, as Metiu suggests, rewriting the code as two separate loops, you give the optimizer a better source to work with. The serial code runs better, and the parallelized code is embarrassingly parallel.
Still, results may vary depending on OS/compiler/platform.
See the books Using OpenMP and Solaris Application Programming.
Answer 3:
I believe you should try moving all those branches (i.e., the ifs) out of the loop, and split it into two separate loops, one for i < c and one for i > c.
This would greatly benefit even the single-threaded code, but it should give you more parallelism, even if, as you said, the thread-creation overhead may be larger than the benefit for small p.
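A minimal sketch of that restructuring, reusing the variables from the question. Declaring the temporaries inside the loop body makes them private to each iteration automatically, so the long private clause disappears:

double e_t_sum = 0.0, e_in_sum = 0.0;

/* i < c: no branches left in the body */
#pragma omp parallel for reduction(+ : e_t_sum, e_in_sum) shared(ee_t)
for (int i = 0; i < c; i++){
    double d_x = V_in[i].x - t_x;
    double d_y = V_in[i].y - t_y;
    double d_z = V_in[i].z - t_z;
    double rr = d_x * d_x + d_y * d_y + d_z * d_z;
    ee_t[i][c] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
    e_t_sum += ee_t[i][c];
    e_in_sum += ee_in[i][c];
}

/* i > c: same body with the ee_t indices swapped */
#pragma omp parallel for reduction(+ : e_t_sum, e_in_sum) shared(ee_t)
for (int i = c + 1; i < p; i++){
    double d_x = V_in[i].x - t_x;
    double d_y = V_in[i].y - t_y;
    double d_z = V_in[i].z - t_z;
    double rr = d_x * d_x + d_y * d_y + d_z * d_z;
    ee_t[c][i] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
    e_t_sum += ee_t[c][i];
    e_in_sum += ee_in[c][i];
}

Each loop body is now straight-line code, and both loops are embarrassingly parallel.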
Answer 4:
Metiu is right. You can't expect good performance from a loop that has conditional statements in it. This is just bad coding. Even for scalar performance.
Your boss needs to understand that OpenMP and parallelization in general are not magic. To get good performance out of a parallelized code requires that the basic code be optimized for scalar performance first.
The tests do not have to be removed. The loop needs to be restructured. And scalar performance will benefit also.
Answer 5:
First, try pumping the chunk size larger still. Thread creation carries overhead, picking up new work for each thread does too, and the grain-size needs to be large enough to overwhelm that.
One bigger possibility:
GOMP's implementation of the reduction may be very poor (suggested by your profile data), generating messages after each block rather than accumulating within each thread and then collecting the results at the end. Try allocating e_t_sum and e_in_sum as arrays with nthreads elements each, adding to e_t_sum[tid] within the loop, and then looping over them to compute the global sum after the parallel loop is finished.
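A minimal sketch of that manual reduction, assuming the thread count is fetched up front with omp_get_max_threads() (needs <omp.h> and, for calloc/free, <stdlib.h>); e_t_part, e_in_part, and nthr are illustrative names:

int nthr = omp_get_max_threads();                 /* upper bound on team size */
double *e_t_part = calloc(nthr, sizeof(double));  /* calloc zeroes the partial sums */
double *e_in_part = calloc(nthr, sizeof(double));

#pragma omp parallel for schedule(static, 4) shared(ee_t, e_t_part, e_in_part)
for (int i = 0; i < p; i++){
    if (i != c){
        int tid = omp_get_thread_num();
        double d_x = V_in[i].x - t_x;
        double d_y = V_in[i].y - t_y;
        double d_z = V_in[i].z - t_z;
        double rr = d_x * d_x + d_y * d_y + d_z * d_z;
        if (i < c){
            ee_t[i][c] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
            e_t_part[tid] += ee_t[i][c];
            e_in_part[tid] += ee_in[i][c];
        } else {
            ee_t[c][i] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
            e_t_part[tid] += ee_t[c][i];
            e_in_part[tid] += ee_in[c][i];
        }
    }
}

for (int t = 0; t < nthr; t++){                   /* serial collection */
    e_t_sum += e_t_part[t];
    e_in_sum += e_in_part[t];
}
free(e_t_part);
free(e_in_part);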
Note that this introduces a potential false sharing issue in that multiple elements of these arrays will lie within common cachelines, and multiple processors will be writing to this same cacheline. If you run this on a set of cores that share their cache, this will be fine, but will stink elsewhere.
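A common workaround, sketched here with an assumed 64-byte line size (check your CPU's actual value), is to pad each thread's slot out to a full cache line so no two threads write to the same line; padded_sum is an illustrative name:

#define CACHE_LINE 64                             /* assumed line size */
typedef struct {
    double val;
    char pad[CACHE_LINE - sizeof(double)];        /* each slot owns its own line */
} padded_sum;

padded_sum *e_t_part = calloc(nthr, sizeof(padded_sum));
/* threads then update e_t_part[tid].val in the loop */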
Another possibility:
You might be experiencing false sharing in your updates to the elements of ee_t. Ensure alignment of this array, and try chunk sizes that are multiples of your cache line size. One subtle hint at this pathology would be the part of the loop where i > c taking disproportionately longer than the part where i < c.
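A sketch of the alignment part, assuming ee_t is a row-pointer matrix and a 64-byte cache line; posix_memalign is POSIX (C11's aligned_alloc would also work):

/* needs <stdlib.h>; alignment must be a power-of-two multiple of sizeof(void *) */
#define CACHE_LINE 64

double **ee_t = malloc(p * sizeof(double *));
for (int i = 0; i < p; i++){
    void *row;
    if (posix_memalign(&row, CACHE_LINE, p * sizeof(double)) != 0)
        exit(EXIT_FAILURE);                       /* allocation failed */
    ee_t[i] = row;
}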
Answer 6:
This looks like an issue with the GNU compiler's OpenMP implementation. Try a different compiler. Intel has a Linux compiler, a copy of which you should be able to download and try.
Another thing I noticed is that the firstprivate variables you have appear to be quite unnecessary. There might be significant overhead in making private copies of the array V_in, which might be your problem.
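As a sketch, if V_in, t_x, t_y, and t_z are only read inside the loop, the directive can drop the firstprivate clause entirely and let them stay shared, which is the default:

/* read-only data can stay shared; no per-thread copies are made */
#pragma omp parallel for schedule(static, 4) reduction(+ : e_t_sum, e_in_sum) \
        shared(ee_t) private(tid, d_x, d_y, d_z, rr)

The lastprivate(nthreads) clause can likely go too, since nothing after the loop appears to use it.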
I'd say it is one of those two issues that is causing your problem.