OpenMP: Huge performance differences between Visua

Posted 2020-02-17 07:09

Question:

I'm running a camera acquisition program that performs processing on acquired images, and I'm using simple OpenMP directives for this processing. So basically I wait for an image from the camera, and then process it.

After migrating to VC2010, I see a very strange performance problem: under VC2010 my app takes nearly 100% CPU, while it takes only 10% under VC2008.

If I benchmark only the processing code, I get no difference between VC2010 and VC2008; the difference appears only when the acquisition functions are used.

I have reduced the code needed to reproduce the problem to a simple loop that does the following:

  for (int i=0; i<1000; ++i)
  {
    GetImage(buffer);//wait for image
    Copy2Array(buffer, my_array);

    long long sum = 0;//do some simple OpenMP parallel loop
    #pragma omp parallel for reduction(+:sum)
    for (int j=0; j<size; ++j)
      sum += my_array[j];
  }

This loop uses 5% of the CPU with 2008, and 70% with 2010.

I've done some profiling, which shows that in 2010 most of the time is spent in OpenMP's vcomp100.dll!_vcomp::PartialBarrierN::Block.
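For anyone without the camera hardware, the loop can be approximated with a sleep in place of the acquisition call. This is only a sketch: fake_get_image and run_repro are hypothetical names, the sleep is an assumed stand-in for the blocking GetImage() call, and it needs a C++11 compiler for <chrono> and <thread> (so not VC2010 itself). Build with OpenMP enabled and watch the CPU usage while it runs.

```cpp
#include <algorithm>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical stand-in for GetImage(): blocks the way a camera wait would,
// then fills the buffer with a known value so the sum can be checked.
void fake_get_image(std::vector<unsigned char>& buffer, int wait_ms)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(wait_ms));
    std::fill(buffer.begin(), buffer.end(), static_cast<unsigned char>(1));
}

// Runs `frames` iterations of "wait, then reduce" and returns the last sum.
long long run_repro(int frames, int wait_ms, int size)
{
    std::vector<unsigned char> buffer(size);
    long long sum = 0;
    for (int i = 0; i < frames; ++i)
    {
        fake_get_image(buffer, wait_ms);    // long idle period between parallel regions

        sum = 0;
        const unsigned char* data = buffer.data();
        #pragma omp parallel for reduction(+:sum)
        for (int j = 0; j < size; ++j)
            sum += data[j];
    }
    return sum;
}
```

Calling run_repro(1000, 30, 1000000) from main gives roughly the same pattern as the original loop: a short parallel reduction followed by a much longer wait.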

I have also done some concurrency profiling:

In 2008, the processing work is distributed over 3 worker threads, which are only lightly active because the processing time is much shorter than the image waiting time.

The same threads appear in 2010, but they are all 100% occupied by the PartialBarrierN::Block function. As I have four cores, these three threads consume 75% of the CPU, which is roughly what I see in the CPU usage.

So it looks like there is a conflict between OpenMP and the (proprietary) Matrox acquisition library. But is it a bug in VS2010 or in the Matrox library? Is there anything I can do? Using VC++2010 is mandatory for me, so I cannot just stick with 2008.

Big thanks

STATUS UPDATE

Using the new concurrency framework, as suggested by DeadMG, leads to 40% CPU. Profiling shows that the time is spent in processing, so it doesn't exhibit the bug I'm seeing with OpenMP, but performance in my case is much poorer than with OpenMP.

STATUS UPDATE 2

I have installed an evaluation version of the latest Intel C++. It shows exactly the same performance problems!

I cross-posted to the MSDN forum.

STATUS UPDATE 3

Tested on Windows 7 64-bit and XP 32-bit, with exactly the same results (on the same machine).

Answer 1:

In the 2010 OpenMP runtime, each worker thread spin-waits for about 200 ms after completing a task. In my case, with an I/O wait followed by a repetitive OpenMP task, this loads the CPU massively.

The solution is to change this behaviour. Intel C++ has an extension routine for this, kmp_set_blocktime(). Visual 2010, however, offers no such option.

In this Autodesk note they discuss the problem for Intel C++. That compiler first introduced the behavior, but allows it to be changed (see above). Visual 2010 switched to the same behavior, but without a workaround like Intel's.

So to sum it up, switching to Intel C++ and using kmp_set_blocktime(0) solved it.
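A minimal sketch of how that call might be placed, assuming the Intel compiler's omp.h (which declares kmp_set_blocktime). The helper names disable_spin_wait and sum_array are made up for illustration, and the preprocessor guard is only there so the snippet still compiles with other compilers, where it does nothing.

```cpp
#if defined(_OPENMP)
#include <omp.h>
#endif

// Returns true if the Intel-specific call was compiled in, false otherwise.
bool disable_spin_wait()
{
#if defined(_OPENMP) && (defined(__INTEL_COMPILER) || defined(__INTEL_LLVM_COMPILER))
    kmp_set_blocktime(0);   // Intel extension: workers sleep right after a parallel region
    return true;
#else
    return false;           // not the Intel compiler: nothing is changed
#endif
}

// Same reduction as in the question, wrapped in a function.
long long sum_array(const int* data, int size)
{
    long long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int j = 0; j < size; ++j)
        sum += data[j];
    return sum;
}
```

The call would go once at startup, before the acquisition loop, so that the workers are already set to sleep when the first GetImage() wait begins.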

Thanks to John Lilley from DataLever Corporation on the other MSDN thread

The issue has been submitted to MS Connect, and received the "won't fix" feedback.



Answer 2:

With OpenMP 3.0 the spin-wait can be deactivated via OMP_WAIT_POLICY:

_putenv_s( "OMP_WAIT_POLICY", "PASSIVE" );

The effect is basically the same as with kmp_set_blocktime(0), but because we set the environment variable OMP_WAIT_POLICY at runtime, it only affects the current process and its child processes.

Of course OMP_WAIT_POLICY can also be set by a launcher application; Blender, for example, handles it that way.

A hotfix for VC2010 is available here; later versions like VC2013 support it directly.
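As a rough sketch, the call can be wrapped so that the same source builds on Windows and on POSIX systems. The helper names below are hypothetical, and setenv is assumed as the POSIX counterpart of _putenv_s. The variable has to be set before the OpenMP runtime reads it, so the call belongs at the very start of main, before the first parallel region; some runtimes may read it even earlier, at load time, in which case only the launcher approach helps.

```cpp
#include <stdlib.h>
#include <string.h>

// Sets OMP_WAIT_POLICY=PASSIVE for the current process; returns true on success.
bool request_passive_wait()
{
#if defined(_WIN32)
    return _putenv_s("OMP_WAIT_POLICY", "PASSIVE") == 0;
#else
    return setenv("OMP_WAIT_POLICY", "PASSIVE", 1) == 0;   // 1 = overwrite existing value
#endif
}

// Checks what the process environment currently holds for the variable.
bool passive_wait_requested()
{
    const char* value = getenv("OMP_WAIT_POLICY");
    return value != 0 && strcmp(value, "PASSIVE") == 0;
}
```

passive_wait_requested() only confirms that the variable is set in the process environment; it does not prove that the OpenMP runtime has honored it.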



Answer 3:

You could try the new Concurrency Runtime that ships with VS2010, starting with just your test sample.

That is,

for (int i=0; i<1000; ++i)
  {
    GetImage(buffer);//wait for image
    Copy2Array(buffer, my_array);

    long long sum = 0;//do some simple OpenMP parallel loop
    #pragma omp parallel for reduction(+:sum)
    for (int j=0; j<size; ++j)
      sum += my_array[j];
  }

would become

for (int i=0; i<1000; ++i)
  {
    GetImage(buffer);//wait for image
    Copy2Array(buffer, my_array);

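    // Requires #include <ppl.h>. Accumulate in long long, as in the OpenMP version.
    Concurrency::combinable<long long> partial;
    const int chunks = (size + 999) / 1000; // round up so a tail shorter than 1000 is kept
    Concurrency::parallel_for(0, chunks, [&](int j) {
      const int end = ((j + 1) * 1000 < size) ? (j + 1) * 1000 : size;
      for (int k = j * 1000; k < end; ++k)
          partial.local() += my_array[k];
    });
    long long sum = partial.combine([](long long a, long long b) { return a + b; });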
  }


Answer 4:

I tested another acquisition board, and the problem is identical, so the culprit is VC++2010. Microsoft made changes to its OpenMP implementation that break programs like mine, as a thread on the MSDN forums shows.