Why does my code run slower with multiple threads?

Posted 2019-04-02 03:53

Question:

I'm writing a ray tracer.

Recently, I added threading to the program to exploit the additional cores on my i5 Quad Core.

In a weird turn of events the debug version of the application is now running slower, but the optimized build is running faster than before I added threading.

I'm passing the "-g -pg" flags to gcc for the debug build and the "-O3" flag for the optimized build.

Host system: Ubuntu Linux 10.04 AMD64.

I know that debug symbols add significant overhead to the program, but the relative performance has always been maintained, i.e. a faster algorithm will always run faster in both debug and optimized builds.

Any idea why I'm seeing this behavior?

Debug version is compiled with "-g3 -pg". Optimized version with "-O3".

Optimized no threading:        0m4.864s
Optimized threading:           0m2.075s

Debug no threading:            0m30.351s
Debug threading:               0m39.860s
Debug threading after "strip": 0m39.767s

Debug no threading (no-pg):    0m10.428s
Debug threading (no-pg):       0m4.045s

This convinces me that "-g3" is not to blame for the odd performance delta; rather, it's the "-pg" switch. It's likely that the "-pg" option adds some sort of locking mechanism to measure thread performance.

Since "-pg" is broken on threaded applications anyway, I'll just remove it.

Answer 1:

What do you get without the -pg flag? That's not debugging symbols (which don't affect the code generation), that's for profiling (which does).

It's quite plausible that profiling in a multithreaded process requires additional locking which slows the multithreaded version down, even to the point of making it slower than the non-multithreaded version.
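To illustrate the effect (this is only a toy model, not how gprof's actual mcount instrumentation works), imagine every function entry having to take a process-wide lock to update shared profiling state:

// Illustration only: mimics the effect of per-call instrumentation that
// serializes on a shared lock. With several threads, time goes into
// contention on profile_lock rather than into the "real" work, which is the
// kind of serialization that can make an instrumented multithreaded build
// slower than a single-threaded one.
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

static std::mutex profile_lock;      // shared by all threads
static long long call_counter = 0;   // hypothetical "profiling state"

static void fake_mcount() {
    std::lock_guard<std::mutex> guard(profile_lock);
    ++call_counter;
}

static void worker(long long iterations) {
    volatile double acc = 0.0;
    for (long long i = 0; i < iterations; ++i) {
        fake_mcount();               // per-call instrumentation
        acc = acc + 1.0;             // the actual work
    }
}

int main() {
    const int threads = 4;
    std::vector<std::thread> pool;

    auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < threads; ++t)
        pool.emplace_back(worker, 5000000LL);
    for (std::size_t t = 0; t < pool.size(); ++t)
        pool[t].join();
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = stop - start;
    std::printf("4 threads, contended instrumentation: %.3f s (%lld calls)\n",
                elapsed.count(), call_counter);
    return 0;
}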



Answer 2:

You are talking about two different things here: debug symbols and compiler optimization. If you use the strongest optimization settings the compiler has to offer, you do so at the cost of losing symbols that are useful in debugging.

Your application is not running slower due to debugging symbols; it's running slower because of less optimization done by the compiler.

Debugging symbols are not 'overhead' beyond the fact that they occupy more disk space. Code compiled at maximum optimization (-O3) should not be adding debug symbols; that's a flag you set when you have no need for those symbols.

If you need debugging symbols, you gain them at the expense of losing compiler optimization. However, once again, this is not 'overhead'; it's just the absence of compiler optimization.
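A quick way to convince yourself of this split (the kernel and compile lines below are just an illustration, not from the original post): compile the same hot loop at different -O levels, with and without -g3, and time each binary. The -O level should dominate the run time, while adding -g3 at the same -O level should barely change it; comparing "objdump -d" output for the -O3 and "-O3 -g3" binaries should show the same generated code, with only the debug sections differing.

// hot_loop.cpp -- a tiny kernel for comparing -g against the -O level.
// Assumed compile lines:
//   g++ -std=c++11 -O0     hot_loop.cpp -o o0      # unoptimized
//   g++ -std=c++11 -O0 -g3 hot_loop.cpp -o o0_g    # unoptimized + symbols
//   g++ -std=c++11 -O3     hot_loop.cpp -o o3      # optimized
//   g++ -std=c++11 -O3 -g3 hot_loop.cpp -o o3_g    # optimized + symbols
#include <chrono>
#include <cstdio>

int main() {
    volatile double sum = 0.0;   // volatile keeps -O3 from removing the loop

    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < 200000000L; ++i)
        sum = sum + static_cast<double>(i) * 1e-9;
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = stop - start;
    std::printf("sum=%f elapsed=%.3f s\n",
                static_cast<double>(sum), elapsed.count());
    return 0;
}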



Answer 3:

Is the profiling code inserting instrumentation calls in enough functions to hurt you?
If you single-step at the assembly-language level, you'll find out pretty quickly.



Answer 4:

Multithreaded code execution time is not always measured as expected by gprof. You should time your code with another timer in addition to gprof to see the difference.

My example: running the LULESH CORAL benchmark on a two-NUMA-node Intel Sandy Bridge machine (8 + 8 cores) with size -s 50 and 20 iterations (-i 20), compiled with gcc 6.3.0 and -O3, I get:

With 1 thread running: ~3.7 s without -pg and ~3.8 s with it, but according to the gprof analysis the code ran for only ~3.5 s.

With 16 threads running: ~0.6 s without -pg and ~0.8 s with it, but according to the gprof analysis the code ran for ~4.5 s ...

These times were measured with gettimeofday, outside the parallel region (at the start and end of the main function).

Therefore, if you had measured your application's time the same way, you might have seen the same speedup with and without -pg. It is just gprof's measurement that is wrong for parallel code; at least that is the case for the OpenMP version of LULESH.
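Here is a sketch of that measurement approach (the OpenMP loop and compile line are placeholders, not LULESH): take the wall-clock time with gettimeofday() outside the parallel region and compare it with what gprof reports.

// wallclock.cpp -- wall-clock timing around an OpenMP region.
// Assumed compile line:  g++ -std=c++11 -O3 -fopenmp [-pg] wallclock.cpp -o wallclock
#include <sys/time.h>
#include <cstdio>
#include <omp.h>

static double wall_seconds() {
    struct timeval tv;
    gettimeofday(&tv, nullptr);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main() {
    const long n = 100000000L;
    double sum = 0.0;

    double start = wall_seconds();   // before the parallel region

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i)
        sum += 1.0 / (1.0 + static_cast<double>(i));

    double stop = wall_seconds();    // after the parallel region

    std::printf("threads=%d sum=%f wall=%.3f s\n",
                omp_get_max_threads(), sum, stop - start);
    return 0;
}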