How to correctly benchmark a [templated] C++ program

Posted 2019-03-27 10:46

<background>

I'm at a point where I really need to optimize my C++ code. I'm writing a library for molecular simulations and I need to add a new feature. I already tried to add this feature in the past, but back then I used virtual functions called in nested loops. I had a bad feeling about that, and the first implementation proved it was a bad idea. It was, however, fine for testing the concept.

</background>

Now I need this feature to be as fast as possible (well, without assembly code or GPU computation; it still has to be C++ and reasonably readable). I now know a little more about templates and policy classes (from Alexandrescu's excellent book), and I think compile-time code generation may be the solution.

However, I need to test the design before doing the huge work of implementing it in the library. This question is about the best way to test the efficiency of this new feature.

Obviously I need to turn optimizations on, because without them g++ (and probably other compilers as well) keeps unnecessary operations in the object code. I also need to make heavy use of the new feature in the benchmark, because a delta of 1e-3 seconds can make the difference between a good and a bad design (this feature will be called millions of times in the real program).

The problem is that g++ is sometimes "too smart" while optimizing and can remove a whole loop if it considers that the result of a calculation is never used. I've already seen that happen once when looking at the generated assembly code.
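
Just to illustrate (this is a made-up toy loop, not the real code): a function like the following compiles to an essentially empty body at -O2, because nothing ever reads sum afterwards.

#include <cstddef>

void toy_loop(std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += static_cast<double>(i) * 0.5;   // stand-in for the real feature
    }
    // sum is never used after the loop, so g++ -O2 is free to delete
    // the whole loop as dead code -- and in practice it does.
}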

If I add some printing to stdout, the compiler will be forced to do the calculation in the loop, but then I will probably mostly be benchmarking the iostream implementation.

So how can I correctly benchmark a small feature extracted from a library? Related question: is it a valid approach to do this kind of in vitro test on a small unit, or do I need the whole context?

Thanks for any advice!


There seem to be several strategies, from compiler-specific options allowing fine-tuning to more general solutions that should work with every compiler, like volatile or extern.

I think I will try all of these. Thanks a lot for all your answers!

11 answers
兄弟一词,经得起流年.
#2 · 2019-03-27 11:07

Unless you have a really aggressive compiler (which can happen), I'd suggest calculating a checksum (simply add all the results together) and outputting it.
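
A minimal sketch of the checksum idea (the vector and the loop body are just placeholders for whatever you're measuring):

#include <cstdio>
#include <vector>

int main() {
    std::vector<double> input(1000000, 1.5);

    double checksum = 0.0;
    for (double x : input) {
        checksum += x * x;            // placeholder for the feature being timed
    }

    // A single cheap print at the end keeps every iteration observable
    // without dragging iostream into the measured loop.
    std::printf("checksum = %f\n", checksum);
    return 0;
}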

Other than that, you might want to look at the generated assembly code before running any benchmarks so you can visually verify that any loops are actually being run.

干净又极端
#3 · 2019-03-27 11:08

At startup, read a value from a file. In your code, write something like if (input == "x") cout << result_of_benchmark;

The compiler will not be able to eliminate the calculation, and if you ensure the input is not "x", you won't benchmark the iostream.
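
A rough sketch of that trick, with a hypothetical heavy_computation() standing in for the real feature and flag.txt as a made-up input file:

#include <fstream>
#include <iostream>
#include <string>

// Hypothetical stand-in for the code being benchmarked.
double heavy_computation() {
    double sum = 0.0;
    for (long i = 0; i < 100000000; ++i)
        sum += i * 0.5;
    return sum;
}

int main() {
    std::string input;
    std::ifstream flag("flag.txt");   // contents unknown at compile time
    flag >> input;

    double result = heavy_computation();

    // The compiler cannot prove this branch is dead, so the computation
    // must be kept; as long as flag.txt never contains "x", the iostream
    // call never runs and never pollutes the measurement.
    if (input == "x")
        std::cout << result << '\n';
    return 0;
}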

放荡不羁爱自由
#4 · 2019-03-27 11:12

I don't know if GCC has a similar feature, but with VC++ you can use:

#pragma optimize

to selectively turn optimizations on/off. If GCC has similar capabilities, you could build with full optimization and just turn it off where necessary to make sure your code gets called.
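
With MSVC the pragma is used roughly like this (the function is just a placeholder):

// MSVC: compile the functions between the pragmas without optimization,
// then restore the settings given on the command line.
#pragma optimize("", off)
double call_feature(double x) {
    return x * 2.0;   // placeholder for the call into the optimized library
}
#pragma optimize("", on)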

够拽才男人
#5 · 2019-03-27 11:14

If this is possible for you, you might try splitting your code into:

  • the library you want to test, compiled with all optimizations turned on
  • a test program, dynamically linking the library, compiled with optimizations turned off

Otherwise, you might specify a different optimization level (it looks like you're using gcc...) for the test function with the optimize attribute (see http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#Function-Attributes).
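
For example, with GCC's optimize attribute (run_benchmark is a made-up test driver):

// GCC: build everything with -O2/-O3, but compile just this test driver
// at -O0 so the calls into the optimized library are not elided.
__attribute__((optimize("O0")))
void run_benchmark()
{
    // call the optimized library feature here, in a loop
}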

放荡不羁爱自由
#6 · 2019-03-27 11:22

If you want to force any compiler to not discard a result, have it write the result to a volatile object. That operation cannot be optimized out, by definition.

template<typename T> void sink(T const& t) {
   // Writing to a volatile object is an observable side effect,
   // so this copy (and everything needed to produce t) must be kept.
   volatile T sinkhole = t;
}

No iostream overhead, just a copy that has to remain in the generated code.

Now, if you're collecting results from a lot of operations, it's best not to discard them one by one, since these copies can still add some overhead. Instead, collect all results in a single non-volatile object (so every individual result is needed) and then assign that result object to a volatile. E.g. if your individual operations all produce strings, you can force evaluation by adding all char values together modulo 1<<32. This adds hardly any overhead; the strings will likely be in cache. The result of the addition is subsequently assigned to a volatile, so each char in each string must in fact be calculated, with no shortcuts allowed.
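
For example, a driver using the sink() defined above could look like this (compute_one_result is a hypothetical stand-in for the real per-iteration work):

#include <cstddef>

// Hypothetical stand-in for one call to the feature under test.
unsigned long compute_one_result(std::size_t i) { return i * 3ul; }

// Fold every per-iteration result into one checksum, then make a single
// volatile copy at the very end: one cheap write keeps the entire loop alive.
void benchmark_loop(std::size_t n) {
    unsigned long checksum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        checksum += compute_one_result(i);
    }
    sink(checksum);
}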
