Best way to test code speed in C++ without a profiler

Posted 2019-03-16 01:58

On SO, there are quite a few questions about performance profiling, but I can't seem to find the whole picture. There are quite a few issues involved, and most Q&As ignore all but a few at a time, or don't justify their proposals.

What I'm wondering about: if I have two functions that do the same thing and I'm curious about the difference in speed, does it make sense to test this without external tools, using timers compiled into the code, or will this in-program testing affect the results too much?

I ask this because, if it is sensible, then as a C++ programmer I want to know how it is best done, since timers are much simpler than external tools. If it makes sense, let's proceed with all the possible pitfalls:

Consider this example. The following code shows 2 ways of doing the same thing:

#include <algorithm>
#include <cstddef>
#include <ctime>
#include <iostream>

typedef unsigned char byte;

inline
void
swapBytes( byte* in, size_t n )
{
   // XOR-swap the bytes at the two ends of the buffer, walking inwards.
   for( size_t lo=0, hi=n-1; hi>lo; ++lo, --hi )

      in[lo] ^= in[hi]
   ,  in[hi] ^= in[lo]
   ,  in[lo] ^= in[hi] ;
}

int
main()
{
         byte    arr[9]     = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };
   const int     iterations = 100000000;
         clock_t begin      = clock();

   for( int i=iterations; i!=0; --i ) 

      swapBytes( arr, 8 );

   clock_t middle = clock();

   for( int i=iterations; i!=0; --i ) 

      std::reverse( arr, arr+8 );

   clock_t end = clock();

   double secSwap = (double) ( middle-begin ) / CLOCKS_PER_SEC;
   double secReve = (double) ( end-middle   ) / CLOCKS_PER_SEC;


   std::cout << "swapBytes,    for:    "   << iterations << " times takes: " << middle-begin
             << " clock ticks, which is: " << secSwap    << "sec."           << std::endl;

   std::cout << "std::reverse, for:    "   << iterations << " times takes: " << end-middle
             << " clock ticks, which is: " << secReve    << "sec."           << std::endl;

   std::cin.get();
   return 0;
}

// Output:

// Release:
//  swapBytes,    for: 100000000 times takes: 3000 clock ticks, which is: 3sec.
//  std::reverse, for: 100000000 times takes: 1437 clock ticks, which is: 1.437sec.

// Debug:
//  swapBytes,    for: 10000000 times takes: 1781  clock ticks, which is: 1.781sec.
//  std::reverse, for: 10000000 times takes: 12781 clock ticks, which is: 12.781sec.

The issues:

  1. Which timers should be used, and how do I get the CPU time actually consumed by the code in question?
  2. What are the effects of compiler optimization (since these functions just swap bytes back and forth, the most efficient thing is obviously to do nothing at all)?
  3. Considering the results presented here, do you think they are accurate (I can assure you that multiple runs give very similar results)? If yes, can you explain how std::reverse gets to be so fast, considering the simplicity of the custom function? I don't have the source code of the VC++ version that I used for this test, but here is the implementation from GNU. It boils down to the function iter_swap, which is completely incomprehensible to me. Would that also be expected to run twice as fast as the custom function, and if so, why?

Contemplations:

  1. It seems two high-precision timers are being proposed: clock() and QueryPerformanceCounter (on Windows). Obviously we would like to measure the CPU time of our code and not the real time, but as far as I understand, these functions don't give that functionality, so other processes on the system would interfere with measurements. This page on the GNU C Library seems to contradict that, but when I put a breakpoint in VC++, the debugged process gets a lot of clock ticks even though it was suspended (I have not tested under GNU). Am I missing alternative counters for this, or do we need at least special libraries or classes for this (a sketch of one possibility appears after this list)? If not, is clock() good enough in this example, or would there be a reason to use QueryPerformanceCounter?

  2. What can we know for certain without debugging, disassembling and profiling tools? Is anything actually happening? Is the function call being inlined or not? When checking in the debugger, the bytes do actually get swapped, but I'd rather know why from theory than from testing.
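
As a sketch of the kind of thing I mean for point 1 (assuming Windows; the helper name processCpuSeconds is made up, and error handling is minimal): GetProcessTimes reports the kernel- and user-mode CPU time of the current process in 100-nanosecond units, so something like this could stand in for clock() in the test above:

#include <windows.h>
#include <iostream>

// CPU time (user + kernel) consumed so far by this process, in seconds.
// The FILETIME values are in 100-nanosecond units.
double processCpuSeconds()
{
   FILETIME creation, exit, kernel, user;
   if( !GetProcessTimes( GetCurrentProcess(), &creation, &exit, &kernel, &user ) )
      return -1.0;

   ULARGE_INTEGER k, u;
   k.LowPart = kernel.dwLowDateTime;   k.HighPart = kernel.dwHighDateTime;
   u.LowPart = user.dwLowDateTime;     u.HighPart = user.dwHighDateTime;

   return ( k.QuadPart + u.QuadPart ) * 100e-9;   // 100 ns -> seconds
}

int main()
{
   double before = processCpuSeconds();

   // ... run the code under test here ...

   double after = processCpuSeconds();
   std::cout << "CPU time: " << after - before << " sec" << std::endl;
   return 0;
}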

Thanks for any directions.

update

Thanks to a hint from tojas, the swapBytes function now runs as fast as std::reverse. I had failed to realize that the temporary copy, in the case of a byte, must be only a register, and is thus very fast. Elegance can blind you.

inline
void
swapBytes( byte* in, size_t n )
{
   byte t;

   // Swap through a temporary; for a byte the temporary stays in a register.
   for( size_t i=0, j=n-1; i<j; ++i, --j )
    {
        t     = in[i];
        in[i] = in[j];
        in[j] = t;
    }
}
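
For comparison, the same loop written with std::swap should compile down to the same register moves (a sketch, assuming the byte typedef from above; the name swapBytesStd is just for illustration):

#include <algorithm>   // std::swap

inline
void
swapBytesStd( byte* in, size_t n )
{
   for( size_t lo=0, hi=n-1; lo<hi; ++lo, --hi )
      std::swap( in[lo], in[hi] );   // the temporary lives in a register, as above
}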

Thanks to a tip from ChrisW I have found that on Windows you can get the actual CPU time consumed by a (read: your) process through Windows Management Instrumentation. This definitely looks more interesting than the high-precision counter.

8 answers

一夜七次 · Answer 2 · 2019-03-16 02:36

I would suppose that anyone competent enough to answer all your questions is going to be far too busy to answer all your questions. In practice it is probably more effective to ask a single, well-defined question. That way you may hope to get well-defined answers which you can collect, and be on your way to wisdom.

So, anyway, perhaps I can answer your question about which clock to use on Windows.

clock() is not considered a high-precision clock. If you look at the value of CLOCKS_PER_SEC you will see it has a resolution of 1 millisecond. This is only adequate if you are timing very long routines, or a loop with tens of thousands of iterations. As you point out, if you try to repeat a simple method tens of thousands of times in order to get a time that can be measured with clock(), the compiler is liable to step in and optimize the whole thing away.

So, really, the only clock to use is QueryPerformanceCounter().
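
A minimal usage sketch (assuming a Windows build; error checks omitted):

#include <windows.h>
#include <iostream>

int main()
{
   LARGE_INTEGER freq, start, stop;
   QueryPerformanceFrequency( &freq );   // counts per second

   QueryPerformanceCounter( &start );

   // ... code under test ...

   QueryPerformanceCounter( &stop );

   double seconds = double( stop.QuadPart - start.QuadPart ) / double( freq.QuadPart );
   std::cout << "elapsed: " << seconds << " sec" << std::endl;
   return 0;
}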

再贱就再见 · Answer 3 · 2019-03-16 02:42

To answer your main question: the "reverse" algorithm just swaps elements of the array; it does not operate on the values of the elements themselves.
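
In other words, a typical std::reverse boils down to something like the following sketch. This is not the actual library source, only its general shape; std::iter_swap is essentially a swap through a temporary:

#include <algorithm>   // std::swap

// Roughly the shape of std::reverse for random-access iterators:
// walk inwards from both ends and swap pairs through a temporary.
template< typename RandomIt >
void reverse_sketch( RandomIt first, RandomIt last )
{
   if( first == last )
      return;
   --last;
   while( first < last )
   {
      std::swap( *first, *last );   // what std::iter_swap boils down to
      ++first;
      --last;
   }
}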

迷人小祖宗 · Answer 4 · 2019-03-16 02:43

Is there something you have against profilers? They help a ton. Since you are on WinXP, you should really give the trial version of VTune a try. Try a call-graph sampling test and look at the self time and total time of the functions being called. There's no better way to tune your program so that it's the fastest possible without being an assembly genius (and a truly exceptional one).

Some people just seem to be allergic to profilers. I used to be one of those and thought I knew best where my hotspots were. I was often correct about obvious algorithmic inefficiencies, but practically always incorrect about more micro-optimization cases. Just rewriting a function without changing any of the logic (e.g. reordering things, putting exceptional-case code in a separate, non-inlined function, etc.) can make functions a dozen times faster, and even the best disassembly experts usually can't predict that without the profiler.

As for relying on simplistic timing tests alone, they are extremely problematic. The current test is not so bad, but it's a very common mistake to write timing tests in ways in which the optimizer will optimize out dead code and end up testing the time it takes to do essentially a nop, or even nothing at all. You need enough knowledge to interpret the disassembly to make sure the compiler isn't doing this; one way to guard against it is shown in the sketch below.
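
One guard against this (a sketch, not the only way) is to fold the result of every iteration into a value the program observably uses, for example a volatile sink. It adds a small per-iteration cost, but the optimizer can no longer discard the work:

#include <algorithm>
#include <iostream>

typedef unsigned char byte;

int main()
{
   byte arr[8] = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };
   const int iterations = 100000000;

   volatile unsigned sink = 0;   // writes to a volatile cannot be optimized away

   for( int i = iterations; i != 0; --i )
   {
      std::reverse( arr, arr + 8 );
      sink = sink + arr[0];      // consume the result every iteration
   }

   std::cout << "checksum: " << sink << std::endl;   // keeps arr observably used
   return 0;
}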

Also, timing tests like this have a tendency to bias the results significantly, since a lot of them just involve running your code over and over in the same loop, which tends to test the behavior of your code when all the memory is in the cache and the branch predictor is working perfectly for it. It's often just showing you best-case scenarios without showing you the average, real-world case.

Relying on real-world timing tests is a little better: something closer to what your application will be doing at a high level. It won't give you specifics about what is taking what amount of time, but that's precisely what the profiler is meant to do.

何必那么认真 · Answer 5 · 2019-03-16 02:46

Use QueryPerformanceCounter on Windows if you need high-resolution timing. The counter resolution depends on the CPU, but it can be as fine as a single clock pulse. However, profiling in real-world operations is always a better idea.

家丑人穷心不美 · Answer 6 · 2019-03-16 02:50

"Obviously we would like to measure the CPU time of our code and not the real time, but as far as I understand, these functions don't give that functionality, so other processes on the system would interfere with measurements."

I do two things, to ensure that wall-clock time and CPU time are approximately the same thing:

  • Test for a significant length of time, i.e. several seconds (e.g. by testing a loop of however many thousands of iterations)

  • Test when the machine is more or less idle, except for whatever I'm testing.

Alternatively if you want to measure only/more exactly the CPU time per thread, that's available as a performance counter (see e.g. perfmon.exe).
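
Programmatically, the same per-thread CPU time can be read with GetThreadTimes (a Windows-only sketch; error handling omitted, and the helper name threadCpuSeconds is made up):

#include <windows.h>

// CPU time (user + kernel) consumed so far by the calling thread, in seconds.
double threadCpuSeconds()
{
   FILETIME creation, exit, kernel, user;
   GetThreadTimes( GetCurrentThread(), &creation, &exit, &kernel, &user );

   ULARGE_INTEGER k, u;
   k.LowPart = kernel.dwLowDateTime;   k.HighPart = kernel.dwHighDateTime;
   u.LowPart = user.dwLowDateTime;     u.HighPart = user.dwHighDateTime;

   return ( k.QuadPart + u.QuadPart ) * 100e-9;   // FILETIME units are 100 ns
}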

"What can we know for certain without debugging, disassembling and profiling tools?"

Nearly nothing (except that I/O tends to be relatively slow).

我欲成王，谁敢阻挡 · Answer 7 · 2019-03-16 02:54

Is it safe to say you're asking two questions?

  • Which one is faster, and by how much?

  • And why is it faster?

For the first, you don't need high precision timers. All you need to do is run them "long enough" and measure with low precision timers. (I'm old-fashioned, my wristwatch has a stop-watch function, and it is entirely good enough.)

For the second, surely you can run the code under a debugger and single-step it at the instruction level. Since the basic operations are so simple, you will be able to easily see roughly how many instructions are required for the basic cycle.

Think simple. Performance is not a hard subject. Usually, people are trying to find problems, for which this is a simple approach.
