Issues of gcc optimization with rdtsc

Posted 2020-06-07 05:03

Question:

I am using the rdtsc and cpuid instructions (via volatile inline assembly) to measure the CPU cycles of a program. The rdtsc instruction gives realistic results for my programs on Linux (with speed optimization -O2 -fomit-frame-pointer) and on Windows (using the speed optimization options of the C compiler for MS Visual Studio 2008; I think it's VC 9.0).

Recently, I implemented a new program which uses a lot of table lookups and the like. However, the rdtsc measurements of this program with gcc optimization on Linux are always wrong: the cycle counts are far smaller than I expect. The rdtsc measurements of the same program on Windows (compiled with the optimizations and compiler mentioned above) are realistic and agree with our expectations.

My question: is there any way gcc optimization could move the volatile assembly instructions somewhere else and so produce the behaviour described above?

My code for the timers is given below:

#define TIMER_VARS                                                 \
  uint32 start_lo, start_hi;                                       \
  uint32 ticks_lo, ticks_hi

#define TIMER_START()                                              \
  __asm__ __volatile__                                             \
     ("rdtsc"                                                      \
     : "=a" (start_lo), "=d" (start_hi) /* a = eax, d = edx*/      \
     : /* no input parameters*/                                    \
     : "%ebx", "%ecx", "memory")

#define TIMER_STOP()                                               \
  __asm__ __volatile__                                             \
     ("rdtsc"                                                      \
     "\n        subl %2, %%eax"                                    \
     "\n        sbbl %3, %%edx"                                    \
     : "=&a" (ticks_lo), "=&d" (ticks_hi)                          \
     : "g" (start_lo), "g" (start_hi)                              \
     : "%ebx", "%ecx", "memory")

I would be very thankful if somebody could suggest some ideas on this.

thanks,

Answer 1:

In order to prevent an inline rdtsc function from being moved across any loads/stores/other operations, you should both write the asm as __asm__ __volatile__ and include "memory" in the clobber list. Without doing the latter, GCC is prevented from removing the asm or moving it across any instructions that could need the results (or change the inputs) of the asm, but it could still move it with respect to unrelated operations. The "memory" clobber means that GCC cannot make any assumptions about memory contents (any variable whose address has been potentially leaked) remaining the same across the asm, and thus it becomes much more difficult to move it. However, GCC may still be able to move the asm across instructions that only modify local variables whose address was never taken (since they are not "memory").
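To make the point about local variables concrete, here is a minimal sketch, not a definitive implementation. The names read_tsc and timed_table_sum are hypothetical and do not come from the question. It assumes GCC or Clang on x86 or x86-64, and it relies on the compiler keeping volatile asm statements in order relative to each other, which is the usual behaviour but worth confirming in the generated assembly. The empty asm with a "+r" operand ties the register-only result to a point before the second rdtsc, which a "memory" clobber alone would not do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read the time-stamp counter (x86 / x86-64, GCC or Clang only).
 * "=a" and "=d" name EAX and EDX explicitly, so this builds for 32- and
 * 64-bit targets. volatile keeps the asm from being deleted or merged;
 * the "memory" clobber stops memory accesses from moving across it. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical workload: sum a lookup table and report the elapsed
 * cycles through *cycles. Returns the sum. */
static uint32_t timed_table_sum(const uint8_t *table, size_t n,
                                uint64_t *cycles)
{
    assert(table != NULL && cycles != NULL);

    uint64_t start = read_tsc();

    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += table[i];

    /* `sum` lives in a register, so the "memory" clobber does not cover it.
     * Passing it through an empty volatile asm as an in/out operand forces
     * the final value to be computed before this statement executes. */
    __asm__ __volatile__("" : "+r"(sum) : : "memory");

    *cycles = read_tsc() - start;
    return sum;
}
```

Calling timed_table_sum(table, 256, &cycles) fills cycles with the measured difference. Note that rdtsc is not a serializing instruction, so the CPU itself may still reorder it slightly; this sketch only addresses reordering by the compiler.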

Oh, and as wildplasser said in a comment, check the asm output before you waste a lot of time on this.



Answer 2:

I don't know if it is (or was) correct, but the code I once used was:

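/* Caution: "=A" names the edx:eax pair only on 32-bit x86. On x86-64 it
   lets gcc pick either rax or rdx, so the 64-bit result is wrong; there,
   use separate "=a" and "=d" outputs and combine the two halves. */
#define rdtscll(val) \
      __asm__ __volatile__("rdtsc" : "=A" (val))

typedef unsigned long long Ull;

static inline Ull myget_cycles (void)
{
    Ull ret;

    rdtscll(ret);
    return ret;
}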

I remember it was "slower" on Intel than on AMD. YMMV.