unsigned int lo = 0;
unsigned int hi = 0;
__asm__ __volatile__ (
    "mfence;rdtsc" : "=a"(lo), "=d"(hi) : : "memory"
);
Is the mfence in the above code necessary?
In my test, I did not observe any CPU reordering.
The fragment of test code is included below.
inline uint64_t clock_cycles() {
    unsigned int lo = 0;
    unsigned int hi = 0;
    __asm__ __volatile__ (
        "rdtsc" : "=a"(lo), "=d"(hi)
    );
    return ((uint64_t)hi << 32) | lo;
}
/* uint64_t keeps the full counter; plain unsigned would drop the upper 32 bits */
uint64_t t1 = clock_cycles();
uint64_t t2 = clock_cycles();
assert(t2 > t1);
What you need to perform a sensible measurement with rdtsc is a serializing instruction.
As is well known, a lot of people use cpuid before rdtsc.
rdtsc needs to be serialized from above and below (read: all instructions before it must be retired, and it must be retired before the test code starts).
Unfortunately, the second condition is often neglected because cpuid is a very bad choice for this task (it clobbers the output of rdtsc).
When looking for alternatives, people think that instructions that have "fence" in their names will do, but this is also untrue. Straight from Intel:
An instruction that is almost serializing, and that will do in any measurement where previous stores don't need to complete, is lfence.
Simply put, lfence makes sure that no new instruction starts before every prior instruction has completed locally. See this answer of mine for a more detailed explanation on locality.
It also doesn't drain the Store Buffer like mfence does and doesn't clobber the registers like cpuid does.
So lfence / rdtsc / lfence is a better crafted sequence of instructions than mfence / rdtsc, where mfence is pretty much useless unless you explicitly want the previous stores to be completed before the test begins/ends (but not before rdtsc is executed!).
If your test to detect reordering is assert(t2 > t1), then I believe you will test nothing.
Leaving aside the return and the call, which may or may not prevent the CPU from seeing the second rdtsc in time for a reorder, it is unlikely (though possible!) that the CPU will reorder two rdtsc instructions even if one is right after the other.
Imagine we have a rdtsc2 that is exactly like rdtsc but writes ecx:ebx [1].
Executing the two one after the other, it is highly likely that ecx:ebx > edx:eax, because the CPU has no reason to execute rdtsc2 before rdtsc.
Reordering doesn't mean random ordering; it means looking for another instruction to execute if the current one cannot be executed.
But rdtsc has no dependency on any previous instruction, so it's unlikely to be delayed when encountered by the OoO core.
However, peculiar internal micro-architectural details may invalidate my thesis, hence the word "likely" in my previous statement.
[1] We don't need this altered instruction: register renaming would achieve the same thing, but in case you are not familiar with it, this will help.
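The lfence / rdtsc / lfence sequence suggested in this answer could be sketched as follows. This is only an illustration, assuming an x86 or x86-64 target and GCC/Clang extended inline asm; the function name clock_cycles_fenced is hypothetical and does not come from the original code.

```c
#include <stdint.h>

/* Sketch only: assumes x86/x86-64 and GCC/Clang inline asm.
   clock_cycles_fenced is a hypothetical name chosen for this example. */
static inline uint64_t clock_cycles_fenced(void) {
    uint32_t lo, hi;
    __asm__ __volatile__ (
        "lfence\n\t"   /* earlier instructions complete locally before rdtsc starts */
        "rdtsc\n\t"    /* time-stamp counter goes to edx:eax */
        "lfence"       /* later instructions do not start before rdtsc is done */
        : "=a"(lo), "=d"(hi)
        :
        : "memory"     /* compiler-level barrier, as in the question's code */
    );
    return ((uint64_t)hi << 32) | lo;
}
```

A measurement would then read the counter once before and once after the code under test and subtract the two 64-bit values.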
mfence is there to force serialization in the CPU before rdtsc.
Usually you will find cpuid there (which is also a serializing instruction).
A quote from the Intel manuals about using rdtsc will make it clearer
TL;DR version - without a serializing instruction before rdtsc, you have no idea when that instruction started to execute, which makes the measurements possibly incorrect.
HINT - use rdtscp when possible.
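As a rough sketch of the hint above, rdtscp can be reached through the __rdtscp intrinsic that GCC and Clang provide in x86intrin.h. The wrapper name clock_cycles_rdtscp is hypothetical, and the example assumes a CPU that supports rdtscp. Note that rdtscp waits for earlier instructions to execute, but it does not by itself stop later instructions from starting early.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp on GCC/Clang, x86 only */

/* Sketch only: clock_cycles_rdtscp is a hypothetical name.
   Assumes the CPU supports the rdtscp instruction. */
static inline uint64_t clock_cycles_rdtscp(void) {
    unsigned int aux;        /* receives the IA32_TSC_AUX value, unused here */
    return __rdtscp(&aux);   /* returns the 64-bit time-stamp counter */
}
```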
Not observing reordering in a test is still no guarantee that it cannot happen - that's why the original code had
"memory"
to indicate a possible memory clobber, which prevents the compiler from reordering memory accesses around it.
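To illustrate what the "memory" clobber does on its own, here is a minimal sketch of a pure compiler barrier, assuming GCC/Clang inline asm. The names compiler_barrier, publish, payload and ready are hypothetical. The barrier only restricts the compiler; it emits no instruction and places no constraint on the CPU.

```c
/* Sketch only: all names here are hypothetical and used for illustration. */
static inline void compiler_barrier(void) {
    /* Empty asm: no instruction is emitted, but the "memory" clobber tells
       the compiler that memory may have been read or written here. */
    __asm__ __volatile__ ("" : : : "memory");
}

static int payload;
static int ready;

static void publish(int value) {
    payload = value;
    compiler_barrier();   /* the compiler may not move the payload store below this point */
    ready = 1;
}
```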