I'm looking into measuring benchmark performance using the time-stamp counter (TSC) found in x86 CPUs. It's a useful register, since it counts in a monotonic unit of time that is immune to clock speed changes. Very cool.
Here is an Intel document showing asm snippets for reliably benchmarking using the TSC, including using cpuid for pipeline synchronisation. See page 16:
http://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
To read the start time, it says (I annotated a bit):
__asm volatile (
"cpuid\n\t" // writes e[abcd]x
"rdtsc\n\t" // writes edx, eax
"mov %%edx, %0\n\t"
"mov %%eax, %1\n\t"
//
:"=r" (cycles_high), "=r" (cycles_low) // outputs
: // inputs
:"%rax", "%rbx", "%rcx", "%rdx"); // clobber
I'm wondering why scratch registers are used to take the values of edx and eax. Why not remove the movs and read the timestamp value right out of edx and eax? Like this:
__asm volatile(
"cpuid\n\t"
"rdtsc\n\t"
//
: "=d" (cycles_high), "=a" (cycles_low) // outputs
: // inputs
: "%rbx", "%rcx"); // clobber
By doing this, you save two registers, reducing the likelihood of the C
compiler needing to spill.
Am I right? Or are those MOVs somehow strategic?
(I agree that you do need scratch registers to read the stop time, as
in that scenario the order of the instructions is reversed: you have
rdtscp, ..., cpuid. The cpuid instruction destroys the result of rdtscp).
Thanks
You're correct, the example is clunky. Usually if mov is the first or last instruction in an inline-asm statement, you're doing it wrong, and should have used a constraint to tell the compiler where you want the input, or where the output is.
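As a generic illustration of that rule (not taken from the Intel paper), here is a minimal sketch that byte-swaps a 32-bit value twice: once with hand-written movs around the real instruction, and once letting a "+r" constraint place the operand. The function names are made up for this example, and GNU C on x86 is assumed.

```c
#include <stdint.h>

/* Clunky: the asm copies the input into eax and the result back out by hand,
   so eax has to be listed as a clobber. */
static inline uint32_t bswap_with_movs(uint32_t x)
{
    uint32_t result;
    __asm__ ("mov %1, %%eax\n\t"
             "bswap %%eax\n\t"
             "mov %%eax, %0"
             : "=r" (result)   /* output: any general-purpose register */
             : "r" (x)         /* input: any general-purpose register */
             : "eax");         /* scratch register used inside the asm */
    return result;
}

/* Better: "+r" says x is read and written in whatever register the
   compiler picks, so no mov and no clobber are needed. */
static inline uint32_t bswap_with_constraint(uint32_t x)
{
    __asm__ ("bswap %0" : "+r" (x));
    return x;
}
```

Both functions compute the same value; the second simply gives the compiler the freedom to choose the register and skip the copies.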
See my GNU C inline asm guides / links collection, and other links in the inline-assembly tag wiki. (The x86 tag wiki is full of good stuff for asm in general, too.)
Or for rdtsc specifically, see "Get CPU cycle count?" for the __rdtsc() intrinsic, and good inline asm in @Mysticial's answer.
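For reference, a minimal sketch of the intrinsic route, assuming GCC or Clang where __rdtsc() comes from <x86intrin.h> (MSVC provides it in <intrin.h>). The wrapper name read_tsc is hypothetical.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC and Clang */

/* Returns the current 64-bit time-stamp counter value.
   No serialization is done here; see the discussion of cpuid / lfence. */
static inline uint64_t read_tsc(void)
{
    return __rdtsc();
}
```

The intrinsic already hands back the combined 64-bit value, so there is no need to deal with edx and eax separately.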
"it measures in a monotonic unit of time which is immune to the clock speed changing."
Yes, on CPUs made within the last 10 years or so.
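If you want to check a particular machine rather than rely on its age, CPUID reports an "invariant TSC" bit (leaf 0x80000007, EDX bit 8). A minimal sketch, assuming GCC or Clang and their <cpuid.h> helper __get_cpuid(); the function name has_invariant_tsc is made up for illustration:

```c
#include <cpuid.h>   /* __get_cpuid() helper shipped with GCC and Clang */

/* Returns 1 if CPUID advertises an invariant TSC, 0 otherwise
   (including when the extended leaf is not available at all). */
static int has_invariant_tsc(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid() returns 0 if the requested leaf is unsupported. */
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        return 0;

    return (edx >> 8) & 1;   /* EDX bit 8: invariant TSC */
}
```

Inside a virtual machine the hypervisor decides what this leaf reports, so treat the result as a hint about the environment you are actually running in.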
For profiling, it's often more useful to have times in core clock cycles, not wall-clock time, so your microbenchmark results don't depend on power-saving / turbo. Performance counters can do this and much more.
Still, if real time is what you want, rdtsc is the lowest-overhead way to get it.
And re: discussion in comments: yes, cpuid is there to serialize, making sure that rdtsc and following instructions can't begin executing until after CPUID. You could put another CPUID after RDTSC, but that would increase measurement overhead, and I think give near-zero gain in accuracy / precision.
LFENCE is a cheaper alternative that's useful with RDTSC. The instruction ref manual entry documents the fact that it doesn't let later instructions start executing until it and previous instructions have retired (from the ROB/RS in the out-of-order part of the core). See "Are loads and stores the only instructions that gets reordered?", and for a specific example of using it, see "clflush to invalidate cache line via C function". Unlike true serializing instructions like cpuid, it doesn't flush the store buffer.
(On recent AMD CPUs without Spectre mitigation enabled, lfence is not even partially serializing, and runs at 4 per clock according to Agner Fog's testing. See "Is LFENCE serializing on AMD processors?")
Margaret Bloom dug up this useful link, which also confirms that LFENCE serializes RDTSC according to Intel's SDM, and has some other stuff about how to do serialization around RDTSC.
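A minimal sketch of the LFENCE approach, assuming GCC or Clang on x86-64 (where SSE2, and therefore _mm_lfence(), is always available). The function name rdtsc_fenced is hypothetical, and the AMD caveat above still applies: this only orders execution on CPUs where lfence actually waits for earlier instructions.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() and _mm_lfence() on GCC and Clang */

/* Reads the TSC with an lfence on each side:
   - the first keeps rdtsc from starting before earlier instructions finish,
   - the second keeps later instructions from starting before rdtsc. */
static inline uint64_t rdtsc_fenced(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}
```

Whether you want both fences depends on which side of the timed region you are on; using both is the conservative choice and is still much cheaper than cpuid.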
No, there doesn't seem to be a good reason for the redundant MOV instructions in the inline assembly. The paper first introduces inline assembly with the following statement:
asm volatile (
"RDTSC\n\t"
"mov %%edx, %0\n\t"
"mov %%eax, %1\n\t": "=r" (cycles_high1), "=r" (cycles_low1));
This has the obvious problem that it doesn't tell the compiler that EAX and EDX have been modified by the RDTSC instruction. The paper points out this mistake and corrects it using clobbers:
asm volatile ("RDTSC\n\t"
"mov %%edx, %0\n\t"
"mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
"%eax", "%edx");
No other justification is given for writing it this way other than correcting the mistake in the previous example. It appears that the paper's author is simply unaware that it could be written more simply as:
asm volatile ("RDTSC\n\t"
: "=d" (cycles_high), "=a" (cycles_low));
Similarly the author is apparently unaware that there's a simpler version of the improved asm statement that uses RDTSC in combination with CPUID, as you demonstrate in your post.
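Putting the pieces together, here is a sketch of a start/stop pair under a few assumptions: GNU C on x86-64, a CPU that supports rdtscp, and function names (tsc_start, tsc_stop) invented for this example. The xor that zeroes eax before cpuid is my addition so that cpuid always runs the same leaf; it is not in the paper or in the question.

```c
#include <stdint.h>

/* Start: cpuid serializes, then rdtsc leaves its result in edx:eax,
   so the "=d" and "=a" outputs pick it up with no mov at all. */
static inline uint64_t tsc_start(void)
{
    uint32_t hi, lo;
    __asm__ volatile ("xor %%eax, %%eax\n\t"   /* select cpuid leaf 0 */
                      "cpuid\n\t"              /* writes eax, ebx, ecx, edx */
                      "rdtsc"                  /* writes edx:eax */
                      : "=a" (lo), "=d" (hi)
                      :
                      : "rbx", "rcx");
    return ((uint64_t)hi << 32) | lo;
}

/* Stop: cpuid comes after rdtscp and overwrites eax/edx, so here the
   result really does have to be copied to other registers first. */
static inline uint64_t tsc_stop(void)
{
    uint32_t hi, lo;
    __asm__ volatile ("rdtscp\n\t"             /* writes edx:eax and ecx */
                      "mov %%edx, %0\n\t"
                      "mov %%eax, %1\n\t"
                      "xor %%eax, %%eax\n\t"   /* select cpuid leaf 0 */
                      "cpuid"
                      : "=r" (hi), "=r" (lo)
                      :
                      : "rax", "rbx", "rcx", "rdx");
    return ((uint64_t)hi << 32) | lo;
}
```

A typical use is uint64_t t0 = tsc_start(); followed by the code under test and uint64_t t1 = tsc_stop();, with t1 - t0 as the elapsed count including the fixed overhead of the two reads.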
Note that the author of the paper repeatedly misuses the term "IA64" to refer to the 64-bit x86 instruction set and architecture (variously referred to as x86_64, AMD64 and Intel 64). The IA-64 architecture is actually something completely different: it's the one used by Intel's Itanium CPUs. It has no EAX or RAX registers, and no RDTSC instruction.
While it doesn't really matter that the author's inline assembly is more complex than it needs to be, this fact combined with the misuse of IA64, something that should've been caught by Intel's editors, makes me doubt the credibility of this paper.