I'm trying to optimize some code that's supposed to read single precision floats from memory and perform arithmetic on them in double precision. This is becoming a significant performance bottleneck, as the code that stores data in memory as single precision is substantially slower than equivalent code that stores data in memory as double precision. Below is a toy C++ program that captures the essence of my issue:
#include <cstdio>
// noinline to force main() to actually read the value from memory.
__attributes__ ((noinline)) float* GetFloat() {
float* f = new float;
*f = 3.14;
return f;
}
int main() {
float* f = GetFloat();
double d = *f;
printf("%f\n", d); // Use the value so it isn't optimized out of existence.
}
Both GCC and Clang perform the loading of *f
and conversion to double precision as two separate instructions even though the cvtss2sd
instruction supports memory as the source argument. According to Agner Fog, cvtss2sd r, m
executes as fast as movss r, m
on most architectures, and avoids needing to execute cvtss2sd r, r
afterwords. Nonetheless, Clang generates the following code for main()
:
main PROC
push rbp ;
mov rbp, rsp ;
call _Z8GetFloatv ;
movss xmm0, dword ptr [rax] ;
cvtss2sd xmm0, xmm0 ;
mov edi, offset ?_001 ;
mov al, 1 ;
call printf ;
xor eax, eax ;
pop rbp ;
ret ;
main ENDP
GCC generates similarly inefficient code. Why don't either of these compilers simply generate something like cvtss2sd xmm0, dword ptr [rax]
?
EDIT: Great answer, Stephen Canon! I took the assembly language output of Clang for my real use case, pasted it into a source file as inline ASM, benchmarked it, then made the changes discussed here and benchmarked it again. I couldn't believe that cvtss2sd [memory]
is actually slower.
This is actually an optimization. CVTSS2SD from memory leaves the high 64 bits of the destination register unchanged. This means that a partial-register update occurs, which can incur a significant stall and greatly reduce ILP in many circumstances. MOVSS, on the other hand, zeros the unused bits of the register, which is dependency-breaking, and avoids the risk of the stall.
You may well have a bottleneck on conversion to double, but this isn't it.
I'll expand a little bit on exactly why the partial register update is a performance hazard.
I have no idea what computation is actually being performed, but let's suppose that it looks like this very simple example:
double accumulator, x;
float y[n];
for (size_t i=0; i<n; ++i) {
accumulator += x*(double)y[i];
}
The "obvious" codegen for the loop looks something like this:
loop_begin:
cvtss2sd xmm0, [y + 4*i]
mulsd xmm0, x
addsd accumulator, xmm0
// some loop arithmetic that I'll ignore; it isn't important.
Naively, the only loop-carried dependency is in the accumulator update, so asymptotically the loop should run at a speed of 1/(addsd
latency), which is 3 cycles per loop iteration on current "typical" x86 cores (see Agner Fog's tables or Intel's Optimization Manual for more details).
However, if we actually look at the operation of these instructions, we see that the high 64 bits of xmm0, even though they have no effect on the result we are interested in, form a second loop-carried dependency chain. Each cvtss2sd
instruction cannot begin until the result of the preceding loop iteration's mulsd
is available; this bounds the actual speed of the loop to 1/(cvtss2sd
latency + mulsd
latency), or 7 cycles per loop iteration on typical x86 cores (the good news is that you only pay the reg-reg conversion latency, because the conversion operation is cracked into two µops, and the load µop does not have a dependency on xmm0
, so it can be hoisted).
We can write out the operation of this loop as follows to make it a bit more clear (I'm ignoring the load-half of the cvtss2sd
, as those µops are nearly unconstrained and can happen more-or-less whenever):
cycle iteration 1 iteration 2 iteration 3
------------------------------------------------
0 cvtss2sd
1 .
2 mulsd
3 .
4 .
5 .
6 . --- xmm0[64:127]-->
7 addsd cvtss2sd(*)
8 . .
9 .-- accum -+ mulsd
10 | .
11 | .
12 | .
13 | . --- xmm0[64:127]-->
14 +-> addsd cvtss2sd
15 . .
(*) I'm actually simplifying things a bit; we need to consider not only latency but also port utilization in order to make this accurate. Considering only latency suffices to illustrate the stall in question, however, so I'm keeping it simple. Pretend we are running on a machine with infinite ILP resources.
Now suppose that we write the loop like this instead:
loop_begin:
movss xmm0, [y + 4*i]
cvtss2sd xmm0, xmm0
mulsd xmm0, x
addsd accumulator, xmm0
// some loop arithmetic that I'll ignore; it isn't important.
Because movss
from memory zeros bits [32:127] of xmm0, there is no longer a loop-carried dependency on xmm0, so we are bound by accumulation latency, as expected; execution at steady state looks something like this:
cycle iteration i iteration i+1 iteration i+2
------------------------------------------------
0 cvtss2sd .
1 . .
2 mulsd . movss
3 . cvtss2sd .
4 . . .
5 . mulsd .
6 . . cvtss2sd
7 addsd . .
8 . . mulsd
9 . . .
10 . -- accum --> addsd .
11 . .
12 . .
13 . -- accum --> addsd
Note that in my toy example, there's still a lot more to be done to optimize the code in question after eliminating the partial-register-update stall. It can be vectorized, and multiple accumulators can be used (at the cost of changing the specific rounding that occurs) to minimize the effect of the loop-carried accumulate-to-accumulate latency.