struct Big {
int a[8];
};
void foo(Big a);
Big getStuff();
void test1() {
foo(getStuff());
}
compiles (using clang 6.0.0 for x86_64 on Linux so System V ABI, flags: -O3 -march=broadwell
) to
test1(): # @test1()
sub rsp, 72
lea rdi, [rsp + 40]
call getStuff()
vmovups ymm0, ymmword ptr [rsp + 40]
vmovups ymmword ptr [rsp], ymm0
vzeroupper
call foo(Big)
add rsp, 72
ret
If I am reading this correctly, this is what is happening:
getStuff
is passed a pointer to foo
's stack (rsp + 40
) to use for its return value, so after getStuff
returns rsp + 40
through to rsp + 71
contains the result of getStuff
.
- This result is then immediately copied to a lower stack address
rsp
through to rsp + 31
.
foo
is then called, which will read its argument from rsp
.
Why is the following code not totally equivalent (and why doesn't the compiler generate it instead)?
test1(): # @test1()
sub rsp, 32
mov rdi, rsp
call getStuff()
call foo(Big)
add rsp, 32
ret
The idea is: have getStuff
write directly to the place in the stack that foo
will read from.
Also:
Here is the result for the same code (with 12 ints instead of 8) compiled by vc++ on windows for x64, which seems even worse because the windows x64 ABI passes and returns by reference, so the copy is completely unused!
_TEXT SEGMENT
$T3 = 32
$T1 = 32
?bar@@YAHXZ PROC ; bar, COMDAT
$LN4:
sub rsp, 88 ; 00000058H
lea rcx, QWORD PTR $T1[rsp]
call ?getStuff@@YA?AUBig@@XZ ; getStuff
lea rcx, QWORD PTR $T3[rsp]
movups xmm0, XMMWORD PTR [rax]
movaps XMMWORD PTR $T3[rsp], xmm0
movups xmm1, XMMWORD PTR [rax+16]
movaps XMMWORD PTR $T3[rsp+16], xmm1
movups xmm0, XMMWORD PTR [rax+32]
movaps XMMWORD PTR $T3[rsp+32], xmm0
call ?foo@@YAHUBig@@@Z ; foo
add rsp, 88 ; 00000058H
ret 0
You're right; this looks like a missed-optimization by the compiler. You can report this bug (https://bugs.llvm.org/) if there isn't already a duplicate.
Contrary to popular belief, compilers often don't make optimal code. It's often good enough, and modern CPUs are quite good at plowing through excess instructions when they don't lengthen dependency chains too much, especially the critical path dependency chain if there is one.
x86-64 SysV passes large structs by value on the stack if they don't fit packed into two 64-bit integer registers, and them returns via hidden pointer. The compiler can and should (but doesn't) plan ahead and reuse the return value temporary as the stack-args for the call to foo(Big)
.
gcc7.3, ICC18, and MSVC CL19 also miss this optimization. :/ I put your code up on the Godbolt compiler explorer with gcc/clang/ICC/MSVC. gcc uses 4x push qword [rsp+24]
to copy, while ICC uses extra instructions to align the stack by 32.
Using 1x 32-byte load/store instead of 2x 16-byte probably doesn't justify the cost of the vzeroupper
for MSVC / ICC / clang, for a function this small. vzeroupper
is cheap on mainstream Intel CPUs (only 4 uops), and I did use -march=haswell
to tune for that, not for AMD or KNL where it's more expensive.
Related: x86-64 Windows passes large structs by hidden pointer, as well as returning them that way. The callee owns the pointed-to memory. (What happens at assembly level when you have functions with large inputs)
This optimization would still be available by simply reserving space for the temporary + shadow-space before the first call to getStuff()
, and allowing the callee to destroy the temporary because we don't need it later.
That's not actually what MSVC does here or in related cases, though, unfortunately.
See also @BeeOnRope's answer, and my comments onit, on Why isn't pass struct by reference a common optimization?. Making sure the copy-constructor can always run at a sane place for non-trivially-copyable objects is problematic if you're trying to design a calling convention that avoids copying by passing by hidden const-reference (caller owns the memory, callee can copy if needed).
But this is an example of a case where non-const reference (callee owns the memory) is best, because the caller wants to hand off the object to the callee.
There's a potential gotcha, though: if there are any pointers to this object, letting the callee use it directly could introduce bugs. Consider some other function that does global_pointer->a[4]=0;
. If our callee calls that function, it will unexpectedly modify our callee's by-value arg.
So letting the callee destroy our copy of the object in the Windows x64 calling convention only works if escape analysis can prove that nothing else has a pointer to this object.