I am trying to understand the implications of the System V AMD64 ABI's calling convention, looking at the following example:
struct Vec3 {
    double x, y, z;
};
struct Vec3 do_something(void);
void use(struct Vec3 *out) {
    *out = do_something();
}
A Vec3 variable is of class MEMORY, and thus the caller (use) must allocate space for the returned value and pass it as a hidden pointer to the callee (i.e. do_something). That is what we see in the resulting assembly (on godbolt, compiled with -O2):
use:
    pushq %rbx
    movq %rdi, %rbx        ;remember out
    subq $32, %rsp         ;memory for returned object
    movq %rsp, %rdi        ;hidden pointer to %rdi
    call do_something
    movdqu (%rsp), %xmm0   ;copy memory to out
    movq 16(%rsp), %rax
    movups %xmm0, (%rbx)
    movq %rax, 16(%rbx)
    addq $32, %rsp         ;unwind/restore
    popq %rbx
    ret
I understand that an alias of the pointer out (e.g. a global variable) could be used in do_something, and thus out cannot be passed as the hidden pointer to do_something: if it were, *out would be changed inside do_something rather than when do_something returns, so some calculations might become faulty. For example, this version of do_something would return faulty results:
struct Vec3 global; //initialized somewhere
struct Vec3 do_something(void) {
    struct Vec3 res;
    res.x = 2*global.x;
    res.y = global.y+global.x;
    res.z = 0;
    return res;
}
If out were an alias for the global variable global and were used as the hidden pointer passed in %rdi, res would also be an alias of global, because the compiler would use the memory pointed to by the hidden pointer directly (a kind of RVO in C), without actually creating a temporary object and copying it on return. Then res.y would be 2*x+y (if x, y are the old values of global) and not x+y as for any other hidden pointer.
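To make the hazard concrete, here is a minimal sketch that lowers do_something by hand, assuming the callee builds its result directly in the caller-provided space. The names do_something_lowered and run_demo are hypothetical, the explicit ret parameter merely stands in for the hidden pointer in %rdi, and the sample values are arbitrary.

```c
struct Vec3 { double x, y, z; };

struct Vec3 global;

/* Hypothetical hand-lowered do_something: ret plays the role of the
   hidden pointer, and the result is written straight into *ret. */
void do_something_lowered(struct Vec3 *ret) {
    ret->x = 2 * global.x;
    ret->y = global.y + global.x;  /* global.x is read after the store above */
    ret->z = 0;
}

/* alias == 0: private buffer, then copy (what the compilers emit).
   alias != 0: the wished-for shortcut, with out pointing at global. */
struct Vec3 run_demo(int alias) {
    global = (struct Vec3){1.0, 10.0, 100.0};   /* old x = 1, old y = 10 */
    if (alias) {
        do_something_lowered(&global);
        return global;
    }
    struct Vec3 tmp;
    do_something_lowered(&tmp);
    return tmp;
}
```

With the private buffer, the y component comes out as x+y = 11; with the aliased pointer it comes out as 2*x+y = 12, because the store to ret->x has already overwritten global.x.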
It was suggested to me that using restrict should solve the problem, i.e.
void use(struct Vec3 *restrict out) {
    *out = do_something();
}
because now the compiler knows that there are no aliases of out which could be used in do_something, so the assembly could be as simple as this:
use:
    jmp do_something ; %rdi is now the hidden pointer
However, this is the case for neither gcc nor clang: the assembly stays unchanged (see on godbolt).
What prevents the usage of out as the hidden pointer?
NB: The desired (or a very similar) behavior is achieved for a slightly different function signature:
struct Vec3 use_v2(){
    return do_something();
}
which results in (see on godbolt):
use_v2:
    pushq %r12
    movq %rdi, %r12
    call do_something
    movq %r12, %rax
    popq %r12
    ret
A function is allowed to assume its return-value object (pointed to by a hidden pointer) is not the same object as anything else, i.e. that its output pointer (passed as a hidden first arg) doesn't alias anything.
You could think of this as the hidden first arg output pointer having an implicit restrict on it. (Because in the C abstract machine, the return value is a separate object, and the x86-64 System V ABI specifies that the caller provides space. x86-64 SysV doesn't give the caller license to introduce aliasing.)
Using an otherwise-private local as the destination (instead of separate dedicated space and then copying to a real local) is fine, but pointers that may point to something reachable another way must not be used. This requires escape analysis to make sure that a pointer to such a local hasn't been passed outside of the function.
I think the x86-64 SysV calling convention models the C abstract machine here by having the caller provide a real return-value object, not forcing the callee to invent that temporary if needed to make sure all the writes to the retval happened after any other writes. That's not what "the caller provides space for the return value" means, IMO.
That's definitely how GCC and other compilers interpret it in practice, which is a big part of what matters in a calling convention that's been around this long (since a year or two before the first AMD64 silicon, so very early 2000s).
Here's a case where your optimization would break if it were done:
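A sketch of such a case, reconstructed from the description that follows rather than taken verbatim from anywhere: the first and third elements of the return value match that description, while the middle element (glob3.z), the function name caller, and the initial values are assumptions made for illustration.

```c
struct Vec3 { double x, y, z; };

struct Vec3 glob3 = {1.0, 2.0, 3.0};

/* Copies glob3 into the return value in permuted order.  The middle
   element is an assumption of this sketch. */
struct Vec3 do_something(void) {
    return (struct Vec3){glob3.y, glob3.z, glob3.x};
}

void use(struct Vec3 *out) {
    *out = do_something();
}

/* Hypothetical caller: makes out point at the very object that
   do_something reads. */
void caller(void) {
    use(&glob3);
}
```

Under C semantics the call completes before the assignment, so after caller() the value of glob3 must be {2, 3, 1}. If glob3 itself were handed over as the return-value space and the callee stored the elements in source order, the third store would read an already overwritten glob3.x and yield {2, 3, 2}.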
With the optimization you're suggesting, do_something's output object would be glob3. But it also reads glob3.
A valid implementation for do_something would be to copy elements from glob3 to (%rdi) in source order, which would do glob3.x = glob3.y before reading glob3.x as the 3rd element of the return value.
That is in fact exactly what gcc -O1 does (Godbolt compiler explorer).
Notice the glob3.y, <retval>.x store before the load of glob3.x.
So without restrict anywhere in the source, GCC already emits asm for do_something that assumes no aliasing between the retval and glob3.
I don't think using struct Vec3 *restrict out would help at all: that only tells the compiler that inside use() you won't access the *out object through any other name. Since use() doesn't reference glob3, it's not UB to pass &glob3 as an arg to a restrict version of use.
I may be wrong here; @M.M argues in comments that *restrict out might make this optimization safe because the execution of do_something() happens during use(). (Compilers still don't actually do it, but maybe they would be allowed to for restrict pointers.)
Update: Richard Biener said in the GCC missed-optimization bug report that M.M is correct, and if the compiler can prove that the function returns normally (not exception or longjmp), the optimization is legal in theory (but still not something GCC is likely to look for):
There's a noexcept declaration, but there isn't (AFAIK) a nolongjmp declaration you can put on a prototype.
So that means it's only possible (even in theory) as an inter-procedural optimization when we can see the other function's body. Unless noexcept also means no longjmp.
The answers of @JohnBollinger and @PeterCordes cleared up a lot of things for me, but I decided to bug the gcc developers. Here is how I understand their answer.
As @PeterCordes has pointed out, the callee assumes that the hidden pointer is restrict. However, it also makes another (less obvious) assumption: the memory to which the hidden pointer points is uninitialized.
Why this is important is probably easier to see with the help of a C++ example:
do_something writes directly to the memory pointed to by %rdi (as shown in the multiple listings in this Q&A), and it is allowed to do so only because this memory is uninitialized: if func_which_throws() throws and the exception is caught somewhere, then nobody will know that we have changed only the x-component of the result, because nobody knows which original value it had prior to being passed to do_something (nobody could have read the original value, because it would be UB).
The above would break if the out pointer were passed as the hidden pointer, because it could be observed that only a part and not the whole memory was changed in case of an exception being thrown and caught.
Now, C has something similar to C++'s exceptions: setjmp and longjmp. I had never heard of them before, but it looks like, in comparison to the C++ example, setjmp is best described as try ... catch ... and longjmp as throw.
This means that also for C we must ensure that the space provided by the caller is uninitialized.
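The partial-write problem can be sketched in plain C with setjmp/longjmp. This is only an illustration under assumptions: do_something_lowered is a hand-lowered callee whose ret parameter stands in for the hidden pointer, func_which_jumps is a hypothetical C counterpart of func_which_throws(), and the two checker functions are invented for this sketch.

```c
#include <setjmp.h>

struct Vec3 { double x, y, z; };

static jmp_buf recovery_point;

/* Hypothetical C counterpart of func_which_throws(): never returns normally. */
double func_which_jumps(void) {
    longjmp(recovery_point, 1);
    return 0.0;  /* not reached */
}

/* Hand-lowered callee writing directly through the "hidden pointer". */
void do_something_lowered(struct Vec3 *ret) {
    ret->x = 0.0;
    ret->y = func_which_jumps();  /* control leaves here */
    ret->z = 0.0;
}

/* Shortcut: the target object itself is used as the return-value space.
   Returns 1 if only the x component ended up overwritten. */
int shortcut_leaves_partial_write(void) {
    static struct Vec3 target;  /* static, so its value is well defined after longjmp */
    target = (struct Vec3){1.0, 2.0, 3.0};
    if (setjmp(recovery_point) == 0) {
        do_something_lowered(&target);
    }
    return target.x == 0.0 && target.y == 2.0 && target.z == 3.0;
}

/* What the compilers emit: private buffer, copied only on normal return.
   Returns 1 if the target kept its old value. */
int private_buffer_keeps_target_intact(void) {
    static struct Vec3 target;
    target = (struct Vec3){1.0, 2.0, 3.0};
    if (setjmp(recovery_point) == 0) {
        struct Vec3 tmp;
        do_something_lowered(&tmp);
        target = tmp;  /* skipped because of the longjmp */
    }
    return target.x == 1.0 && target.y == 2.0 && target.z == 3.0;
}
```

In the shortcut variant the code after setjmp can observe a half-written object, whereas with the private buffer the target is untouched, which matches what the abstract machine requires when the assignment never happens.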
Even without setjmp/longjmp there are some other issues, among others: interoperability with C++ code, which has exceptions, and the -fexceptions option of the gcc compiler.
Corollary: The desired optimization would be possible if we had a qualifier for uninitialized memory (which we don't have), e.g. uninit; applying it to out would then do the trick.
Substantially rewritten:
Except with respect to aliasing considerations inside do_something(), the difference in timing with respect to when *out is modified is irrelevant, in the sense that use()'s caller cannot tell the difference. Such issues arise only with respect to accesses from other threads, and if that's a possibility then they arise anyway unless appropriate synchronization is applied.
No, the issue is primarily that the ABI defines how passing arguments to functions and receiving their return values works. It specifies that, for a return value of class MEMORY, "the caller provides space for the return value" (emphasis added).
I grant that there's room for interpretation, but I take that as a stronger statement than just that the caller specifies where to store the return value. That it "provides" space means to me that the space in question belongs to the caller (which your *out does not). By analogy with argument passing, there's good reason to interpret that more specifically as saying that the caller provides space on the stack (and therefore in its own stack frame) for the return value, which in fact is exactly what you observe, though that detail doesn't really matter.
With that interpretation, the called function is free to assume that the return-value space is disjoint from any space it can access via any pointer other than one of its arguments. That this is supplemented by a more general requirement that the return space not be aliased (i.e. not through the function arguments either) does not contradict that interpretation. It may therefore perform operations that would be incorrect if in fact the space were aliased to something else accessible to the function.
The compiler is not at liberty to depart from the ABI specifications if the function call is to work correctly with a separately-compiled do_something() function. In particular, with separate compilation, the compiler cannot make decisions based on characteristics of the function's caller, such as aliasing information known there. If do_something() and use() were in the same translation unit, then the compiler might choose to inline do_something() into use(), or it might choose to perform the optimization you're looking for without inlining, but it cannot safely do so in the general case.
restrict gives the compiler greater leeway to optimize, but that in itself does not give you any reason to expect specific optimizations that might then be possible. In fact, the language standard explicitly specifies that a translator is free to ignore any or all aliasing implications of uses of restrict (C2011, 6.7.3.1/6).
restrict-qualifying out expresses that the compiler doesn't need to worry about it being aliased to any other pointer accessed within the scope of a call to use(), including during the execution of other functions it calls. In principle, then, I could see a compiler taking advantage of that to shortcut the ABI by offering somebody else's space for the return value instead of providing space itself, but just because it could do so does not mean that it will.
As for what prevents the usage of out as the hidden pointer: ABI compliance. The caller is expected to provide space that belongs to it, not to someone else, for storage of the return value. As a practical matter, however, I don't see anything in the restrict-qualified case that would invalidate shortcutting the ABI, so I take it that that's just not an optimization that has been implemented by the compiler in question.
The use_v2 case looks like a tail-call optimization to me. I don't see anything inherently inconsistent in the compiler performing that optimization but not the one you're asking about, even though it is, to be sure, a different example of shortcutting the ABI.