Consider this code:
#include <utility>
#include <tuple>
std::pair<int, int> f1()
{
    return std::make_pair(0x111, 0x222);
}
std::tuple<int, int> f2()
{
    return std::make_tuple(0x111, 0x222);
}
Clang 3 and 4 generate similar code for both on x86-64:
f1():
movabs rax,0x22200000111
ret
f2():
movabs rax,0x11100000222 ; opposite packing order, not important
ret
But Clang 5 generates different code for f2():
f2():
movabs rax,0x11100000222
mov QWORD PTR [rdi],rax
mov rax,rdi
ret
As do GCC 4 through GCC 7:
f2():
movabs rdx,0x11100000222
mov rax,rdi
mov QWORD PTR [rdi],rdx ; GCC 4-6 use 2 DWORD stores
ret
Why is the generated code worse when returning a std::tuple that fits in a single register, compared with std::pair? It seems especially strange since Clang 3 and 4 seemed to be optimal, yet 5 is not.
Try it here: https://godbolt.org/g/T2Yqrj
The short answer is that the libstdc++ standard library implementation used by gcc and clang on Linux implements std::tuple with a non-trivial move constructor (in particular, the _Tuple_impl base class has a non-trivial move constructor). On the other hand, the copy and move constructors for std::pair are all defaulted.
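The difference between a defaulted and a user-provided move constructor can be checked with standard type traits. The following is a minimal sketch, not the actual libstdc++ source: PairLike, ImplBase and TupleLike are hypothetical stand-ins that only mimic the shape described above (a class whose base has a user-provided move constructor).

```cpp
#include <type_traits>

// Hypothetical stand-in for std::pair: copy and move are defaulted.
struct PairLike {
    int first;
    int second;
    PairLike(int a, int b) : first(a), second(b) {}
    PairLike(const PairLike&) = default;
    PairLike(PairLike&&) = default;
};

// Hypothetical stand-in for a _Tuple_impl-style base class:
// the move constructor is user-provided, hence non-trivial.
struct ImplBase {
    int head;
    int tail;
    ImplBase(int h, int t) : head(h), tail(t) {}
    ImplBase(const ImplBase&) = default;
    ImplBase(ImplBase&& other) noexcept : head(other.head), tail(other.tail) {}
};

// Hypothetical stand-in for std::tuple: its own move constructor is
// defaulted, but it has to call the non-trivial one in the base.
struct TupleLike : ImplBase {
    TupleLike(int h, int t) : ImplBase(h, t) {}
    TupleLike(const TupleLike&) = default;
    TupleLike(TupleLike&&) = default;
};

static_assert(std::is_trivially_move_constructible<PairLike>::value,
              "defaulted move constructor stays trivial");
static_assert(!std::is_trivially_move_constructible<TupleLike>::value,
              "a non-trivial base move constructor propagates to the derived class");
```

Both types do exactly the same work when moved; only the way the constructor is declared differs, and that is enough to change the result of the trait.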
The Gory Details
You ran your tests on Linux, which adheres to the SysV x86-64 ABI. This ABI has specific rules for passing classes or structures to functions and returning them, which you can read more about in the ABI specification. The specific question we are interested in is whether the two int fields in these structures will get the INTEGER class or the MEMORY class.
A recent version of the ABI specification has this to say:
The classification of aggregate (structures and arrays) and union
types works as follows:
- If the size of an object is larger than eight eightbytes, or it contains unaligned fields, it has class MEMORY.
- If a C++ object has either a non-trivial copy constructor or a non-trivial destructor, it is passed by invisible reference (the
object is replaced in the parameter list by a pointer that has class
INTEGER).
- If the size of the aggregate exceeds a single eightbyte, each is classified separately. Each eightbyte gets initialized to class
NO_CLASS.
- Each field of an object is classified recursively so that always two fields are considered. The resulting class is calculated according
to the classes of the fields in the eightbyte
It is the second condition that applies here. Note that it mentions only copy constructors, and not move constructors, but it is fairly apparent that this is just a defect in the specification: once move constructors were introduced, they generally needed to be included in any classification algorithm that previously considered copy constructors. In particular, the IA-64 cxx-abi, which gcc is documented to follow, does include move constructors:
If the parameter type is non-trivial for the purposes of calls, the
caller must allocate space for a temporary and pass that temporary by
reference. Specifically:
- Space is allocated by the caller in the usual manner for a temporary, typically on the stack.
and then the definition of non-trivial:
A type is considered non-trivial for the purposes of calls if:
- it has a non-trivial copy constructor, move constructor, or destructor, or
- all of its copy and move constructors are deleted.
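The quoted definition can be approximated with standard type traits. This is only a rough sketch under stated assumptions: approx_trivial_for_calls and the three structs are hypothetical names, and the predicate is not the ABI's exact rule (for example, it does not model the "all copy and move constructors are deleted" case).

```cpp
#include <type_traits>

// Approximation (an assumption, not the ABI's wording): treat a type as
// trivial for the purposes of calls when its copy constructor, move
// constructor and destructor are all trivial.
template <class T>
struct approx_trivial_for_calls
    : std::integral_constant<bool,
          std::is_trivially_copy_constructible<T>::value &&
          std::is_trivially_move_constructible<T>::value &&
          std::is_trivially_destructible<T>::value> {};

// Hypothetical test types.
struct PlainInts { int a; int b; };                  // everything implicit

struct WithDtor {                                    // user-provided destructor
    int a; int b;
    ~WithDtor() {}
};

struct WithUserCopy {                                // user-provided copy constructor
    int a; int b;
    WithUserCopy(int x, int y) : a(x), b(y) {}
    WithUserCopy(const WithUserCopy& o) : a(o.a), b(o.b) {}
};

static_assert(approx_trivial_for_calls<PlainInts>::value,
              "candidate for being returned in a register");
static_assert(!approx_trivial_for_calls<WithDtor>::value,
              "non-trivial destructor");
static_assert(!approx_trivial_for_calls<WithUserCopy>::value,
              "non-trivial copy constructor");
```

All three structs have the same size and layout, so any difference in how they are passed comes from the special member functions alone.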
So because tuple is not considered to be trivially copyable from an ABI perspective, it gets MEMORY treatment, which means that your function must populate the stack-allocated object passed in by the caller in rdi. The std::pair function can just pass back the entire structure in rax since it fits in one EIGHTBYTE and has class INTEGER.
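To make the two return conventions concrete, here is a source-level sketch of what the assembly in the question is doing. The names pack_low_high, f1_lowered, f2_lowered, f2_as_bits and TwoInts are hypothetical, and the field order in f2_lowered simply mirrors the "opposite packing order" seen in the listing; it is an illustration, not what the compiler literally emits.

```cpp
#include <cstdint>

// Hypothetical helper: the 64-bit value you get when two 32-bit ints sit
// next to each other in memory on a little-endian machine (first field in
// the low half, second field in the high half).
constexpr std::uint64_t pack_low_high(std::int32_t low, std::int32_t high) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(high)) << 32)
         | static_cast<std::uint32_t>(low);
}

// INTEGER class: the whole object travels back in rax.
constexpr std::uint64_t f1_lowered() {
    return pack_low_high(0x111, 0x222);
}

// Hypothetical stand-in for the 8-byte tuple object.
struct TwoInts {
    std::int32_t a;
    std::int32_t b;
};

// MEMORY class: the caller passes a pointer to storage (rdi); the callee
// fills it in and returns the same pointer (rax).
constexpr TwoInts* f2_lowered(TwoInts* result) {
    result->a = 0x222;  // stored first, matching the listing's packing order
    result->b = 0x111;
    return result;
}

// Caller side: allocate the temporary, pass its address, read it back.
constexpr std::uint64_t f2_as_bits() {
    TwoInts tmp{0, 0};
    f2_lowered(&tmp);
    return pack_low_high(tmp.a, tmp.b);
}

static_assert(f1_lowered() == 0x22200000111ULL, "matches the movabs in f1()");
static_assert(f2_as_bits() == 0x11100000222ULL, "matches the movabs in f2()");
```

The extra store and the pointer copy in f2_lowered correspond to the two additional mov instructions in the Clang 5 and GCC output.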
Does it matter? Yeah, strictly speaking, a standalone function like the one you have compiled will be less efficient for tuple since this ABI difference is "baked in".
Often, however, the compiler will be able to see the body of the function and inline it, or perform inter-procedural analysis even if it is not inlined. In both cases, the ABI is no longer important and it is likely both approaches would be equally efficient, at least with a decent optimizer. For example, let's call your f1() and f2() functions and do some math on the result:
int add_pair() {
    auto p = f1();
    return p.first + p.second;
}
int add_tuple() {
    auto t = f2();
    return std::get<0>(t) + std::get<1>(t);
}
In principle the add_tuple function starts from a disadvantage, since it has to call f2(), which is less efficient, and it also has to create a temporary tuple object on the stack so it can pass it to f2 as the hidden parameter. Well, no matter: both functions are fully optimized to just return the right value directly:
add_pair():
mov eax, 819
ret
add_tuple():
mov eax, 819
ret
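The constant 819 in both listings is simply 0x111 + 0x222 (273 + 546). As a small sanity check, the same sums can be evaluated at compile time; the function names below are hypothetical, and the sketch assumes a C++14 or later compiler, where std::make_pair, std::make_tuple and std::get are constexpr.

```cpp
#include <tuple>
#include <utility>

// Compile-time version of the pair sum (hypothetical name).
constexpr int add_pair_constexpr() {
    auto p = std::make_pair(0x111, 0x222);
    return p.first + p.second;
}

// Compile-time version of the tuple sum (hypothetical name).
constexpr int add_tuple_constexpr() {
    auto t = std::make_tuple(0x111, 0x222);
    return std::get<0>(t) + std::get<1>(t);
}

static_assert(add_pair_constexpr() == 819, "0x111 + 0x222");
static_assert(add_tuple_constexpr() == 819, "0x111 + 0x222");
```

Once the value is known at compile time, the return convention of f1() and f2() no longer plays any role, which is what the optimized assembly shows.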
So overall you can say that the effect of this ABI issue with tuple will be relatively muted: it adds a small fixed overhead to functions that must comply with the ABI, but this will only really matter in a relative sense for very small functions - but such functions are likely to be declared in a place where they can be inlined (or if not, you are leaving performance on the table).
libstdc++ vs libc++
As explained above, this is an ABI issue, not an optimization issue, per se. Both clang and gcc are already optimizing the library code to the maximum extent possible under the constraints of the ABI - if they generated code like f1() for the std::tuple case they would break ABI-compliant callers.
You can see this clearly if you switch to using libc++ rather than the Linux default of libstdc++. libc++ doesn't have the explicit move constructor (as Marc Glisse mentions in the comments, libstdc++ is stuck with its implementation for backwards compatibility). Now clang (and presumably gcc, although I didn't try it) generates the same optimal code in both cases:
f1(): # @f1()
movabs rax, 2345052143889
ret
f2(): # @f2()
movabs rax, 2345052143889
ret
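If you want to check which library a given build is using and what it reports for the two types, something like the following sketch may help. StdLib, detected_stdlib and the two constants are hypothetical names; _LIBCPP_VERSION and __GLIBCXX__ are the version macros defined by libc++ and libstdc++ respectively. With clang, libc++ is selected with -stdlib=libc++.

```cpp
#include <tuple>
#include <type_traits>
#include <utility>

enum class StdLib { LibCxx, LibStdCxx, Other };

// Reports which standard library the translation unit was compiled against,
// based on the library's own version macro.
constexpr StdLib detected_stdlib() {
#if defined(_LIBCPP_VERSION)
    return StdLib::LibCxx;
#elif defined(__GLIBCXX__)
    return StdLib::LibStdCxx;
#else
    return StdLib::Other;
#endif
}

// What this library reports for the two types from the question.
constexpr bool kPairMoveIsTrivial =
    std::is_trivially_move_constructible<std::pair<int, int>>::value;
constexpr bool kTupleMoveIsTrivial =
    std::is_trivially_move_constructible<std::tuple<int, int>>::value;
```

The tuple value is deliberately not asserted here, since it depends on the library and its version; printing kTupleMoveIsTrivial alongside detected_stdlib() under each library is one way to compare them.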
Earlier Versions of Clang
Why do earlier versions of clang compile it differently? It was simply a bug in clang or a bug in the spec, depending on how you look at it. The spec didn't explicitly include move construction in the cases where a hidden pointer to a temporary needed to be passed, so clang wasn't conforming to the IA-64 C++ ABI. For example, code compiled the way clang used to do it was not compatible with gcc or newer versions of clang. The spec was eventually updated and the clang behavior changed in version 5.0.
Update: Marc Glisse mentions in the comments that there was initially confusion about the interaction of non-trivial move constructors and the C++ ABI, and clang changed their behavior at some point, which probably explains the switch:
The ABI specification for some argument passing cases involving move
constructors were unclear, and when they were clarified, clang changed
to follow the ABI. This is probably one of those cases.