This question already has an answer here:
- When a C++ lambda expression has a lot of captures by reference, the size of the unnamed function object becomes large
I recently needed a lambda that captured multiple local variables by reference, so I wrote a test snippet to investigate its efficiency and compiled it with -O3 using clang 3.6:
void do_something_with(void*);
void test()
{
    int a = 0, b = 0, c = 0;
    auto func = [&] () {
        a++;
        b++;
        c++;
    };
    do_something_with((void*)&func);
}
movl $0x0,0x24(%rsp)
movl $0x0,0x20(%rsp)
movl $0x0,0x1c(%rsp)
lea 0x24(%rsp),%rax
mov %rax,(%rsp)
lea 0x20(%rsp),%rax
mov %rax,0x8(%rsp)
lea 0x1c(%rsp),%rax
mov %rax,0x10(%rsp)
lea (%rsp),%rdi
callq ...
Clearly the lambda only needs the address of one of the variables, from which all the others could be obtained by relative addressing.
Instead, the compiler created a struct on the stack containing a pointer to each local variable, and that struct is the closure object whose address is passed on. This is much the same as if I had written:
int a = 0, b = 0, c = 0;
struct X
{
    int *pa, *pb, *pc;
};
X x = {&a, &b, &c};
auto func = [p = &x] () {
    (*p->pa)++;
    (*p->pb)++;
    (*p->pc)++;
};
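This is inefficient for various reasons, but most worryingly because the closure grows with every captured variable, which could lead to heap allocation (for example, once it is stored in a type-erased wrapper such as std::function) if too many variables are captured.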
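To make the size concern concrete, here is a minimal sketch that reports the size of closures capturing one and three locals by reference. The helper names `closure_size_one` and `closure_size_three` are made up for illustration, and the actual sizes are up to the implementation; on typical 64-bit builds of clang and gcc one would expect roughly one pointer per captured variable.

```cpp
#include <cstddef>

// Hypothetical helper: size of a closure capturing one int by reference.
inline std::size_t closure_size_one()
{
    int a = 0;
    auto f = [&a] () { a++; };
    f();                 // call it once so the capture is actually used
    return sizeof(f);    // sizeof is a compile-time property of the closure type
}

// Hypothetical helper: size of a closure capturing three ints by reference.
inline std::size_t closure_size_three()
{
    int a = 0, b = 0, c = 0;
    auto f = [&a, &b, &c] () { a++; b++; c++; };
    f();
    return sizeof(f);
}
```

Comparing the two return values on a given compiler shows whether the closure grows with the number of captures there.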
My questions:
The fact that both clang and gcc do this at -O3 makes me suspect that something in the standard actually forces closures to be implemented inefficiently. Is this the case? If so, what is the reasoning? It cannot be for binary compatibility of lambdas between compilers, because any code that knows the type of the lambda is guaranteed to lie in the same translation unit.
If not, then why is this optimisation missing from two major compilers?
EDIT:
Here is an example of the more efficient code that I would have liked to see from the compiler. This code uses less stack space, the lambda performs only one pointer indirection instead of two, and the lambda's size does not grow with the number of captured variables:
struct X
{
    int a = 0, b = 0, c = 0;
} x;
auto func = [&x] () {
    x.a++;
    x.b++;
    x.c++;
};
movl $0x0,0x8(%rsp)
movl $0x0,0xc(%rsp)
movl $0x0,0x10(%rsp)
lea 0x8(%rsp),%rax
mov %rax,(%rsp)
lea (%rsp),%rdi
callq ...
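The manual grouping above can be checked with a small sketch like the following. The names `Locals`, `run_grouped` and `grouped_closure_size` are hypothetical, and the claim that the closure is about one pointer in size is an assumption about typical clang and gcc builds, not something the standard promises.

```cpp
#include <cstddef>

// Hypothetical struct that groups the three locals from the question.
struct Locals
{
    int a = 0, b = 0, c = 0;
};

// Calls the closure once and returns a + b + c afterwards.
inline int run_grouped()
{
    Locals x;
    auto func = [&x] () { x.a++; x.b++; x.c++; };
    func();
    return x.a + x.b + x.c;
}

// Size of the closure that captures only a reference to the grouping struct.
inline std::size_t grouped_closure_size()
{
    Locals x;
    auto func = [&x] () { x.a++; x.b++; x.c++; };
    func();              // call it once so the capture is actually used
    return sizeof(func);
}
```

Adding more members to `Locals` should leave the closure size unchanged, since only the one reference to `x` is captured.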