I'm trying to understand the concept of expression templates in C++, as such I've cobbled together pieces of example code etc to produce a simple vector and associated expression template infrastructure to support only binary operators (+,-,*).
Everything compiles, however I've noticed the performance difference between the standard hand written loop versus the expression template variant is quite large. ET is nearly twice as slow as the hand written. I expected a difference but not that much.
A complete code listing can be found here:
https://gist.github.com/BernieWt/769a4a3ceb90bb0cae9e
(apologies for the messy code.)
.
In short I'm essentially comparing the following two loops:
ET:
for (std::size_t i = 0 ; i < rounds; ++i)
{
v4 = ((v0 - v1) + (v2 * v3)) + v4;
total += v4[0];
}
HW:
for (std::size_t i = 0 ; i < rounds; ++i)
{
for (std::size_t x = 0; x < N; ++x)
{
v4[x] = (v0[x] - v1[x]) + (v2[x] * v3[x]) + v4[x];
}
total += v4[0];
}
When I disassemble the output, the following is produced, the difference is clearly the extra memcpy and several 64-bit loads that occurs during the return of the ET variant:
Standard Loop | Expression Template
----------------------------------------+--------------------------------
L26: | L12:
xor edx, edx | xor edx, edx
jmp .L27 | jmp .L13
L28: | L14:
movsd xmm3, QWORD PTR [rsp+2064+rdx*8] | movsd xmm3, QWORD PTR [rsp+2064+rdx*8]
L27: | L13:
movsd xmm2, QWORD PTR [rsp+1040+rdx*8] | movsd xmm1, QWORD PTR [rsp+1552+rdx*8]
movsd xmm1, QWORD PTR [rsp+16+rdx*8] | movsd xmm2, QWORD PTR [rsp+16+rdx*8]
mulsd xmm2, QWORD PTR [rsp+1552+rdx*8] | mulsd xmm1, QWORD PTR [rsp+1040+rdx*8]
subsd xmm1, QWORD PTR [rsp+528+rdx*8] | subsd xmm2, QWORD PTR [rsp+528+rdx*8]
addsd xmm1, xmm2 | addsd xmm1, xmm2
addsd xmm1, xmm3 | addsd xmm1, xmm3
movsd QWORD PTR [rsp+2064+rdx*8], xmm1 | movsd QWORD PTR [rsp+2576+rdx*8], xmm1
add rdx, 1 | add rdx, 1
cmp rdx, 64 | cmp rdx, 64
jne .L28 | jne .L14
| mov dx, 512
| movsd QWORD PTR [rsp+8], xmm0
| lea rsi, [rsp+2576]
| lea rdi, [rsp+2064]
| call memcpy
movsd xmm3, QWORD PTR [rsp+2064] | movsd xmm0, QWORD PTR [rsp+8]
sub rcx, 1 | sub rbx, 1
| movsd xmm3, QWORD PTR [rsp+2064]
addsd xmm0, xmm3 | addsd xmm0, xmm3
jne .L26 | jne .L12
My question is: At this point I'm stuck on how to go about removing the copy, I essentially want to update v4 in place without the copy. Any ideas on how to go about doing this?
Note1: I've tried GCC 4.7/9, Clang 3.3, VS2010/2013 - I get roughly the same performance profile on all the compilers mentioned.
Note2: I've also tried forward declaring bin_exp for vec and then adding the following assignment operator and removing the conversion operator from bin_exp,but to no avail:
template<typename LHS, typename RHS, typename Op>
inline vec<N>& operator=(const bin_exp<LHS,RHS,Op,N>& o)
{
for (std::size_t i = 0; i < N; ++i) { d[i] = o[i]; }
return *this;
}
UPDATE The solution presented in NOTE 2 is actually correct. and does cause the compiler to generate code near identical the the hand written loop.
.
On another note, if I rewrite the use-case for the ET variant to be as follows:
auto expr = ((v0 - v1) + (v2 * v3)) + v4;
//auto& expr = ((v0 - v1) + (v2 * v3)) + v4; same problem
//auto&& expr = ((v0 - v1) + (v2 * v3)) + v4; same problem
for (std::size_t i = 0 ; i < rounds; ++i)
{
v4 = expr
total += v4[0];
}
A crash occurs because the temporaries (rvalues) that are produced during the instantiation of the ET are destroyed prior to the assignment. I was wondering if there's any way using C++11 to cause a compiler error.