Let's look at one particular benefit of expression templates: ETs can be used to avoid vector-sized temporaries in memory which occur in overloaded operators like:
template<typename T>
std::vector<T> operator+(const std::vector<T>& a, const std::vector<T>& b)
{
std::vector<T> tmp; // vector-sized temporary
for_each(...);
return tmp;
}
In C++11 the return statement of this function applies move semantics. No copy of the vector. That's a win.
However, if I look at a simple expression like
d = a + b + c;
I see that the above function gets called twice (for both operator+
) while the final assignment can be done with move semantics.
In total 2 loops are executed. Means that I put out a temporary and read it back in right after. For big vectors this falls out of cache. That's worse than expression templates. They can do the whole thing in just 1 loop. ETs can execute the above code equivalent to:
for(int i=0 ; i < vec_length ; ++i)
d[i] = a[i] + b[i] + c[i];
I was wondering whether lambdas together with move semantics or any other new feature can do as good as ETs. Any thoughts?
Edit:
Basically, using the ET technique the compiler builds a parse tree that resembles the algebraic expression with it's type system. This tree consists of inner nodes and leaf nodes. The inner nodes represent operations (addition, multiplication, etc.) and the leaf nodes represent references to the data objects.
I tried to think of the whole computation process in the fashion of a stack machine: Take an operation from an operation stack and pull the next arguments from the argument stack and evaluate the operation. Put the result back on the stack waiting for the operation.
To represent these two different objects (operation stack and data
leaf stack) I bundled together a std::tuple
for the operations and a
std::tuple
for the data leaves into a std::pair<>
. Initially I
used a std:vector
but that resulted in runtime overhead.
The whole process goes in two phases: Stack machine initialisation where the operation and argument stack are initialised. And the evaluation phase which is triggered by assigning the paired containers to the vector.
I created a class Vec
which holds a private array<int,5>
(the
payload) and which features an overloaded assignment operator that
takes the "expression".
The global operator*
is overloaded for all combinations of taking
Vec
and "expression" to enable the correct handling also in the case
where we have more than just a*b
. (Notice, I switched for this
educational example to the multiplication - basically to quickly spot
the imull
in the assembler.)
What is done first before the evaluation starts is "extracting" the
values out of the involved Vec
objects and initializing the argument
stack. That was necessary to not have different kinds of objects lying
around: Indexable vectors and non-indexable results. This is what the
Extractor
is for. Nice thing again: Variadic templates are used which
in this case results in no run-time overhead (all this is done at
compile time).
The whole thing works. The expression is nicely evaluated (I also added the addition, but that is left out here to fit the code). Below you can see the assembler output. Just raw compuation, exactly as you want it to be: En-par with ET technique.
Upshot. The new language features of C++11 offer the variadic templates which (along with template meta-programming) open up the area of compile time computation. I showed here how the benefits of variadic templates can be used to produce code as good as with the traditional ET technique.
#include<algorithm>
#include<iostream>
#include<vector>
#include<tuple>
#include<utility>
#include<array>
template<typename Target,typename Tuple, int N, bool end>
struct Extractor {
template < typename ... Args >
static Target index(int i,const Tuple& t, Args && ... args)
{
return Extractor<Target, Tuple, N+1,
std::tuple_size<Tuple>::value == N+1>::
index(i, t , std::forward<Args>(args)..., std::get<N>(t).vec[i] );
}
};
template < typename Target, typename Tuple, int N >
struct Extractor<Target,Tuple,N,true>
{
template < typename ... Args >
static Target index(int i,Tuple const& t,
Args && ... args) {
return Target(std::forward<Args>(args)...); }
};
template < typename ... Vs >
std::tuple<typename std::remove_reference<Vs>::type::type_t...>
extract(int i , const std::tuple<Vs...>& tpl)
{
return Extractor<std::tuple<typename std::remove_reference<Vs>::type::type_t...>,
std::tuple<Vs...>, 0,
std::tuple_size<std::tuple<Vs...> >::value == 0>::index(i,tpl);
}
struct Vec {
std::array<int,5> vec;
typedef int type_t;
template<typename... OPs,typename... VALs>
Vec& operator=(const std::pair< std::tuple<VALs...> , std::tuple<OPs...> >& e) {
for( int i = 0 ; i < vec.size() ; ++i ) {
vec[i] = eval( extract(i,e.first) , e.second );
}
}
};
template<int OpPos,int ValPos, bool end>
struct StackMachine {
template<typename... OPs,typename... VALs>
static void eval_pos( std::tuple<VALs...>& vals , const std::tuple<OPs...> & ops )
{
std::get<ValPos+1>( vals ) =
std::get<OpPos>(ops).apply( std::get<ValPos>( vals ) ,
std::get<ValPos+1>( vals ) );
StackMachine<OpPos+1,ValPos+1,sizeof...(OPs) == OpPos+1>::eval_pos(vals,ops);
}
};
template<int OpPos,int ValPos>
struct StackMachine<OpPos,ValPos,true> {
template<typename... OPs,typename... VALs>
static void eval_pos( std::tuple<VALs...>& vals ,
const std::tuple<OPs...> & ops )
{}
};
template<typename... OPs,typename... VALs>
int eval( const std::tuple<VALs...>& vals , const std::tuple<OPs...> & ops )
{
StackMachine<0,0,false>::eval_pos(const_cast<std::tuple<VALs...>&>(vals),ops);
return std::get<sizeof...(OPs)>(vals);
}
struct OpMul {
static int apply(const int& lhs,const int& rhs) {
return lhs*rhs;
}
};
std::pair< std::tuple< const Vec&, const Vec& > , std::tuple<OpMul> >
operator*(const Vec& lhs,const Vec& rhs)
{
return std::make_pair( std::tuple< const Vec&, const Vec& >( lhs , rhs ) ,
std::tuple<OpMul>( OpMul() ) );
}
template<typename... OPs,typename... VALs>
std::pair< std::tuple< const Vec&, VALs... > , std::tuple<OPs...,OpMul> >
operator*(const Vec& lhs,const std::pair< std::tuple< VALs... > , std::tuple<OPs...> >& rhs)
{
return std::make_pair( std::tuple_cat( rhs.first , std::tuple< const Vec& >(lhs) ) ,
std::tuple_cat( rhs.second , std::tuple<OpMul>( OpMul() ) ) );
}
template<typename... OPs,typename... VALs>
std::pair< std::tuple< const Vec&, VALs... > , std::tuple<OPs...,OpMul> >
operator*(const std::pair< std::tuple< VALs... > , std::tuple<OPs...> >& lhs,
const Vec& rhs)
{
return std::make_pair( std::tuple_cat( lhs.first , std::tuple< const Vec& >(rhs) ) ,
std::tuple_cat( lhs.second , std::tuple<OpMul>( OpMul() ) ) );
}
int main()
{
Vec d,c,b,a;
for( int i = 0 ; i < d.vec.size() ; ++i ) {
a.vec[i] = 10+i;
b.vec[i] = 20+i;
c.vec[i] = 30+i;
d.vec[i] = 0;
}
d = a * b * c * a;
for( int i = 0 ; i < d.vec.size() ; ++i )
std::cout << d.vec[i] << std::endl;
}
Assembler generated with g++-4.6 -O3
(I had to put some runtime dependence into the vector initialization so that the compiler doesn't calculate the whole thing at compile time and you actually see the imull
instaructions.)
imull %esi, %edx
imull 32(%rsp), %edx
imull %edx, %esi
movl 68(%rsp), %edx
imull %ecx, %edx
movl %esi, (%rsp)
imull 36(%rsp), %edx
imull %ecx, %edx
movl 104(%rsp), %ecx
movl %edx, 4(%rsp)
movl 72(%rsp), %edx
imull %ecx, %edx
imull 40(%rsp), %edx
imull %ecx, %edx
movl 108(%rsp), %ecx
movl %edx, 8(%rsp)
movl 76(%rsp), %edx
imull %ecx, %edx
imull 44(%rsp), %edx
imull %ecx, %edx
movl 112(%rsp), %ecx
movl %edx, 12(%rsp)
movl 80(%rsp), %edx
imull %ecx, %edx
imull %eax, %edx
imull %ecx, %edx
movl %edx, 16(%rsp)
Quick Answer
Move semantics are not a total panacea on their own --techniques such as expression templates (ETs) are still needed in C++11 to eliminate overheads such as moving data around! So, to answer your question quickly before diving into the rest of my answer, move semantics, etc. doesn't completely replace ETs as my answer illustrates below.
Detailed Answer
ETs typically return proxy objects to defer evaluation until later, so there is no immediate apparent benefit of C++11 language features until the code that triggers the computation. That said, one would not want to write ET code, however, that triggers run-time code generation during the building of the expression tree with the proxies. Nicely, C++11's move semantics and perfect forwarding can help avoid such overheads should that otherwise occur. (Such would not have been possible in C++03.)
Essentially, when writing ETs one wants to exploit the language features in a way to generate optimal code once the member function(s) of the involved proxy objects are invoked. In C++11 this will include using perfect forwarding, move semantics over copying, etc. if such is actually still needed over and above what the compiler can already do. The name of the game is to minimize the run-time code generated and/or maximize the run-time speed and/or minimize the run-time overhead.
I wanted to actually try some ETs with C++11 features to see if I could elide ALL intermediate temporary instance types with a
a = b + c + d;
expression. (As this was just a fun break from my normal activities so I did not compare it to or write ET code purely using C++03. Also I did not worry about all aspects of code polishing that appears below.)To start with, I did not use lambdas --as I preferred to use explicit types and functions-- so I won't argue for/against lambdas with respect to your question. My guess is that they would be similar to using functors and performing no better than the non-ET code below (i.e., moves would be required) --at least until compilers can automatically optimize lambdas using their own internal ETs for such. The code I wrote, however, exploits move semantics and perfect forwarding. Here's what I did starting with the results and then finally presenting the code.
I created a
math_vector<N>
class whereN==3
and it defines an internal private instance ofstd::array<long double, N>
. The members are a default constructor, copy and move constructors and assignments, an initializer list constructor, a destructor, a swap() member, operator [] to access elements of the vector and operator +=. Used without any expression templates, this code:outputs (when compiled with
clang++
3.1 org++
4.8 with -std=c++11 -O3
):i.e., the four explicit constructed instances using initializer lists (i.e., the
initlist
items), theresult
variable (i.e.,0x7fff8d6effffd0
), and, also makes an additional three objects copying and moving.To only focus on temporaries and moving, I created a second case that only creates
result
as a named variable --all others are rvalues:which outputs this (again when ETs are NOT used):
which is better: only extra move objects are created.
But I wanted better: I wanted zero extra temporaries and to have the code as if I hard-coded it with the one normal coding caveat: all explicitly instantiated types would still be created (i.e., the four
initlist
constructors andresult
). To accomplish this I then added expression template code as follows:math_vector_expr<LeftExpr,BinaryOp,RightExpr>
class was created to hold an expression not computed yet,plus_op
class was created to hold the addition operation,math_vector
to accept amath_vector_expr
object, and,The results using ETs are wonderful: no extra temporaries in either case! The previous two cases above now output:
i.e., exactly 5 constructor calls and 5 destructor calls in each case. In fact, if you ask the compiler to generate the assembler code between the 4
initlist
constructor calls and the outputting ofresult
one gets this beautiful string of assembler code:with
g++
andclang++
outputs similar (even smaller) code. No function calls, etc. --just a bunch of adds which is EXACTLY what one wants!The C++11 code to achieve this follows. Simply
#define DONT_USE_EXPR_TEMPL
to not use ETs or don't define it at all to use ETs.Here is the corrected version of Paul Preney's code. I have informed the author by email and in comments; I have written an edit, but it has been rejected by unqualified reviewers.
The error in the original code is that BinaryOp template parameter of math_vector_expr::operator+ is fixed.