Thanks to C++11 we received the std::function
family of functor wrappers. Unfortunately, I keep hearing only bad things about these new additions. The most popular is that they are horribly slow. I tested it and they truly suck in comparison with templates.
#include <iostream>
#include <functional>
#include <string>
#include <chrono>
template <typename F>
float calc1(F f) { return -1.0f * f(3.3f) + 666.0f; }
float calc2(std::function<float(float)> f) { return -1.0f * f(3.3f) + 666.0f; }
int main() {
using namespace std::chrono;
const auto tp1 = system_clock::now();
for (int i = 0; i < 1e8; ++i) {
calc1([](float arg){ return arg * 0.5f; });
}
const auto tp2 = high_resolution_clock::now();
const auto d = duration_cast<milliseconds>(tp2 - tp1);
std::cout << d.count() << std::endl;
return 0;
}
111 ms vs 1241 ms. I assume this is because templates can be nicely inlined, while function
s cover the internals via virtual calls.
Obviously templates have their issues as I see them:
- they have to be provided as headers which is not something you might not wish to do when releasing your library as a closed code,
- they may make the compilation time much longer unless
extern template
-like policy is introduced, - there is no (at least known to me) clean way of representing requirements (concepts, anyone?) of a template, bar a comment describing what kind of functor is expected.
Can I thus assume that function
s can be used as de facto standard of passing functors, and in places where high performance is expected templates should be used?
Edit:
My compiler is the Visual Studio 2012 without CTP.
You already have some good answers here, so I'm not going to contradict them, in short comparing std::function to templates is like comparing virtual functions to functions. You never should "prefer" virtual functions to functions, but rather you use virtual functions when it fits the problem, moving decisions from compile time to run time. The idea is that rather than having to solve the problem using a bespoke solution (like a jump-table) you use something that gives the compiler a better chance of optimizing for you. It also helps other programmers, if you use a standard solution.
With Clang there's no performance difference between the two
Using clang (3.2, trunk 166872) (-O2 on Linux), the binaries from the two cases are actually identical.
-I'll come back to clang at the end of the post. But first, gcc 4.7.2:
There's already a lot of insight going on, but I want to point out that the result of the calculations of calc1 and calc2 are not the same, due to in-lining etc. Compare for example the sum of all results:
with calc2 that becomes
while with calc1 it becomes
that's a factor of ~40 in speed difference, and a factor of ~4 in the values. The first is a much bigger difference than what OP posted (using visual studio). Actually printing out the value a the end is also a good idea to prevent the compiler to removing code with no visible result (as-if rule). Cassio Neri already said this in his answer. Note how different the results are -- One should be careful when comparing speed factors of codes that perform different calculations.
Also, to be fair, comparing various ways of repeatedly calculating f(3.3) is perhaps not that interesting. If the input is constant it should not be in a loop. (It's easy for the optimizer to notice)
If I add a user supplied value argument to calc1 and 2 the speed factor between calc1 and calc2 comes down to a factor of 5, from 40! With visual studio the difference is close to a factor of 2, and with clang there is no difference (see below).
Also, as multiplications are fast, talking about factors of slow-down is often not that interesting. A more interesting question is, how small are your functions, and are these calls the bottleneck in a real program?
Clang:
Clang (I used 3.2) actually produced identical binaries when I flip between calc1 and calc2 for the example code (posted below). With the original example posted in the question both are also identical but take no time at all (the loops are just completely removed as described above). With my modified example, with -O2:
Number of seconds to execute (best of 3):
The calculated results of all binaries are the same, and all tests were executed on the same machine. It would be interesting if someone with deeper clang or VS knowledge could comment on what optimizations may have been done.
My modified test code:
Update:
Added vs2015. I also noticed that there are double->float conversions in calc1,calc2. Removing them does not change the conclusion for visual studio (both are a lot faster but the ratio is about the same).
Different isn't the same.
It's slower because it does things that a template can't do. In particular, it lets you call any function that can be called with the given argument types and whose return type is convertible to the given return type from the same code.
Note that the same function object,
fun
, is being passed to both calls toeval
. It holds two different functions.If you don't need to do that, then you should not use
std::function
.This answer is intended to contribute, to the set of existing answers, what I believe to be a more meaningful benchmark for the runtime cost of std::function calls.
The std::function mechanism should be recognized for what it provides: Any callable entity can be converted to a std::function of appropriate signature. Suppose you have a library that fits a surface to a function defined by z = f(x,y), you can write it to accept a
std::function<double(double,double)>
, and the user of the library can easily convert any callable entity to that; be it an ordinary function, a method of a class instance, or a lambda, or anything that is supported by std::bind.Unlike template approaches, this works without having to recompile the library function for different cases; accordingly, little extra compiled code is needed for each additional case. It has always been possible to make this happen, but it used to require some awkward mechanisms, and the user of the library would likely need to construct an adapter around their function to make it work. std::function automatically constructs whatever adapter is needed to get a common runtime call interface for all the cases, which is a new and very powerful feature.
To my view, this is the most important use case for std::function as far as performance is concerned: I'm interested in the cost of calling a std::function many times after it has been constructed once, and it needs to be a situation where the compiler is unable to optimize the call by knowing the function actually being called (i.e. you need to hide the implementation in another source file to get a proper benchmark).
I made the test below, similar to the OP's; but the main changes are:
The results I get are:
case (a) (inline) 1.3 nsec
all other cases: 3.3 nsec.
Case (d) tends to be slightly slower, but the difference (about 0.05 nsec) is absorbed in the noise.
Conclusion is that the std::function is comparable overhead (at call time) to using a function pointer, even when there's simple 'bind' adaptation to the actual function. The inline is 2 ns faster than the others but that's an expected tradeoff since the inline is the only case which is 'hard-wired' at run time.
When I run johan-lundberg's code on the same machine, I'm seeing about 39 nsec per loop, but there's a lot more in the loop there, including the actual constructor and destructor of the std::function, which is probably fairly high since it involves a new and delete.
-O2 gcc 4.8.1, to x86_64 target (core i5).
Note, the code is broken up into two files, to prevent the compiler from expanding the functions where they are called (except in the one case where it's intended to).
----- first source file --------------
----- second source file -------------
For those interested, here's the adaptor the compiler built to make 'mul_by' look like a float(float) - this is 'called' when the function created as bind(mul_by,_1,0.5) is called:
(so it might have been a bit faster if I'd written 0.5f in the bind...) Note that the 'x' parameter arrives in %xmm0 and just stays there.
Here's the code in the area where the function is constructed, prior to calling test_stdfunc - run through c++filt :
I found your results very interesting so I did a bit of digging to understand what is going on. First off as many others have said with out having the results of the computation effect the state of the program the compiler will just optimize this away. Secondly having a constant 3.3 given as an armament to the callback I suspect that there will be other optimizations going on. With that in mind I changed your benchmark code a little bit.
Given this change to the code I compiled with gcc 4.8 -O3 and got a time of 330ms for calc1 and 2702 for calc2. So using the template was 8 times faster, this number looked suspects to me, speed of a power of 8 often indicates that the compiler has vectorized something. when I looked at the generated code for the templates version it was clearly vectoreized
Where as the std::function version was not. This makes sense to me, since with the template the compiler knows for sure that the function will never change throughout the loop but with the std::function being passed in it could change, therefor can not be vectorized.
This led me to try something else to see if I could get the compiler to perform the same optimization on the std::function version. Instead of passing in a function I make a std::function as a global var, and have this called.
With this version we see that the compiler has now vectorized the code in the same way and I get the same benchmark results.
So my conclusion is the raw speed of a std::function vs a template functor is pretty much the same. However it makes the job of the optimizer much more difficult.
Andy Prowl has nicely covered design issues. This is, of course, very important, but I believe the original question concerns more performance issues related to
std::function
.First of all, a quick remark on the measurement technique: The 11ms obtained for
calc1
has no meaning at all. Indeed, looking at the generated assembly (or debugging the assembly code), one can see that VS2012's optimizer is clever enough to realize that the result of callingcalc1
is independent of the iteration and moves the call out of the loop:Furthermore, it realises that calling
calc1
has no visible effect and drops the call altogether. Therefore, the 111ms is the time that the empty loop takes to run. (I'm surprised that the optimizer has kept the loop.) So, be careful with time measurements in loops. This is not as simple as it might seem.As it has been pointed out, the optimizer has more troubles to understand
std::function
and doesn't move the call out of the loop. So 1241ms is a fair measurement forcalc2
.Notice that,
std::function
is able to store different types of callable objects. Hence, it must perform some type-erasure magic for the storage. Generally, this implies a dynamic memory allocation (by default through a call tonew
). It's well known that this is a quite costly operation.The standard (20.8.11.2.1/5) encorages implementations to avoid the dynamic memory allocation for small objects which, thankfully, VS2012 does (in particular, for the original code).
To get an idea of how much slower it can get when memory allocation is involved, I've changed the lambda expression to capture three
float
s. This makes the callable object too big to apply the small object optimization:For this version, the time is approximately 16000ms (compared to 1241ms for the original code).
Finally, notice that the lifetime of the lambda encloses that of the
std::function
. In this case, rather than storing a copy of the lambda,std::function
could store a "reference" to it. By "reference" I mean astd::reference_wrapper
which is easily build by functionsstd::ref
andstd::cref
. More precisely, by using:the time decreases to approximately 1860ms.
I wrote about that a while ago:
http://www.drdobbs.com/cpp/efficient-use-of-lambda-expressions-and/232500059
As I said in the article, the arguments don't quite apply for VS2010 due to its poor support to C++11. At the time of the writing, only a beta version of VS2012 was available but its support for C++11 was already good enough for this matter.