Why is a Func<> created from an Expression<Func<>> via .Compile() considerably slower than just using a Func<> declared directly?
I just changed from using a Func<IInterface, object> declared directly to one created from an Expression<Func<IInterface, object>> in an app I am working on, and I noticed that the performance went down.
I have just done a little test, and the Func<> created from an Expression takes "almost" double the time of a Func<> declared directly.

On my machine the direct Func<> takes about 7.5 seconds and the Expression<Func<>> takes about 12.6 seconds.

Here is the test code I used (running .NET 4.0):
// Direct
Func<int, Foo> test1 = x => new Foo(x * 2);
int counter1 = 0;
Stopwatch s1 = new Stopwatch();
s1.Start();
for (int i = 0; i < 300000000; i++)
{
    counter1 += test1(i).Value;
}
s1.Stop();
var result1 = s1.Elapsed;

// Expression.Compile()
Expression<Func<int, Foo>> expression = x => new Foo(x * 2);
Func<int, Foo> test2 = expression.Compile();
int counter2 = 0;
Stopwatch s2 = new Stopwatch();
s2.Start();
for (int i = 0; i < 300000000; i++)
{
    counter2 += test2(i).Value;
}
s2.Stop();
var result2 = s2.Elapsed;

public class Foo
{
    public Foo(int i)
    {
        Value = i;
    }

    public int Value { get; set; }
}
How can I get the performance back? Is there anything I can do to get the Func<> created from the Expression<Func<>> to perform like one declared directly?
(This is not a proper answer, but is material intended to help discover the answer.)
Statistics gathered from Mono 2.6.7 - Debian Lenny - Linux 2.6.26 i686 - 2.80GHz single core:
So on Mono at least both mechanisms appear to generate equivalent IL.
This is the IL generated by Mono's gmcs for the anonymous method:

I will work on extracting the IL generated by the expression compiler.
As others have mentioned, the overhead of calling a dynamic delegate is causing your slowdown. On my computer that overhead is about 12ns with my CPU at 3GHz. The way to get around that is to load the method from a compiled assembly, like this:
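A minimal sketch of this approach on .NET Framework 4, using AssemblyBuilder and LambdaExpression.CompileToMethod (the assembly, type, and method names below are illustrative, not necessarily the original answer's exact code):

// Requires System.Reflection and System.Reflection.Emit.
// Sketch: bake the expression into a method of a dynamically created assembly
// and bind an ordinary delegate to that method.
var asmName = new AssemblyName("FastDelegates");                       // illustrative name
var asmBuilder = AppDomain.CurrentDomain.DefineDynamicAssembly(
    asmName, AssemblyBuilderAccess.Run);
var modBuilder = asmBuilder.DefineDynamicModule("FastDelegatesModule");
var typeBuilder = modBuilder.DefineType("FooFactories", TypeAttributes.Public);
var methodBuilder = typeBuilder.DefineMethod(
    "CreateFoo",
    MethodAttributes.Public | MethodAttributes.Static,
    typeof(Foo), new[] { typeof(int) });

// CompileToMethod emits the expression's IL into the static MethodBuilder.
expression.CompileToMethod(methodBuilder);

Type bakedType = typeBuilder.CreateType();
Func<int, Foo> test3 = (Func<int, Foo>)Delegate.CreateDelegate(
    typeof(Func<int, Foo>), bakedType.GetMethod("CreateFoo"));

int counter3 = 0;
Stopwatch s3 = new Stopwatch();
s3.Start();
for (int i = 0; i < 300000000; i++)
{
    counter3 += test3(i).Value;
}
s3.Stop();
var result3 = s3.Elapsed;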
When I add the above code, result3 is always just a fraction of a second higher than result1, for about a 1 ns overhead. So why even bother with a compiled lambda (test2) when you can have a faster delegate (test3)? Because creating the dynamic assembly is much more overhead in general and only saves you 10-20 ns on each invocation.

I was interested in the answer by Michael B., so I added an extra call in each case before the stopwatch even started. In debug mode the compiled version (case 2) was nearly two times faster (6 seconds versus 10 seconds), and in release mode both versions were on par (the difference was about ~0.2 seconds).
Now, what is striking to me is that with the JIT taken out of the equation, I got the opposite results to Martin.

Edit: Initially I missed the Foo class, so the results above are for a Foo with a field, not a property. With the original Foo the comparison is the same, only the times are bigger: 15 seconds for the direct func, 12 seconds for the compiled version. Again, in release mode the times are similar; now the difference is about ~0.5 seconds.

However, this indicates that if your expression is more complex, there will be a real difference even in release mode.
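As a sketch, the warm-up described above amounts to one un-timed call per delegate before its stopwatch starts, so the one-time JIT cost falls outside the measured loop:

// Same measurement as in the question, but each delegate is invoked once
// before timing so its body is already JIT-compiled when the loop starts.
Func<int, Foo> test1 = x => new Foo(x * 2);
Expression<Func<int, Foo>> expression = x => new Foo(x * 2);
Func<int, Foo> test2 = expression.Compile();

test1(0);   // warm-up, not measured
test2(0);   // warm-up, not measured

Stopwatch s1 = Stopwatch.StartNew();
int counter1 = 0;
for (int i = 0; i < 300000000; i++)
{
    counter1 += test1(i).Value;
}
s1.Stop();

Stopwatch s2 = Stopwatch.StartNew();
int counter2 = 0;
for (int i = 0; i < 300000000; i++)
{
    counter2 += test2(i).Value;
}
s2.Stop();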
It is most likely because the first invocation of the code was not yet JIT-compiled. I decided to look at the IL, and the two methods are virtually identical.

This code gets us the byte arrays and prints them to the console. Here is the output on my machine:

And here is Reflector's version of the first function:
There are only 2 bytes different in the entire method! They are the first opcode, which for the first method is ldarg.0 (load the first argument), but for the second method is ldarg.1 (load the second argument). The difference here is because an expression-generated delegate actually has a Closure object as its target. This can also factor in.

The next opcode for both is ldc.i4.2 (24), which means load the constant 2 onto the stack; the next is the opcode for mul (90); and the next is the newobj opcode (115). The following 4 bytes are the metadata token for the .ctor being called. They are different because the two methods are actually hosted in different assemblies; the anonymous method is in an anonymous assembly. Unfortunately, I haven't quite gotten to the point of figuring out how to resolve these tokens. The final opcode is 42, which is ret. Every CLI function must end with ret, even functions that don't return anything.
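For reference, here is a hand-written equivalent of that IL, emitted with DynamicMethod and ILGenerator. This is only a sketch of the shape of the code being compared, not the expression compiler's actual output (which, as noted, takes a Closure as its first argument and therefore starts with ldarg.1):

// using System.Reflection.Emit;
// Emit x => new Foo(x * 2) by hand: the same opcode sequence discussed above
// (ldarg, ldc.i4.2, mul, newobj, ret), minus the Closure argument.
var dm = new DynamicMethod("CreateFoo", typeof(Foo), new[] { typeof(int) });
ILGenerator il = dm.GetILGenerator();
il.Emit(OpCodes.Ldarg_0);                                                    // load x
il.Emit(OpCodes.Ldc_I4_2);                                                   // push 2
il.Emit(OpCodes.Mul);                                                        // x * 2
il.Emit(OpCodes.Newobj, typeof(Foo).GetConstructor(new[] { typeof(int) }));  // new Foo(...)
il.Emit(OpCodes.Ret);                                                        // return the Foo

var handEmitted = (Func<int, Foo>)dm.CreateDelegate(typeof(Func<int, Foo>));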
There are a few possibilities: the closure object is somehow causing things to be slower, which might be true (but is unlikely); or the jitter hadn't compiled the method yet, and since you were firing calls in rapid succession it didn't have time to jit that path, so a slower path was invoked. The C# compiler in VS may also be emitting different calling conventions and MethodAttributes, which may act as hints to the jitter to perform different optimizations.

Ultimately, I would not even remotely worry about this difference. If you really are invoking your function 300 million times in the course of your application, and the difference incurred is 5 whole seconds, you're probably going to be OK.
Ultimately, what it comes down to is that Expression<T> is not a pre-compiled delegate. It's only an expression tree. Calling Compile on a LambdaExpression (which is what Expression<T> actually is) generates IL code at runtime and creates something akin to a DynamicMethod for it.

If you just use a Func<T> in code, the compiler pre-compiles it just like any other delegate reference.

So there are two sources of slowness here:
1. The initial compilation time to compile the Expression<T> into a delegate. This is huge. If you're doing this for every invocation, definitely don't (but this isn't the case here, since you start your Stopwatch after you call Compile).

2. It's basically a DynamicMethod after you call Compile, and DynamicMethods (even strongly typed delegates for them) ARE in fact slower to execute than direct calls. Func<T>s resolved at compile time are direct calls. There are performance comparisons out there between dynamically emitted IL and compile-time emitted IL; random URL: http://www.codeproject.com/KB/cs/dynamicmethoddelegates.aspx?msg=1160046

Also, in your stopwatch test for the Expression<T>, you should start your timer when i = 1, not 0... I believe your compiled lambda will not be JIT compiled until the first invocation, so there will be a performance hit for that first call.
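A sketch of the usual way to deal with the first point: compile the expression once, cache the resulting delegate, and reuse it on every call (the class and member names here are illustrative):

public static class FooFactory
{
    private static readonly Expression<Func<int, Foo>> FactoryExpression =
        x => new Foo(x * 2);

    // Compile() runs exactly once, when the type is initialized; every
    // subsequent call reuses the same compiled delegate.
    public static readonly Func<int, Foo> Create = FactoryExpression.Compile();
}

// Usage: only a delegate invocation per call, no repeated compilation.
Foo foo = FooFactory.Create(21);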
Just for the record: I can reproduce the numbers with the code above.

One thing to note is that both delegates create a new instance of Foo for every iteration. This could be more important than how the delegates are created. Not only does that lead to a lot of heap allocations, but GC may also affect the numbers here.
If I change the code to
and
The performance numbers are virtually identical (actually result2 is a little better than result1). This supports the theory that the expensive part is heap allocations and/or collections and not how the delegate is constructed.
UPDATE
Following the comment from Gabe, I tried changing Foo to be a struct. Unfortunately this yields more or less the same numbers as the original code, so perhaps heap allocation/garbage collection is not the cause after all.

However, I also verified the numbers for delegates of the type Func<int, int>, and they are quite similar and much lower than the numbers for the original code.

I'll keep digging and look forward to seeing more/updated answers.
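For completeness, a sketch of the kind of Func<int, int> comparison described above, assuming the obvious x => x * 2 body (the exact code used for that verification isn't shown):

// Allocation-free variants: same direct-vs-compiled comparison, but no Foo
// is created per call. Timed with Stopwatch exactly as in the question.
Func<int, int> direct = x => x * 2;

Expression<Func<int, int>> expr = x => x * 2;
Func<int, int> compiled = expr.Compile();

int sum1 = 0, sum2 = 0;
for (int i = 0; i < 300000000; i++) { sum1 += direct(i); }
for (int i = 0; i < 300000000; i++) { sum2 += compiled(i); }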