When I execute the following code in release mode on my machine, invoking a delegate with a non-null target is always slightly faster than invoking a delegate with a null target (I expected it to be equivalent or slower).
I'm really not looking for micro optimization but I was wondering why this is the case?
using System;
using System.Diagnostics;

class Program
{
    static void Main(string[] args)
    {
        // Warmup code
        long durationWithTarget =
            MeasureDuration(() => new DelegatePerformanceTester(withTarget: true).Run());
        Console.WriteLine($"With target: {durationWithTarget}");
        long durationWithoutTarget =
            MeasureDuration(() => new DelegatePerformanceTester(withTarget: false).Run());
        Console.WriteLine($"Without target: {durationWithoutTarget}");
    }

    /// <summary>
    /// Measures the duration of an action.
    /// </summary>
    /// <param name="action">Action whose duration has to be measured.</param>
    /// <returns>The duration in milliseconds.</returns>
    private static long MeasureDuration(Action action)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        return stopwatch.ElapsedMilliseconds;
    }
}

class DelegatePerformanceTester
{
    public DelegatePerformanceTester(bool withTarget)
    {
        if (withTarget)
        {
            // Instance method: the delegate's Target is this object.
            _func = AddNotStatic;
        }
        else
        {
            // Static method: the delegate's Target is null.
            _func = AddStatic;
        }
    }

    private readonly Func<double, double, double> _func;

    private double AddNotStatic(double x, double y) => x + y;

    private static double AddStatic(double x, double y) => x + y;

    public void Run()
    {
        const int loops = 1000000000;
        for (int i = 0; i < loops; i++)
        {
            double funcResult = _func.Invoke(1d, 2d);
        }
    }
}
I'll write this one up, there is pretty decent programming advice behind it that ought to matter to any C# programmer who cares about writing fast code. In general I caution against relying on micro-benchmarks: differences of 15% or less are usually not statistically significant, due to the unpredictability of code execution speed on a modern CPU core. A good way to reduce the odds of measuring something that is not there is to repeat a test at least 10 times to remove caching effects, and to swap the order of the tests so that code alignment effects can be eliminated.
But what you saw is real, delegates that invoke a static method are in fact slower. The effect is quite small in x86 code but it is significantly worse in x64 code, be sure to tinker with the Project > Properties > Build tab > Prefer 32-bit and Platform target settings to try both.
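The repeat-and-swap advice above can be sketched roughly as follows. This is only a minimal illustration, not the harness used for the measurements in this answer: the names BenchmarkSketch, TimeOnce, Compare and Spin are hypothetical, and keeping the best (minimum) time per candidate is my own assumption about how to summarize the runs.

```csharp
using System;
using System.Diagnostics;

static class BenchmarkSketch
{
    // Hypothetical helper: times a single run of an action, in milliseconds.
    public static double TimeOnce(Action action)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds;
    }

    // Runs both candidates 'rounds' times, swapping which one goes first on
    // every round, and reports the best (minimum) time seen for each.
    // Using the minimum is an assumption of this sketch, not a rule.
    public static void Compare(Action a, Action b, int rounds,
                               out double bestA, out double bestB)
    {
        if (rounds < 1) throw new ArgumentOutOfRangeException(nameof(rounds));
        bestA = double.MaxValue;
        bestB = double.MaxValue;
        for (int round = 0; round < rounds; round++)
        {
            if (round % 2 == 0)
            {
                bestA = Math.Min(bestA, TimeOnce(a));
                bestB = Math.Min(bestB, TimeOnce(b));
            }
            else
            {
                bestB = Math.Min(bestB, TimeOnce(b));
                bestA = Math.Min(bestA, TimeOnce(a));
            }
        }
    }

    // Hypothetical workload, only here so the sketch has something to time.
    static double Spin(int n)
    {
        double sum = 0;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    static void Main()
    {
        double bestA, bestB;
        Compare(() => Spin(1000000), () => Spin(1000000), 10, out bestA, out bestB);
        Console.WriteLine($"Best A: {bestA} ms, best B: {bestB} ms");
    }
}
```

To apply it to the question's code, the two actions would be the Run() calls of the with-target and without-target testers.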
Knowing why it is slower requires looking at the machine code that the jitter generates. In the case of delegates, that code is very well hidden. You will not see it when you look at the code with Debug > Windows > Disassembly. And you can't even single-step through the code, the managed debugger was written to hide it and completely refuses to show it. I'll have to describe a technique to put the "visual" back into Visual Studio.
I have to talk a bit about "stubs". A stub is a little sliver of machine code that the CLR dynamically creates in addition to the code that the jitter generates. Stubs are used to implement interfaces, they provide the flexibility that the order of the methods in the method table for a class does not have to match the order of the interface methods. And they matter for delegates, the subject of this question. Stubs also matter to just-in-time compilation, the initial code in a stub points to an entrypoint into the jitter to get a method compiled when it is invoked. After which the stub is replaced, now calling the jitted target method. It is the stub that makes the static method call slower, the stub for the static method target is more elaborate than the stub for an instance method.
To see the stubs, you have to wrangle the debugger to force it to show their code. Some setup is required: first use Tools > Options > Debugging > General. Untick the "Just My Code" checkbox, untick the "Suppress JIT optimization" checkbox. If you use VS2015 then tick "Use Managed Compatibility Mode", the VS2015 debugger is very buggy and gets seriously in the way for this kind of debugging, this option provides a workaround by forcing the VS2010 managed debugger engine to be used. Switch to the Release configuration. Then Project > Properties > Debug, tick the "Enable native code debugging" checkbox. And Project > Properties > Build, untick the "Prefer 32-bit" checkbox and "Platform target" should be AnyCPU.
Set a breakpoint on the Run() method, beware that breakpoints are not very accurate in optimized code. Setting it on the method header is best. Once it hits, use Debug > Windows > Disassembly to see the machine code that the jitter generated. The delegate invoke call looks like this on a Haswell core, it might not match what you see if you have an older processor that doesn't support AVX yet:
funcResult += _func.Invoke(1d, 2d);
0000001a mov rax,qword ptr [rsi+8] ; rax = _func
0000001e mov rcx,qword ptr [rax+8] ; rcx = _func._methodBase (?)
00000022 vmovsd xmm2,qword ptr [0000000000000070h] ; arg3 = 2d
0000002b vmovsd xmm1,qword ptr [0000000000000078h] ; arg2 = 1d
00000034 call qword ptr [rax+18h] ; call stub
A 64-bit method call passes the first 4 arguments in registers, any additional arguments are passed through the stack (not here). The XMM registers are used here because the arguments are floating point. At this point the jitter cannot know yet whether the target method is static or instance, that can't be found out until this code actually executes. It is the job of the stub to hide the difference. The jitter assumes the target will be an instance method and reserves the first argument slot (RCX) for this, that's why I annotated the two doubles as arg2 and arg3.
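To make the "cannot know yet" point concrete, here is a small sketch showing that one delegate type can hold either kind of method, and that the difference is only visible on the delegate object at run time. The class name TargetDemo and the helper HasTarget are hypothetical; Delegate.Target and MethodInfo.IsStatic are the standard framework members.

```csharp
using System;

public class TargetDemo
{
    public double AddNotStatic(double x, double y) => x + y;
    public static double AddStatic(double x, double y) => x + y;

    // Hypothetical helper: true when the delegate carries a 'this' object.
    public static bool HasTarget(Delegate d) => d.Target != null;

    static void Main()
    {
        Func<double, double, double> withTarget = new TargetDemo().AddNotStatic;
        Func<double, double, double> withoutTarget = AddStatic;

        // Both variables have the same static type, so the call site that
        // invokes them is compiled the same way for either one.
        Console.WriteLine(HasTarget(withTarget));            // True
        Console.WriteLine(HasTarget(withoutTarget));         // False
        Console.WriteLine(withTarget.Method.IsStatic);       // False
        Console.WriteLine(withoutTarget.Method.IsStatic);    // True
    }
}
```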
Set a breakpoint on the CALL instruction, the second time it hits (so after the stub no longer points into the jitter) you can have a look at it. That has to be done by hand, use Debug > Windows > Registers and copy the value of the RAX register. Debug > Windows > Memory > Memory1 and paste the value, put "0x" in front of it and add 0x18. Right-click that window and select "8-byte Integer", copy the first displayed value. That is the address of the stub code.
Now the trick, at this point the managed debugging engine is still being used and will not allow you to look at the stub code. You have to force a mode switch so the unmanaged debugging engine is in control. Use Debug > Windows > Call Stack and double-click a method call on the bottom, like RtlUserThreadStart. That forces the debugger to switch engines. Now you are good to go and can paste the address in the Address box of the Disassembly window, put "0x" in front of it. Out pops the stub code:
00007FFCE66D0100 jmp 00007FFCE66D0E40
Very simple one, a straight jump to the delegate target method. This will be fast code. The jitter guessed correctly at an instance method and the delegate object already provided the this argument in the RCX register, so nothing special needs to be done.
Proceed to the second test and do the exact same thing to look at the stub for the static call. Now the stub is very different:
000001FE559F0850 mov rax,rsp ; ?
000001FE559F0853 mov r11,rcx ; r11 = _func (?)
000001FE559F0856 movaps xmm0,xmm1 ; shuffle arg2 into right register
000001FE559F0859 movaps xmm1,xmm2 ; shuffle arg3 into right register
000001FE559F085C mov r10,qword ptr [r11+20h] ; r10 = _func.Method
000001FE559F0860 add r11,20h ; ?
000001FE559F0864 jmp r10 ; jump to _func.Method
The code is a bit wonky and not optimal, Microsoft could probably do a better job here, and I'm not 100% sure I annotated it correctly. I guess that the unnecessary mov rax,rsp instruction is only relevant for stubs to methods with more than 4 arguments. No idea why the add instruction is necessary. The detail that matters most is the pair of XMM register moves: the stub has to reshuffle the arguments because the static method does not have the this argument, so every argument moves up one slot. It is this reshuffling requirement that makes the code slower.
You can do the same exercise with the x86 jitter, the static method stub now looks like:
04F905B4 mov eax,ecx
04F905B6 add eax,10h
04F905B9 jmp dword ptr [eax] ; jump to _func.Method
Much simpler than the 64-bit stub, which is why 32-bit code does not suffer from the slowdown nearly as much. One reason it is so very different is that 32-bit code passes floating point on the FPU stack and they don't have to be reshuffled. This won't necessarily be faster when you use integral or object arguments.
Very arcane, hope I didn't put everybody to sleep yet. Beware I might have gotten some annotations wrong, I don't fully understand stubs and the way the CLR cooks delegate object members to make code as fast as possible. But there is certainly decent programming advice here. You really should favor instance methods as delegate targets, making them static is not an optimization.
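One way to act on that advice when the logic is naturally static is to hang it on a shared instance, so the delegate still gets a non-null target. This is just a sketch: Adder, Instance and AdviceDemo are hypothetical names, and whether it buys anything measurable in your code is something you have to benchmark yourself.

```csharp
using System;

public sealed class Adder
{
    // Single shared instance, so creating delegates does not allocate a new Adder.
    public static readonly Adder Instance = new Adder();

    private Adder() { }

    // Instance method: delegates created from it carry Instance as their Target.
    public double Add(double x, double y) => x + y;
}

static class AdviceDemo
{
    static void Main()
    {
        Func<double, double, double> add = Adder.Instance.Add;
        Console.WriteLine(add(1d, 2d));                              // 3
        Console.WriteLine(ReferenceEquals(add.Target, Adder.Instance)); // True
    }
}
```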