.NET delegate without target slower than with target

When I execute the following code in release mode on my machine, the execution of a delegate with a non-null target is always slightly faster than when the delegate has a null target (I expected it to be equivalent or slower).

I'm really not looking for micro-optimization, but I was wondering why this is the case?

static void Main(string[] args)
{
    // Warmup code

    long durationWithTarget = 
        MeasureDuration(() => new DelegatePerformanceTester(withTarget: true).Run());

    Console.WriteLine($"With target: {durationWithTarget}");

    long durationWithoutTarget = 
        MeasureDuration(() => new DelegatePerformanceTester(withTarget: false).Run());

    Console.WriteLine($"Without target: {durationWithoutTarget}");
}

/// <summary>
/// Measures the duration of an action.
/// </summary>
/// <param name="action">Action whose duration has to be measured.</param>
/// <returns>The duration in milliseconds.</returns>
private static long MeasureDuration(Action action)
{
    Stopwatch stopwatch = Stopwatch.StartNew();

    action();

    return stopwatch.ElapsedMilliseconds;
}

class DelegatePerformanceTester
{
    public DelegatePerformanceTester(bool withTarget)
    {
        if (withTarget)
        {
            _func = AddNotStatic;
        }
        else
        {
            _func = AddStatic;
        }
    }
    private readonly Func<double, double, double> _func;

    private double AddNotStatic(double x, double y) => x + y;
    private static double AddStatic(double x, double y) => x + y;

    public void Run()
    {
        const int loops = 1000000000;
        for (int i = 0; i < loops; i++)
        {
            double funcResult = _func.Invoke(1d, 2d);
        }
    }
}

I'll write this one up, since there is pretty decent programming advice behind it that ought to matter to any C# programmer who cares about writing fast code. In general I caution against relying on micro-benchmarks: differences of 15% or less are generally not statistically significant, due to the unpredictability of code execution speed on a modern CPU core. A good way to reduce the odds of measuring something that is not there is to repeat a test at least 10 times to remove caching effects, and to swap the order of the tests so that code alignment effects can be eliminated.
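A minimal sketch of that measuring advice, in case it helps. The names RepeatedBenchmark, TimeOnce, Compare, Median and Spin are made up for illustration, and taking the median is my own choice here, not something the advice above prescribes. It times two actions over several rounds and alternates which one runs first:

```csharp
using System;
using System.Diagnostics;
using System.Linq;

static class RepeatedBenchmark
{
    // Hypothetical helper: times a single run of an action, in milliseconds.
    public static double TimeOnce(Action action)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds;
    }

    // Runs both actions 'rounds' times, swapping which one goes first on
    // every round, and reports the median time of each.
    public static void Compare(Action a, Action b, int rounds,
                               out double medianA, out double medianB)
    {
        if (rounds < 1) throw new ArgumentOutOfRangeException(nameof(rounds));

        var timesA = new double[rounds];
        var timesB = new double[rounds];
        for (int round = 0; round < rounds; round++)
        {
            if (round % 2 == 0)
            {
                timesA[round] = TimeOnce(a);
                timesB[round] = TimeOnce(b);
            }
            else
            {
                timesB[round] = TimeOnce(b);
                timesA[round] = TimeOnce(a);
            }
        }
        medianA = Median(timesA);
        medianB = Median(timesB);
    }

    // Median of a non-empty array; averages the two middle values when the
    // length is even.
    public static double Median(double[] values)
    {
        double[] sorted = values.OrderBy(v => v).ToArray();
        int mid = sorted.Length / 2;
        return sorted.Length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}

static class Program
{
    // Placeholder workload, an assumption for the demo only.
    static double Spin(int iterations)
    {
        double sum = 0;
        for (int i = 0; i < iterations; i++) sum += i;
        return sum;
    }

    static void Main()
    {
        Action first = () => Spin(1000000);
        Action second = () => Spin(1000000);

        double medianFirst, medianSecond;
        RepeatedBenchmark.Compare(first, second, 10, out medianFirst, out medianSecond);

        Console.WriteLine($"First:  {medianFirst:F3} ms");
        Console.WriteLine($"Second: {medianSecond:F3} ms");
    }
}
```

With the code from the question, the two actions would be () => new DelegatePerformanceTester(withTarget: true).Run() and the withTarget: false counterpart. Alternating the order at run time is only one reading of "swap"; swapping the two tests in the source and rebuilding is the variant that also changes code alignment.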

But what you saw is real: delegates that invoke a static method are in fact slower. The effect is quite small in x86 code but significantly worse in x64 code. Be sure to tinker with the Project > Properties > Build tab > Prefer 32-bit and Platform target settings to try both.

Knowing why it is slower requires looking at the machine code that the jitter generates. In the case of delegates, that code is very well hidden. You will not see it when you look at the code with Debug > Windows > Disassembly, and you can't even single-step through it: the managed debugger was written to hide it and completely refuses to show it. I'll have to describe a technique to put the "visual" back into Visual Studio.

First I have to talk a bit about "stubs". A stub is a little sliver of machine code that the CLR dynamically creates in addition to the code that the jitter generates. Stubs are used to implement interfaces; they provide the flexibility that the order of the methods in the method table for a class does not have to match the order of the interface methods. And they matter for delegates, the subject of this question. Stubs also matter to just-in-time compilation: the initial code in a stub points to an entrypoint into the jitter to get a method compiled when it is first invoked, after which the stub is replaced and now calls the jitted target method. It is the stub that makes the static method call slower: the stub for a static method target is more elaborate than the stub for an instance method.
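From the C# side, the two kinds of delegate can be told apart with the public Delegate.Target and Delegate.Method properties. A small sketch, reusing the AddNotStatic and AddStatic names from the question; the TargetDemo class and its Create* helpers are made up for illustration:

```csharp
using System;

class TargetDemo
{
    private double AddNotStatic(double x, double y) => x + y;
    private static double AddStatic(double x, double y) => x + y;

    // Delegate bound to an instance method: Target is the instance.
    public static Func<double, double, double> CreateWithTarget()
    {
        var instance = new TargetDemo();
        return instance.AddNotStatic;
    }

    // Delegate bound to a static method: Target is null.
    public static Func<double, double, double> CreateWithoutTarget()
    {
        return AddStatic;
    }

    static void Main()
    {
        Func<double, double, double> withTarget = CreateWithTarget();
        Func<double, double, double> withoutTarget = CreateWithoutTarget();

        Console.WriteLine(withTarget.Target != null);     // True
        Console.WriteLine(withoutTarget.Target != null);  // False
        Console.WriteLine(withTarget.Method.IsStatic);    // False
        Console.WriteLine(withoutTarget.Method.IsStatic); // True
    }
}
```

This only shows what the public API reports. It says nothing about which stub the CLR picks; that still has to be seen in the debugger.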

To see the stubs, you have to wrangle the debugger to force it to show their code. Some setup is required. First use Tools > Options > Debugging > General: untick the "Just My Code" checkbox and untick the "Suppress JIT optimization" checkbox. If you use VS2015 then tick "Use Managed Compatibility Mode"; the VS2015 debugger is very buggy and gets seriously in the way for this kind of debugging, and this option provides a workaround by forcing the VS2010 managed debugger engine to be used. Switch to the Release configuration. Then in Project > Properties > Debug, tick the "Enable native code debugging" checkbox. And in Project > Properties > Build, untick the "Prefer 32-bit" checkbox; "Platform target" should be AnyCPU.

Set a breakpoint on the Run() method. Beware that breakpoints are not very accurate in optimized code, so setting it on the method header is best. Once it hits, use Debug > Windows > Disassembly to see the machine code that the jitter generated. The delegate invoke call looks like this on a Haswell core; it might not match what you see if you have an older processor that doesn't support AVX yet:

                funcResult += _func.Invoke(1d, 2d);
0000001a  mov         rax,qword ptr [rsi+8]               ; rax = _func              
0000001e  mov         rcx,qword ptr [rax+8]               ; rcx = _func._target (this)
00000022  vmovsd      xmm2,qword ptr [0000000000000070h]  ; arg3 = 2d
0000002b  vmovsd      xmm1,qword ptr [0000000000000078h]  ; arg2 = 1d
00000034  call        qword ptr [rax+18h]                 ; call stub

A 64-bit method call passes the first 4 arguments in registers; any additional arguments are passed through the stack (not needed here). The XMM registers are used because the arguments are floating point. At this point the jitter cannot know yet whether the target method is static or instance; that can't be found out until this code actually executes. It is the job of the stub to hide the difference. The call site assumes it will be an instance method, with this as the first argument in RCX, which is why I annotated the values as arg2 and arg3.

Set a breakpoint on the CALL instruction. The second time it hits (so after the stub no longer points into the jitter) you can have a look at the stub. That has to be done by hand: use Debug > Windows > Registers and copy the value of the RAX register. Then open Debug > Windows > Memory > Memory1 and paste the value, put "0x" in front of it and add 0x18. Right-click that window and select "8-byte Integer", then copy the first displayed value. That is the address of the stub code.

Now the trick: at this point the managed debugging engine is still being used and will not allow you to look at the stub code. You have to force a mode switch so the unmanaged debugging engine is in control. Use Debug > Windows > Call Stack and double-click a method call at the bottom, like RtlUserThreadStart. That forces the debugger to switch engines. Now you are good to go and can paste the address in the Address box, with "0x" in front of it. Out pops the stub code:

  00007FFCE66D0100  jmp         00007FFCE66D0E40

A very simple one, a straight jump to the delegate target method. This will be fast code. The jitter guessed correctly at an instance method and the delegate object already provided the this argument in the RCX register, so nothing special needs to be done.

Proceed to the second test and do the exact same thing to look at the stub for the static call. Now the stub is very different:

000001FE559F0850  mov         rax,rsp                 ; ?
000001FE559F0853  mov         r11,rcx                 ; r11 = _func (?)
000001FE559F0856  movaps      xmm0,xmm1               ; shuffle arg2 into right register
000001FE559F0859  movaps      xmm1,xmm2               ; shuffle arg3 into right register
000001FE559F085C  mov         r10,qword ptr [r11+20h] ; r10 = _func.Method 
000001FE559F0860  add         r11,20h                 ; ?
000001FE559F0864  jmp         r10                     ; jump to _func.Method

The code is a bit wonky and not optimal, Microsoft could probably do a better job here, and I'm not 100% sure I annotated it correctly. I guess that the unnecessary mov rax,rsp instruction is only relevant for stubs to methods with more than 4 arguments. I have no idea why the add instruction is necessary. The detail that matters most is the XMM register moves: the stub has to reshuffle the arguments because the static method does not have the this argument. It is this reshuffling requirement that makes the code slower.

You can do the same exercise with the x86 jitter, the static method stub now looks like:

04F905B4  mov         eax,ecx  
04F905B6  add         eax,10h  
04F905B9  jmp         dword ptr [eax]      ; jump to _func.Method

Much simpler than the 64-bit stub, which is why 32-bit code does not suffer from the slowdown nearly as much. One reason it is so very different is that 32-bit code passes floating point arguments on the stack rather than in registers, so they don't have to be reshuffled. This won't necessarily be as fast when you use integral or object arguments.

Very arcane, I hope I didn't put everybody to sleep yet. Beware that I might have gotten some annotations wrong; I don't fully understand stubs and the way the CLR cooks delegate object members to make code as fast as possible. But there is certainly decent programming advice here: you really should favor instance methods as delegate targets. Making them static is not an optimization.
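One way to act on that advice when the logic already lives in a static method is to route the delegate through an instance method on a shared object. This is a sketch only; MathHelpers, AddTarget and CreateInstanceBoundAdd are hypothetical names, and whether the wrapper pays off depends on the jitter inlining the forwarded call, so measure before relying on it:

```csharp
using System;

static class MathHelpers
{
    // Existing static logic that we would like to call through a delegate.
    public static double Add(double x, double y) => x + y;
}

sealed class AddTarget
{
    // Single shared instance, so creating the delegate needs no extra allocation
    // beyond the delegate object itself.
    public static readonly AddTarget Instance = new AddTarget();

    private AddTarget() { }

    // Instance method that forwards to the static one; the delegate binds to
    // this method, so it gets a non-null Target.
    public double Add(double x, double y) => MathHelpers.Add(x, y);
}

static class Program
{
    public static Func<double, double, double> CreateInstanceBoundAdd()
    {
        return AddTarget.Instance.Add;
    }

    static void Main()
    {
        Func<double, double, double> add = CreateInstanceBoundAdd();

        Console.WriteLine(add(1d, 2d));         // 3
        Console.WriteLine(add.Target != null);  // True
    }
}
```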