Fastest way to copy a blittable struct to an unman

Posted 2019-04-13 07:23

Question:

I have a function similar to the following:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void SetVariable<T>(T newValue) where T : struct {
    // I know by this point that T is blittable (i.e. only unmanaged value types)

    // varPtr is a void*, and is where I want to copy newValue to
    *varPtr = newValue; // This won't work, but is basically what I want to do
}

I saw Marshal.StructureToPtr(), but it seems quite slow, and this is performance-sensitive code. If I knew the type T I could just declare varPtr as a T*, but... Well, I don't.

Either way, I'm after the fastest possible way to do this. 'Safety' is not a concern: by this point in the code, I know that the size of the struct T will fit exactly into the memory pointed to by varPtr.

Answer 1:

One option is to reimplement memcpy in C#, using the same optimization tricks that native memcpy implementations use. You can see Microsoft doing this in their own source. See the Buffer.cs file in the Microsoft Reference Source:

     // This is tricky to get right AND fast, so lets make it useful for the whole Fx.
     // E.g. System.Runtime.WindowsRuntime!WindowsRuntimeBufferExtensions.MemCopy uses it.
     internal unsafe static void Memcpy(byte* dest, byte* src, int len) {

        // This is portable version of memcpy. It mirrors what the hand optimized assembly versions of memcpy typically do.
        // Ideally, we would just use the cpblk IL instruction here. Unfortunately, cpblk IL instruction is not as efficient as
        // possible yet and so we have this implementation here for now.

        switch (len)
        {
        case 0:
            return;
        case 1:
            *dest = *src;
            return;
        case 2:
            *(short *)dest = *(short *)src;
            return;
        case 3:
            *(short *)dest = *(short *)src;
            *(dest + 2) = *(src + 2);
            return;
        case 4:
            *(int *)dest = *(int *)src;
            return;
        ...

It's interesting to note that they implement memcpy in managed code for all lengths below 512 bytes; most of the sizes use pointer aliasing tricks to get the JIT to emit instructions that operate on differing operand sizes. Only at 512 bytes and above do they finally drop into invoking the native memcpy:

        // P/Invoke into the native version for large lengths
        if (len >= 512)
        {
            _Memcpy(dest, src, len);
            return;
        }

Presumably, native memcpy is even faster at those lengths, since it can be hand-optimized to use SSE/MMX instructions to perform the copy.
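
For readers on newer frameworks, a public alternative to the internal Memcpy may be worth measuring: System.Buffer.MemoryCopy, available from .NET Framework 4.6 onward. The sketch below is only an illustration and is not part of the answer above; the class and method names are hypothetical, and it assumes the project is compiled with unsafe code enabled.

```csharp
using System;
using System.Runtime.InteropServices;

public static class MemoryCopySketch
{
    // Hypothetical helper: copies an int into freshly allocated unmanaged
    // memory with the public Buffer.MemoryCopy API and reads it back.
    public static unsafe int CopyIntToUnmanaged(int value)
    {
        IntPtr target = Marshal.AllocHGlobal(sizeof(int));
        try
        {
            // Arguments: source, destination, destination capacity in bytes, bytes to copy.
            System.Buffer.MemoryCopy(&value, (void*)target, sizeof(int), sizeof(int));
            return *(int*)target;
        }
        finally
        {
            Marshal.FreeHGlobal(target);
        }
    }
}
```

Note the argument order: unlike the internal Memcpy(dest, src, len), MemoryCopy takes the source first. No timing is claimed for this call; it would need to be benchmarked against the other options.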



Answer 2:

As per BenVoigt's suggestion, I tried a few options. For all these tests I compiled for the Any CPU architecture, as a standard VS2013 Release build, and ran the test outside of the IDE. Before each test was measured, the methods DoTestA() and DoTestB() were run multiple times to let the JIT warm up.


First, I compared Marshal.StructureToPtr to a byte-by-byte loop with various struct sizes. I've shown the code below using a SixtyFourByteStruct:

private unsafe static void DoTestA() {
    fixed (SixtyFourByteStruct* fixedStruct = &structToCopy) {
        byte* structStart = (byte*) fixedStruct;
        byte* targetStart = (byte*) unmanagedTarget;
        for (byte* structPtr = structStart, targetPtr = targetStart; structPtr < structStart + sizeof(SixtyFourByteStruct); ++structPtr, ++targetPtr) {
            *targetPtr = *structPtr;
        }
    }
}

private static void DoTestB() {
    Marshal.StructureToPtr(structToCopy, unmanagedTarget, false);
}

And the results:

>>> 500000 repetitions >>> IN NANOSECONDS (1000ns = 0.001ms)
Method   Avg.         Min.         Max.         Jitter       Total
A        82ns         0ns          22,000ns     21,917ns     ! 41.017ms
B        137ns        0ns          38,700ns     38,562ns     ! 68.834ms

As you can see, the manual loop is faster (as I suspected). The results are similar for a sixteen-byte and four-byte struct, with the difference being more pronounced the smaller the struct goes.


So now, to try the manual copy vs using P/Invoke and memcpy:

private unsafe static void DoTestA() {
    fixed (FourByteStruct* fixedStruct = &structToCopy) {
        byte* structStart = (byte*) fixedStruct;
        byte* targetStart = (byte*) unmanagedTarget;
        for (byte* structPtr = structStart, targetPtr = targetStart; structPtr < structStart + sizeof(FourByteStruct); ++structPtr, ++targetPtr) {
            *targetPtr = *structPtr;
        }
    }
}

private unsafe static void DoTestB() {
    fixed (FourByteStruct* fixedStruct = &structToCopy) {
        memcpy(unmanagedTarget, (IntPtr) fixedStruct, new UIntPtr((uint) sizeof(FourByteStruct)));
    }
}

>>> 500000 repetitions >>> IN NANOSECONDS (1000ns = 0.001ms)
Method   Avg.         Min.         Max.         Jitter       Total
A        61ns         0ns          28,000ns     27,938ns     ! 30.736ms
B        84ns         0ns          45,900ns     45,815ns     ! 42.216ms

So, it seems that the manual copy is still better in my case. Like before, the results were pretty similar for 4/16/64 byte structs (though the gap was <10ns for 64-byte size).


It occurred to me that I was only testing structures that fit on a cache line (I have a standard x86_64 CPU). So I tried a 128-byte structure, and it swung the balance in the favour of memcpy:

>>> 500000 repetitions >>> IN NANOSECONDS (1000ns = 0.001ms)
Method   Avg.         Min.         Max.         Jitter       Total
A        104ns        0ns          48,300ns     48,195ns     ! 52.150ms
B        84ns         0ns          38,400ns     38,315ns     ! 42.284ms

Anyway, the conclusion from all that is that the byte-by-byte copy seems to be the fastest for any struct of size <= 64 bytes on an x86_64 CPU, at least on my machine. Take it as you will (and maybe someone will spot an inefficiency in my code anyway).
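
One variation that might be worth adding to the comparison is copying eight bytes per iteration instead of one, in the spirit of the pointer aliasing used in Buffer.cs. The following is a minimal sketch only and was not part of the measurements above; the names WordCopySketch, CopyWords and RoundTrip are hypothetical, unsafe code must be enabled, and it assumes a CPU that tolerates unaligned 8-byte accesses (as x86_64 does).

```csharp
using System;
using System.Runtime.InteropServices;

public static class WordCopySketch
{
    // Copies len bytes from src to dest: eight bytes at a time while at least
    // eight remain, then the leftover tail one byte at a time.
    public static unsafe void CopyWords(byte* dest, byte* src, int len)
    {
        int i = 0;
        for (; i + 8 <= len; i += 8)
        {
            *(long*)(dest + i) = *(long*)(src + i);
        }
        for (; i < len; i++)
        {
            dest[i] = src[i];
        }
    }

    // Hypothetical check: round-trips a byte pattern through unmanaged memory
    // and reports whether every byte arrived intact.
    public static unsafe bool RoundTrip(int len)
    {
        byte[] source = new byte[len];
        for (int i = 0; i < len; i++) source[i] = (byte)(i * 7 + 1);

        IntPtr target = Marshal.AllocHGlobal(Math.Max(len, 1));
        try
        {
            fixed (byte* src = source)
            {
                CopyWords((byte*)target, src, len);
            }
            byte* copied = (byte*)target;
            for (int i = 0; i < len; i++)
            {
                if (copied[i] != source[i]) return false;
            }
            return true;
        }
        finally
        {
            Marshal.FreeHGlobal(target);
        }
    }
}
```

Whether this beats the byte-by-byte loop for 4, 16 or 64 byte structs would have to be measured with the same harness as above.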



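Answer 3:

FYI: I'm posting how I used the accepted answer, for others' benefit, because there's a twist when accessing the method via reflection: it's overloaded, so you have to select the right overload by its parameter types.

using System.Linq;
using System.Reflection;

public static class Buffer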
{
    public unsafe delegate void MemcpyDelegate(byte* dest, byte* src, int len);

    public static readonly MemcpyDelegate Memcpy;
    static Buffer()
    {
        var methods = typeof (System.Buffer).GetMethods(BindingFlags.Static | BindingFlags.NonPublic).Where(m=>m.Name == "Memcpy");
        var memcpy = methods.First(mi => mi.GetParameters().Select(p => p.ParameterType).SequenceEqual(new[] {typeof (byte*), typeof (byte*), typeof (int)}));
        Memcpy = (MemcpyDelegate) memcpy.CreateDelegate(typeof (MemcpyDelegate));
    }
}

Usage:

public static unsafe void MemcpyExample()
{
     int src = 12345;
     int dst = 0;
     Buffer.Memcpy((byte*) &dst, (byte*) &src, sizeof (int));
     System.Diagnostics.Debug.Assert(dst==12345);
}


Answer 4:

   public void SetVariable<T>(T newValue) where T : struct

You cannot use generics to accomplish this the fast way. The compiler doesn't take your pretty blue eyes as a guarantee that T is actually blittable; the struct constraint isn't good enough. You should use overloads:

    public unsafe void SetVariable(int newValue) {
        *(int*)varPtr = newValue;
    }
    public unsafe void SetVariable(double newValue) {
        *(double*)varPtr = newValue;
    }
    public unsafe void SetVariable(Point newValue) {
        *(Point*)varPtr = newValue;
    }
    // etc...

This might be inconvenient, but it is blindingly fast. It compiles to a single MOV instruction with no method-call overhead in Release mode, which is as fast as it could be.

And for the fallback case (the profiler will tell you when you need to add an overload):

    public unsafe void SetVariable<T>(T newValue) {
        Marshal.StructureToPtr(newValue, (IntPtr)varPtr, false);
    }
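
For readers on newer compilers: C# 7.3 added the unmanaged generic constraint, which does give the compiler the guarantee it needs to allow a T* cast. The sketch below assumes C# 7.3 or later and unsafe code enabled; the VariableSlot class, its constructor and GetVariable are hypothetical scaffolding around the varPtr field from the question. Whether the JIT reduces the generic version to the same single store as the overloads is not measured here.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper that owns the unmanaged block varPtr points to.
public sealed unsafe class VariableSlot : IDisposable
{
    private void* varPtr;

    public VariableSlot(int sizeInBytes)
    {
        varPtr = (void*)Marshal.AllocHGlobal(sizeInBytes);
    }

    // 'unmanaged' restricts T to types with no managed references,
    // so the compiler accepts the T* cast.
    public void SetVariable<T>(T newValue) where T : unmanaged
    {
        *(T*)varPtr = newValue;
    }

    public T GetVariable<T>() where T : unmanaged
    {
        return *(T*)varPtr;
    }

    public void Dispose()
    {
        if (varPtr != null)
        {
            Marshal.FreeHGlobal((IntPtr)varPtr);
            varPtr = null;
        }
    }
}
```

As in the question, the caller remains responsible for making sure the size of T matches the memory behind varPtr; the constraint checks only the kind of type, not its size.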