I have a function similar to the following:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void SetVariable<T>(T newValue) where T : struct {
// I know by this point that T is blittable (i.e. only unmanaged value types)
// varPtr is a void*, and is where I want to copy newValue to
*varPtr = newValue; // This won't work, but is basically what I want to do
}
I saw Marshal.StructureToIntPtr(), but it seems quite slow, and this is performance-sensitive code. If I knew the type T
I could just declare varPtr
as a T*
, but... Well, I don't.
Either way, I'm after the fastest possible way to do this. 'Safety' is not a concern: By this point in the code, I know that the size of the struct T
will fit exactly in to the memory pointed to by varPtr
.
One answer is to reimplement native memcpy instead in C#, making use of the same optimizing tricks that native memcpy attempts to do. You can see Microsoft doing this in their own source. See the Buffer.cs file in the Microsoft Reference Source:
// This is tricky to get right AND fast, so lets make it useful for the whole Fx.
// E.g. System.Runtime.WindowsRuntime!WindowsRuntimeBufferExtensions.MemCopy uses it.
internal unsafe static void Memcpy(byte* dest, byte* src, int len) {
// This is portable version of memcpy. It mirrors what the hand optimized assembly versions of memcpy typically do.
// Ideally, we would just use the cpblk IL instruction here. Unfortunately, cpblk IL instruction is not as efficient as
// possible yet and so we have this implementation here for now.
switch (len)
{
case 0:
return;
case 1:
*dest = *src;
return;
case 2:
*(short *)dest = *(short *)src;
return;
case 3:
*(short *)dest = *(short *)src;
*(dest + 2) = *(src + 2);
return;
case 4:
*(int *)dest = *(int *)src;
return;
...
Its interesting to note that they natively implement memcpy for all sizes up to 512; most of the sizes use pointer aliasing tricks to get the VM to emit instructions that operate on differing sizes. Only at 512 do they finally drop into invoking the native memcpy:
// P/Invoke into the native version for large lengths
if (len >= 512)
{
_Memcpy(dest, src, len);
return;
}
Presumably, native memcpy is even faster since it can be hand optimized to use SSE/MMX instructions to perform the copy.
As per BenVoigt's suggestion, I tried a few options. For all these tests I compiled with Any CPU architecture, on a standard VS2013 Release build, and ran the test outside of the IDE. Before each test was measured, the methods DoTestA()
and DoTestB()
were run multiple times to allow the JIT warmup.
First, I compared Marshal.StructToPtr
to a byte-by-byte loop with various struct sizes. I've shown the code below using a SixtyFourByteStruct
:
private unsafe static void DoTestA() {
fixed (SixtyFourByteStruct* fixedStruct = &structToCopy) {
byte* structStart = (byte*) fixedStruct;
byte* targetStart = (byte*) unmanagedTarget;
for (byte* structPtr = structStart, targetPtr = targetStart; structPtr < structStart + sizeof(SixtyFourByteStruct); ++structPtr, ++targetPtr) {
*targetPtr = *structPtr;
}
}
}
private static void DoTestB() {
Marshal.StructureToPtr(structToCopy, unmanagedTarget, false);
}
And the results:
>>> 500000 repetitions >>> IN NANOSECONDS (1000ns = 0.001ms)
Method Avg. Min. Max. Jitter Total
A 82ns 0ns 22,000ns 21,917ns ! 41.017ms
B 137ns 0ns 38,700ns 38,562ns ! 68.834ms
As you can see, the manual loop is faster (as I suspected). The results are similar for a sixteen-byte and four-byte struct, with the difference being more pronounced the smaller the struct goes.
So now, to try the manual copy vs using P/Invoke and memcpy:
private unsafe static void DoTestA() {
fixed (FourByteStruct* fixedStruct = &structToCopy) {
byte* structStart = (byte*) fixedStruct;
byte* targetStart = (byte*) unmanagedTarget;
for (byte* structPtr = structStart, targetPtr = targetStart; structPtr < structStart + sizeof(FourByteStruct); ++structPtr, ++targetPtr) {
*targetPtr = *structPtr;
}
}
}
private unsafe static void DoTestB() {
fixed (FourByteStruct* fixedStruct = &structToCopy) {
memcpy(unmanagedTarget, (IntPtr) fixedStruct, new UIntPtr((uint) sizeof(FourByteStruct)));
}
}
>>> 500000 repetitions >>> IN NANOSECONDS (1000ns = 0.001ms)
Method Avg. Min. Max. Jitter Total
A 61ns 0ns 28,000ns 27,938ns ! 30.736ms
B 84ns 0ns 45,900ns 45,815ns ! 42.216ms
So, it seems that the manual copy is still better in my case. Like before, the results were pretty similar for 4/16/64 byte structs (though the gap was <10ns for 64-byte size).
It occurred to me that I was only testing structures that fit on a cache line (I have a standard x86_64 CPU). So I tried a 128-byte structure, and it swung the balance in the favour of memcpy:
>>> 500000 repetitions >>> IN NANOSECONDS (1000ns = 0.001ms)
Method Avg. Min. Max. Jitter Total
A 104ns 0ns 48,300ns 48,195ns ! 52.150ms
B 84ns 0ns 38,400ns 38,315ns ! 42.284ms
Anyway, the conclusion to all that is that the byte-by-byte copy seems the fastest for any struct of size <=64 bytes on an x86_64 CPU on my machine. Take it as you will (and maybe someone will spot an inefficiency in my code anyway).
FYI. I'm posting how I leveraged the accepted answer for others' benefit as there's a twist when accessing the method via reflection because it's overloaded.
public static class Buffer
{
public unsafe delegate void MemcpyDelegate(byte* dest, byte* src, int len);
public static readonly MemcpyDelegate Memcpy;
static Buffer()
{
var methods = typeof (System.Buffer).GetMethods(BindingFlags.Static | BindingFlags.NonPublic).Where(m=>m.Name == "Memcpy");
var memcpy = methods.First(mi => mi.GetParameters().Select(p => p.ParameterType).SequenceEqual(new[] {typeof (byte*), typeof (byte*), typeof (int)}));
Memcpy = (MemcpyDelegate) memcpy.CreateDelegate(typeof (MemcpyDelegate));
}
}
Usage:
public static unsafe void MemcpyExample()
{
int src = 12345;
int dst = 0;
Buffer.Memcpy((byte*) &dst, (byte*) &src, sizeof (int));
System.Diagnostics.Debug.Assert(dst==12345);
}
public void SetVariable<T>(T newValue) where T : struct
You cannot use generics to accomplish this the fast way. The compiler doesn't take your pretty blue eyes as a guarantee that T is actually blittable, the constraint isn't good enough. You should use overloads:
public unsafe void SetVariable(int newValue) {
*(int*)varPtr = newValue;
}
public unsafe void SetVariable(double newValue) {
*(double*)varPtr = newValue;
}
public unsafe void SetVariable(Point newValue) {
*(Point*)varPtr = newValue;
}
// etc...
Which might be inconvenient, but blindingly fast. It compiles to single MOV instruction with no method call overhead in Release mode. The fastest it could be.
And the back-up case, the profiler will tell you when you need to overload:
public unsafe void SetVariable<T>(T newValue) {
Marshal.StructureToPtr(newValue, (IntPtr)varPtr, false);
}