I'm trying to call a native function from a managed assembly. I've done this with pre-compiled libraries and everything has gone well. Now I'm building my own library, and I can't get it to work.
The native DLL source is the following:
#define DERM_SIMD_EXPORT __declspec(dllexport)
#define DERM_SIMD_API __cdecl
extern "C" {
    DERM_SIMD_EXPORT void DERM_SIMD_API Matrix4x4_Multiply_SSE(float *result, float *left, float *right);
}
void DERM_SIMD_API Matrix4x4_Multiply_SSE(float *result, float *left, float *right) {
    __asm {
        ....
    }
}
Below is the managed code, which loads the library and creates a delegate from a function pointer.
public unsafe class Simd
{
    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    public delegate void MatrixMultiplyDelegate(float* result, float* left, float* right);
    public static MatrixMultiplyDelegate MatrixMultiply;
    public static void LoadSimdExtensions()
    {
        string assemblyPath = "Derm.Simd.dll";
        IntPtr address = GetProcAddress.GetAddress(assemblyPath, "Matrix4x4_Multiply_SSE");
        if (address != IntPtr.Zero) {
            MatrixMultiply = (MatrixMultiplyDelegate)Marshal.GetDelegateForFunctionPointer(address, typeof(MatrixMultiplyDelegate));
        }
    }
}
With the sources above the code runs without errors (the function pointer is obtained, and the delegate is actually created).
The problem arises when I call the delegate: it is executed (and I can even debug it!), but at function exit the managed application raises a System.ExecutionEngineException (when it doesn't simply exit without any exception).
The actual problem is the function implementation: it contains an asm block with SSE instructions; if I remove the asm block, the code works perfectly.
I suspect I am missing some register save/restore assembly, but I'm completely ignorant on this side.
The strange thing is that if I change the calling convention to __stdcall, the debug version "seems" to work, while the release version behaves as if the __cdecl calling convention were used.
(And while we are here, can you clarify whether the calling convention matters?)
OK, thanks to David Heffernan's comment I found out that the instructions causing the problem are the following:
movups result[ 0], xmm4;
movups result[16], xmm5;
A movups instruction moves 16 bytes into (possibly unaligned) memory.
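For reference, here is a minimal sketch of the same kind of 16-byte unaligned move written with SSE intrinsics instead of inline assembly. The helper name `copy4_unaligned` is hypothetical and not part of the library above; `_mm_loadu_ps` and `_mm_storeu_ps` are the intrinsics that correspond to movups loads and stores, and the sketch assumes an x86 target with SSE available.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_storeu_ps

// Hypothetical helper: copy four floats (16 bytes) from src to dst.
// Neither pointer needs to be 16-byte aligned.
void copy4_unaligned(float* dst, const float* src) {
    assert(dst != nullptr && src != nullptr);
    __m128 v = _mm_loadu_ps(src);  // load 16 bytes starting at src
    _mm_storeu_ps(dst, v);         // store 16 bytes to the memory dst points at
}
```

Note that the store takes the pointer value as its destination, so the bytes land in the buffer the pointer refers to.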
The function is called by the following code:
unsafe {
    float* prodFix = (float*)prod.MatrixBuffer.AlignedBuffer.ToPointer();
    float* m1Fix = (float*)m2.MatrixBuffer.AlignedBuffer.ToPointer();
    float* m2Fix = (float*)m1.MatrixBuffer.AlignedBuffer.ToPointer();
    if (Simd.Simd.MatrixMultiply == null) {
        // ... unsafe C# code
    } else {
        Simd.Simd.MatrixMultiply(prodFix, m1Fix, m2Fix);
    }
}
Where MatrixBuffer is a class of mine; its member AlignedBuffer is allocated in the following way:
// Allocate unmanaged buffer
mUnmanagedBuffer = Marshal.AllocHGlobal(new IntPtr((long)(size + alignment - 1)));
// Align buffer pointer
long misalignment = mUnmanagedBuffer.ToInt64() % alignment;
if (misalignment != 0)
    mAlignedBuffer = new IntPtr(mUnmanagedBuffer.ToInt64() + misalignment);
else
    mAlignedBuffer = mUnmanagedBuffer;
Maybe the error is caused by Marshal.AllocHGlobal or IntPtr black magic?
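For comparison, here is a minimal sketch of the usual round-up computation next to the one used in the allocation code above, written in C++ on plain integers. The function names `align_up` and `add_misalignment` are hypothetical and exist only for this illustration. The point it shows: adding `alignment - misalignment` reaches the next boundary, while adding `misalignment` itself generally does not.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round addr up to the next multiple of alignment.
std::uintptr_t align_up(std::uintptr_t addr, std::uintptr_t alignment) {
    assert(alignment != 0);
    std::uintptr_t misalignment = addr % alignment;
    // Skip the remaining (alignment - misalignment) bytes to reach the boundary.
    return misalignment == 0 ? addr : addr + (alignment - misalignment);
}

// Mirrors the computation in the allocation code above, for comparison only.
std::uintptr_t add_misalignment(std::uintptr_t addr, std::uintptr_t alignment) {
    assert(alignment != 0);
    return addr + addr % alignment;
}
```

For an address of 0x1004 and an alignment of 16, `align_up` yields 0x1010, whereas `add_misalignment` yields 0x1008, which is still not a multiple of 16. The round-up never adds more than `alignment - 1` bytes, which matches the `size + alignment - 1` allocation size.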
This is the minimal source that reproduces the error:
void Matrix4x4_Multiply_SSE(float *result, float *left, float *right)
{
    __asm {
        movups xmm0, right[ 0];
        movups result, xmm0;
    }
}
int main(int argc, char *argv[])
{
    float r0[16];
    float m1[16], m2[16];
    m1[ 0] = 1.0f; m1[ 4] = 0.0f; m1[ 8] = 0.0f; m1[12] = 0.0f;
    m1[ 1] = 0.0f; m1[ 5] = 1.0f; m1[ 9] = 0.0f; m1[13] = 0.0f;
    m1[ 2] = 0.0f; m1[ 6] = 0.0f; m1[10] = 1.0f; m1[14] = 0.0f;
    m1[ 3] = 0.0f; m1[ 7] = 0.0f; m1[11] = 0.0f; m1[15] = 1.0f;
    m2[ 0] = 1.0f; m2[ 4] = 0.0f; m2[ 8] = 0.0f; m2[12] = 0.0f;
    m2[ 1] = 0.0f; m2[ 5] = 1.0f; m2[ 9] = 0.0f; m2[13] = 0.0f;
    m2[ 2] = 0.0f; m2[ 6] = 0.0f; m2[10] = 1.0f; m2[14] = 0.0f;
    m2[ 3] = 0.0f; m2[ 7] = 0.0f; m2[11] = 0.0f; m2[15] = 1.0f;
    r0[ 0] = 0.0f; r0[ 4] = 0.0f; r0[ 8] = 0.0f; r0[12] = 0.0f;
    r0[ 1] = 0.0f; r0[ 5] = 0.0f; r0[ 9] = 0.0f; r0[13] = 0.0f;
    r0[ 2] = 0.0f; r0[ 6] = 0.0f; r0[10] = 0.0f; r0[14] = 0.0f;
    r0[ 3] = 0.0f; r0[ 7] = 0.0f; r0[11] = 0.0f; r0[15] = 0.0f;
    Matrix4x4_Multiply_SSE(r0, m1, m2);
    Matrix4x4_Multiply_SSE(r0, m1, m2);
    return (0);
}
In practice, after the second movups the value of result (which is stored on the stack) changes, and the contents of xmm0 are stored at the modified (and wrong) address held in result.
After stepping out of *Matrix4x4_Multiply_SSE*, the original memory isn't modified.
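One way to picture this symptom is with a small portable C++ sketch. It assumes that the inline assembler treats the name `result` as the stack slot holding the pointer rather than the buffer the pointer refers to. `FakeFrame`, `store_to_variable` and `store_through_pointer` are hypothetical names used only to model that situation; this is not the actual stack layout produced by the compiler.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for the piece of stack that holds the `result`
// argument, followed by whatever happens to sit next to it.
struct FakeFrame {
    float* result;                 // the pointer variable itself
    unsigned char neighbours[16];  // adjacent stack bytes
};

// Writes 16 bytes over the pointer variable and its neighbours.
// The pointer value changes; the buffer it referred to is untouched.
void store_to_variable(FakeFrame* frame, const float* value) {
    assert(frame != nullptr && value != nullptr);
    std::memcpy(frame, value, 16);
}

// Writes 16 bytes to the memory the pointer refers to.
// The pointer value is preserved; the buffer receives the data.
void store_through_pointer(FakeFrame* frame, const float* value) {
    assert(frame != nullptr && value != nullptr);
    std::memcpy(frame->result, value, 16);
}
```

Under that assumption, the first function matches what the debugger shows: the pointer is corrupted and the destination buffer keeps its old contents. The second function is what a store through the pointer would look like.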
What am I missing?