Invoking native code with hand-written assembly

I'm trying to call a native function from a managed assembly. I've done this with pre-compiled libraries and everything has gone well. Now I'm building my own library, and I can't get it to work.

The native DLL source is the following:

#define DERM_SIMD_EXPORT        __declspec(dllexport)

#define DERM_SIMD_API           __cdecl

extern "C" {

    DERM_SIMD_EXPORT void DERM_SIMD_API Matrix4x4_Multiply_SSE(float *result, float *left, float *right);

}

void DERM_SIMD_API Matrix4x4_Multiply_SSE(float *result, float *left, float *right) {
    __asm {
       ....
    }
}

Below is the managed code, which loads the library and creates a delegate from a function pointer.

public unsafe class Simd
{
    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    public delegate void MatrixMultiplyDelegate(float* result, float* left, float* right);

    public static MatrixMultiplyDelegate MatrixMultiply;

    public static void LoadSimdExtensions()
    {
        string assemblyPath = "Derm.Simd.dll";

        IntPtr address = GetProcAddress.GetAddress(assemblyPath, "Matrix4x4_Multiply_SSE");

        if (address != IntPtr.Zero) {
            MatrixMultiply = (MatrixMultiplyDelegate)Marshal.GetDelegateForFunctionPointer(address, typeof(MatrixMultiplyDelegate));
        }
    }
}

With the sources above, the code runs without errors: the function pointer is obtained and the delegate is created.

The problem arises when I call the delegate: it executes (and I can even debug it), but when the function exits, the managed application throws a System.ExecutionEngineException (when it doesn't simply terminate without any exception).

The actual problem is the function implementation: it contains an asm block with SSE instructions; if I remove the asm block, the code works perfectly.

I suspect I am missing some register save/restore code, but I'm completely ignorant in this area.

The strange thing is that if I change the calling convention to __stdcall, the debug version "seems" to work, while the release version behaves as if the __cdecl calling convention were used.

(And while we're here, can you clarify whether the calling convention matters?)

OK, thanks to David Heffernan's comment I found out that the instructions causing the problem are the following:

 movups result[ 0], xmm4;
 movups result[16], xmm5;

The movups instruction moves 16 bytes into (possibly unaligned) memory.

The function is called by the following code:

 unsafe {
    float* prodFix = (float*)prod.MatrixBuffer.AlignedBuffer.ToPointer();
    float* m1Fix = (float*)m2.MatrixBuffer.AlignedBuffer.ToPointer();
    float* m2Fix = (float*)m1.MatrixBuffer.AlignedBuffer.ToPointer();

    if (Simd.Simd.MatrixMultiply == null) {
                    // ... unsafe C# code
    } else {
        Simd.Simd.MatrixMultiply(prodFix, m1Fix, m2Fix);
    }
}

Where MatrixBuffer is a class of mine; its member AlignedBuffer is allocated in the following way:

// Allocate unmanaged buffer
mUnmanagedBuffer = Marshal.AllocHGlobal(new IntPtr((long)(size + alignment - 1)));

// Align buffer pointer
long misalignment = mUnmanagedBuffer.ToInt64() % alignment;
if (misalignment != 0)
    mAlignedBuffer = new IntPtr(mUnmanagedBuffer.ToInt64() + misalignment);
else
    mAlignedBuffer = mUnmanagedBuffer;

Maybe the error is caused by Marshal.AllocHGlobal or IntPtr black magic?

This is the minimal source that reproduces the error:

void Matrix4x4_Multiply_SSE(float *result, float *left, float *right)
{
    __asm {
        movups xmm0,    right[ 0];

        movups result, xmm0;
    }
}


int main(int argc, char *argv[])
{
    float r0[16];
    float m1[16], m2[16];

    m1[ 0] = 1.0f; m1[ 4] = 0.0f; m1[ 8] = 0.0f; m1[12] = 0.0f;
    m1[ 1] = 0.0f; m1[ 5] = 1.0f; m1[ 9] = 0.0f; m1[13] = 0.0f;
    m1[ 2] = 0.0f; m1[ 6] = 0.0f; m1[10] = 1.0f; m1[14] = 0.0f;
    m1[ 3] = 0.0f; m1[ 7] = 0.0f; m1[11] = 0.0f; m1[15] = 1.0f;

    m2[ 0] = 1.0f; m2[ 4] = 0.0f; m2[ 8] = 0.0f; m2[12] = 0.0f;
    m2[ 1] = 0.0f; m2[ 5] = 1.0f; m2[ 9] = 0.0f; m2[13] = 0.0f;
    m2[ 2] = 0.0f; m2[ 6] = 0.0f; m2[10] = 1.0f; m2[14] = 0.0f;
    m2[ 3] = 0.0f; m2[ 7] = 0.0f; m2[11] = 0.0f; m2[15] = 1.0f;

    r0[ 0] = 0.0f; r0[ 4] = 0.0f; r0[ 8] = 0.0f; r0[12] = 0.0f;
    r0[ 1] = 0.0f; r0[ 5] = 0.0f; r0[ 9] = 0.0f; r0[13] = 0.0f;
    r0[ 2] = 0.0f; r0[ 6] = 0.0f; r0[10] = 0.0f; r0[14] = 0.0f;
    r0[ 3] = 0.0f; r0[ 7] = 0.0f; r0[11] = 0.0f; r0[15] = 0.0f;

    Matrix4x4_Multiply_SSE(r0, m1, m2);
    Matrix4x4_Multiply_SSE(r0, m1, m2);

    return (0);
}

In practice, after the second movups, the value of result (which is stored on the stack) changes, and the contents of xmm0 are written to the modified (and wrong) address now held in result.

After stepping out of Matrix4x4_Multiply_SSE, the original memory isn't modified.

What am I missing?

Answer 1:

Your assembly was flawed. There is a difference between the following stores:

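void DoSomething(int *x)
{
    __asm
    {
        mov x[0], 10              // wrong: writes to the parameter's own stack slot
        mov [x], 10               // also wrong: same location as above
        mov esi, x                // first load the address held in x
        mov dword ptr [esi], 500  // then store through it - correct
    }
}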

The first two examples do not write to the memory location pointed to by the pointer, but to the storage location of the pointer itself. Since the parameter lives on the stack, your movups instruction overwrote your stack. You can see this in the debugger window when you call, for example:

int x=0;
DoSomething(&x);

With mov [x],10 you do not set x to 10; you write into your stack.
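
The same distinction can be sketched in plain C, without inline assembly. This is only an analogy under stated assumptions: the function names are hypothetical, and the second function merely reassigns the parameter, whereas the original movups wrote 16 bytes over a 4-byte stack slot and so also clobbered whatever was stored next to it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Correct: dereference the pointer and write into the caller's buffer.
   This corresponds to "mov esi, x" followed by a store through [esi]. */
void write_to_pointee(float *result, const float *src)
{
    assert(result != NULL && src != NULL);
    memcpy(result, src, 4 * sizeof(float));
}

/* Analogue of the bug: only the function's local copy of the pointer
   is changed. The buffer the caller passed in is never touched. */
void write_to_pointer_variable(float *result, float *src)
{
    result = src;   /* overwrites the parameter's stack slot only */
    (void)result;   /* silence "set but not used" warnings */
}
```

Calling write_to_pointer_variable(dst, src) leaves dst unchanged in the caller, which matches the observation in the question that the original memory was not modified after the call.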

Answer 2:

The alignment correction is wrong. You need to add alignment-misalignment to correct the alignment. So the code should read:

mAlignedBuffer = 
    new IntPtr(mUnmanagedBuffer.ToInt64() + alignment - misalignment);

However, I would recommend that you test the function in a native setting first. Once you know it works there you can move to the managed setting and know that any problems are due to the managed code.
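
As a way to test the arithmetic natively, here is a minimal C sketch of the corrected rounding. The function name align_up is an assumption made for illustration; it is not part of the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of alignment.
   alignment must be non-zero; an already aligned address is returned as-is. */
uintptr_t align_up(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0);
    uintptr_t misalignment = addr % alignment;
    if (misalignment == 0)
        return addr;
    /* Add what is missing to reach the next boundary,
       not the misalignment itself. */
    return addr + (alignment - misalignment);
}
```

For example, with addr = 0x1004 and alignment = 16 the misalignment is 4. Adding the misalignment gives 0x1008, which is still unaligned, while adding alignment - misalignment gives 0x1010. The result is never more than alignment - 1 bytes past addr, which is why the buffer is allocated with size + alignment - 1 bytes.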

Answer 3:

I found a solution: load the pointer value into a CPU register, and use the register to address memory:

mov esi, result;
movups [esi][ 0], xmm0;

Using those instructions makes the code work as expected.

But the question is not completely solved, since the movups instruction can take a memory address as its first argument; so if someone knows what's going on, I'll be pleased to accept the best answer.
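
For comparison, the same unaligned load and store can be sketched with SSE intrinsics instead of inline assembly. This is a different technique from the one used above, not the original code: the function name copy_row_unaligned is hypothetical, and the memcpy fallback exists only so the sketch still compiles on targets without SSE. With intrinsics the pointer is passed as an ordinary C argument, so the compiler takes care of loading it into a register before the store.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#endif

/* Copy four floats from right to result; neither pointer needs
   to be 16-byte aligned. */
void copy_row_unaligned(float *result, const float *right)
{
    assert(result != NULL && right != NULL);
#ifdef HAVE_SSE
    __m128 v = _mm_loadu_ps(right);   /* unaligned 16-byte load  */
    _mm_storeu_ps(result, v);         /* unaligned 16-byte store */
#else
    memcpy(result, right, 4 * sizeof(float));  /* fallback without SSE */
#endif
}
```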