Strange compiler behavior when inlining(ASM code i

Posted 2019-09-13 21:34

Question:

My problem is that the compiler chooses not to inline a function in one specific case, which makes the code a lot slower. The function computes the dot product of two vectors (SIMD-accelerated). I have written it in two different styles:

  1. The Vector class aggregates a __m128 member.
  2. Vector is just a typedef of __m128.

In case 1 the code is about 2 times slower and the function is not inlined. In case 2 I get optimal code: very fast and inlined.

In case 1 the Vector and the Dot functions look like this:

__declspec(align(16)) class Vector
{
public:
    __m128 Entry;

    Vector(__m128 s)
    {
        Entry = s;
    }
};

Vector Vector4Dot(Vector v1, Vector v2)
{
    return(Vector(_mm_dp_ps(v1.Entry, v2.Entry, DotMask4)));
}

In case 2 the Vector and the Dot functions look like this:

typedef __m128 Vector;

Vector Vector4Dot(Vector v1, Vector v2)
{
    return(_mm_dp_ps(v1, v2, DotMask4));
}

I'm compiling with MSVC in Visual Studio 2012, targeting x86 in Release mode with all optimizations enabled (optimize for speed, whole program optimization, etc.). It makes no difference whether I put all of the case 1 code in the header or combine that with forceinline: the function still doesn't get inlined. Here is the generated ASM:

Case 1:

movaps  xmm0, XMMWORD PTR [esi]
lea eax, DWORD PTR $T1[esp+32]
movaps  xmm1, xmm0
push    eax
call    ?Vector4Dot@Framework@@SA?AVVector@23@T__m128@@0@Z ; Framework::Vector4Dot
movaps  xmm0, XMMWORD PTR [eax]
add esp, 4
movaps  XMMWORD PTR [esi], xmm0
lea esi, DWORD PTR [esi+16]
dec edi
jne SHORT $LL3@Test89

This is the code at the place where I call Vector4Dot. Here is the body of the function itself:

mov eax, DWORD PTR _v2$[esp-4]
movaps  xmm0, XMMWORD PTR [edx]
dpps    xmm0, XMMWORD PTR [eax], 255        ; 000000ffH
movaps  XMMWORD PTR [ecx], xmm0
mov eax, ecx

For case 2 I just get:

movaps  xmm0, XMMWORD PTR [eax]
dpps    xmm0, xmm0, 255             ; 000000ffH
movaps  XMMWORD PTR [eax], xmm0
lea eax, DWORD PTR [eax+16]
dec ecx
jne SHORT $LL3@Test78

This is a lot faster. I'm not sure why the compiler can't deal with that constructor. If I change case 1 like this:

__m128 Vector4Dot(Vector v1, Vector v2)
{
    return(_mm_dp_ps(v1.Entry, v2.Entry, DotMask4));
}

It compiles to code that is as fast as case 2. So it is this "class overhead" that causes the performance penalty. Is there any way to get around this, or am I stuck with using raw __m128 values instead of the Vector class?
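One direction suggested by the observation above is to keep the Vector class but have the dot function return a raw __m128, letting the converting constructor rebuild a Vector at the call site. The following is only a minimal sketch under stated assumptions, not a confirmed fix: the conversion operator, the const-reference parameters, and the helper names Dot4Raw and DotDemo are hypothetical additions, and _mm_dp_ps is replaced with an SSE1 multiply-and-shuffle sum so the example builds without SSE4.1 compiler flags. Whether MSVC 2012 actually inlines this form would still need to be checked in the generated assembly.

```cpp
#include <xmmintrin.h>  // SSE1 only: __m128, _mm_mul_ps, _mm_shuffle_ps, ...

// Hypothetical wrapper modeled on the case 1 class. The __m128 member
// already carries 16-byte alignment, so no explicit align attribute is used.
class Vector
{
public:
    __m128 Entry;

    Vector(__m128 s) : Entry(s) {}             // implicit: a raw __m128 converts to Vector
    operator __m128() const { return Entry; }  // assumption: not part of the original class
};

// Stand-in for _mm_dp_ps(a, b, 0xFF): the 4-component dot product
// broadcast to all four lanes, using only SSE1 instructions.
inline __m128 Dot4Raw(__m128 a, __m128 b)
{
    __m128 m = _mm_mul_ps(a, b);  // per-lane products
    // Swap neighbours within each pair and add: (m0+m1, m0+m1, m2+m3, m2+m3)
    __m128 s = _mm_add_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1)));
    // Swap the two halves and add: full sum in every lane
    return _mm_add_ps(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 0, 3, 2)));
}

// Returns the raw register type, as in the modified case 1 above;
// the arguments are taken by const reference instead of by value.
inline __m128 Vector4Dot(const Vector& v1, const Vector& v2)
{
    return Dot4Raw(v1.Entry, v2.Entry);
}

// Small usage example: (1,2,3,4) . (5,6,7,8)
inline float DotDemo()
{
    Vector a(_mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f));  // _mm_set_ps lists lanes high to low
    Vector b(_mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f));
    Vector d = Vector4Dot(a, b);                   // __m128 converts back to Vector here
    return _mm_cvtss_f32(d.Entry);                 // read lane 0
}
```

With this shape the calling code still works with Vector objects, while the function signature itself only returns the register type that the compiler handled well in the modified case 1.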