My problem is that the compiler chooses not to inline a function in one specific case, which makes the code a lot slower. The function computes the dot product of two vectors (SIMD accelerated). I have written it in two different styles:
- The Vector class aggregates a __m128 member.
- The Vector is just a typedef for __m128.
In case 1 the function doesn't get inlined and the code is about 2 times slower. In case 2 the function is inlined and I get optimal, very fast code.
In case 1 the Vector and the Dot functions look like this:
__declspec(align(16)) class Vector
{
public:
    __m128 Entry;
    Vector(__m128 s)
    {
        Entry = s;
    }
};

Vector Vector4Dot(Vector v1, Vector v2)
{
    return(Vector(_mm_dp_ps(v1.Entry, v2.Entry, DotMask4)));
}
In case 2 the Vector and the Dot functions look like this:
typedef __m128 Vector;

Vector Vector4Dot(Vector v1, Vector v2)
{
    return(_mm_dp_ps(v1, v2, DotMask4));
}
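For anyone who wants to reproduce this, here is a minimal sketch that puts both definitions side by side in one file. Several things in it are assumptions and not part of my real code: the namespaces case1 and case2, the helpers DotViaClass and DotViaTypedef, the SSE41_FN macro (only there so GCC or Clang can build it without -msse4.1), the use of alignas(16) as a portable spelling of __declspec(align(16)), and DotMask4 being 0xFF (taken from the immediate 255 in the dpps instruction in the listings).

```cpp
#include <cassert>
#include <smmintrin.h>  // SSE4.1 intrinsics, including _mm_dp_ps

// Assumption: GCC/Clang need SSE4.1 enabled per function unless -msse4.1
// is passed on the command line; MSVC needs no annotation.
#if defined(__GNUC__) || defined(__clang__)
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif

// Assumption: 0xFF multiplies all four lanes and broadcasts the sum.
const int DotMask4 = 0xFF;

namespace case1 {
    // Style 1: a class that aggregates a single __m128 member.
    class alignas(16) Vector {
    public:
        __m128 Entry;
        Vector(__m128 s) : Entry(s) {}
    };

    SSE41_FN inline Vector Vector4Dot(Vector v1, Vector v2) {
        return Vector(_mm_dp_ps(v1.Entry, v2.Entry, DotMask4));
    }
}

namespace case2 {
    // Style 2: Vector is only another name for __m128.
    typedef __m128 Vector;

    SSE41_FN inline Vector Vector4Dot(Vector v1, Vector v2) {
        return _mm_dp_ps(v1, v2, DotMask4);
    }
}

// Hypothetical helpers: run each style and return lane 0 of the result.
SSE41_FN inline float DotViaClass(__m128 a, __m128 b) {
    return _mm_cvtss_f32(
        case1::Vector4Dot(case1::Vector(a), case1::Vector(b)).Entry);
}

SSE41_FN inline float DotViaTypedef(__m128 a, __m128 b) {
    return _mm_cvtss_f32(case2::Vector4Dot(a, b));
}
```

Both helpers should return the same value for the same inputs, so any difference between the two styles is purely in the generated code, not in the result.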
I'm compiling with MSVC in Visual Studio 2012 for x86 in Release mode with all optimizations enabled (optimize for speed, whole program optimization, etc.). Whether I put all the code of case 1 in the header or combine that with __forceinline makes no difference: the function doesn't get inlined. Here is the generated ASM:
Case 1:
movaps xmm0, XMMWORD PTR [esi]
lea eax, DWORD PTR $T1[esp+32]
movaps xmm1, xmm0
push eax
call ?Vector4Dot@Framework@@SA?AVVector@23@T__m128@@0@Z ; Framework::Vector4Dot
movaps xmm0, XMMWORD PTR [eax]
add esp, 4
movaps XMMWORD PTR [esi], xmm0
lea esi, DWORD PTR [esi+16]
dec edi
jne SHORT $LL3@Test89
That is the call site of Vector4Dot. Here is the body of the function itself:
mov eax, DWORD PTR _v2$[esp-4]
movaps xmm0, XMMWORD PTR [edx]
dpps xmm0, XMMWORD PTR [eax], 255 ; 000000ffH
movaps XMMWORD PTR [ecx], xmm0
mov eax, ecx
For case 2 I just get:
movaps xmm0, XMMWORD PTR [eax]
dpps xmm0, xmm0, 255 ; 000000ffH
movaps XMMWORD PTR [eax], xmm0
lea eax, DWORD PTR [eax+16]
dec ecx
jne SHORT $LL3@Test78
Which is a lot faster. I'm not sure why the compiler can't deal with that constructor. If I change case 1 like this:
__m128 Vector4Dot(Vector v1, Vector v2)
{
    return(_mm_dp_ps(v1.Entry, v2.Entry, DotMask4));
}
It compiles at maximum speed, the same as case 2. So it's this "class overhead" that is giving me the performance penalty. Is there any way to get around this? Or am I stuck with using raw __m128's instead of the Vector class?
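To make the fast variant concrete, here is a minimal self-contained sketch of it inside a loop shaped like the one in the listings (load a vector, dot it with itself, store it back). The loop function DotEachWithSelf and the SSE41_FN macro are hypothetical names, alignas(16) stands in for __declspec(align(16)), and DotMask4 is assumed to be 0xFF as in the dpps immediate. The only difference from case 1 is that Vector4Dot returns a raw __m128 and the result is wrapped in a Vector at the call site.

```cpp
#include <cassert>
#include <cstddef>
#include <smmintrin.h>  // SSE4.1 intrinsics, including _mm_dp_ps

// Assumption: lets GCC/Clang build this without -msse4.1; empty on MSVC.
#if defined(__GNUC__) || defined(__clang__)
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif

// Assumption: 0xFF, matching the immediate 255 in the dpps listing.
const int DotMask4 = 0xFF;

class alignas(16) Vector {
public:
    __m128 Entry;
    Vector(__m128 s) : Entry(s) {}
};

// Class-typed parameters, but a raw __m128 return value.
SSE41_FN inline __m128 Vector4Dot(Vector v1, Vector v2) {
    return _mm_dp_ps(v1.Entry, v2.Entry, DotMask4);
}

// Hypothetical test loop: replace every vector with its dot product
// with itself, constructing the Vector only at the call site.
SSE41_FN inline void DotEachWithSelf(Vector* data, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        data[i] = Vector(Vector4Dot(data[i], data[i]));
    }
}
```

This keeps the Vector class for storage, but the function signature has to expose __m128, which is what I would like to avoid.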