I have borrowed a matrix inversion algorithm from the Intel website: http://download.intel.com/design/PentiumIII/sml/24504301.pdf
It uses _mm_loadh_pi and _mm_loadl_pi to load the 4x4 matrix coefficients and do a partial shuffle at the same time. The performance improvement in my app is significant, and if I do a classic load/shuffle of the matrix using _mm_load_ps, it's slightly slower.
But this load approach triggers a compilation warning: "tmp1 is used uninitialized in this function"
__m128 tmp1;
tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src+ 4));
Which makes sense in a way, since tmp1 is an input parameter of _mm_loadl_pi, and affects the result.
However, looking in detail at what the code does shows that tmp1 needs no initialization: the two loads together overwrite all four lanes. And initialization slightly slows down the code (it's measurable).
Do you have any idea on how to remove the warning, if possible in a portable way, without having to initialize tmp1?
This is what _mm_undefined_ps is for (but it only actually helps code-gen with Intel's compiler. Other compilers usually treat it similarly to _mm_setzero_ps). And besides that, you want a movsd load of two floats that zero-extends and breaks the false dependency on the old value of a register, not a movlps that merges. (Unless you're building for a crusty old 32-bit CPU that has SSE1 but not SSE2, like the PIII your code was originally written for.)

Cast to double * and use _mm_load_sd. You're not dereferencing it yourself, only via _mm_load_sd, so I think this is still 100% strict-aliasing safe. It works in practice on current compilers, though! If it turns out to be unsafe, _mm_loadl_epi64 (movq) takes a __m128i const* arg (weird because it only loads the low 64 bits, but it's a may_alias type you can definitely use safely to read any other type, like char*.)

gcc7 and later use movq instead of movsd, which is weird but I think it's fine. At worst an extra cycle of bypass-delay latency as an input to movhps on some old CPUs, but not a throughput penalty.

The other 3 major compilers (clang/ICC/MSVC) all compile this to the expected movsd / movhps with no false dependency on the old value of xmm0. (source+asm output on the Godbolt compiler explorer.)

I tried 3 compilers: MS Visual Studio 2012, gcc 4.8.1, and Intel icl 13.1. They all warn as you point out. I found that both gcc and MS automatically generate initialization code for tmp1, even as they warn about lack of initialization. The MS compiler generates an undesirable memory access:
movaps xmm0,xmmword ptr [rsp]
Gcc generates a more efficient xorps xmm0,xmm0. So in the case of gcc, adding tmp1=_mm_setzero_ps() eliminates the warning and produces exactly the same code as without. In the case of MS, adding tmp1=_mm_setzero_ps() makes the code shorter and probably faster. Only the Intel compiler is smart enough to avoid the unneeded initialization. Here is a possible workaround for MS and gcc compilers:
Code generation is:
It looks shorter, but should be benchmarked to make sure it is faster.
09/12/2013: test code for different warning suppression ideas:
build commands: