SSE half loads (_mm_loadh_pi / _mm_loadl_pi) issue

2019-02-27 20:53发布

I have borrowed a matrix inversion algorithm from Intel website: http://download.intel.com/design/PentiumIII/sml/24504301.pdf

It uses _mm_loadh_pi and _mm_loadl_pi to load the 4x4 matrix coefficients and do a partial shuffling at the same time. The performance improvement in my app is significant, and if I do a classic load/shuffle of the matrix using _mm_load_ps, it's slightly slower.

But this load approach issues compilation warnings : "tmp1 is used uninitialized in this function"

__m128 tmp1;
tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src+ 4));

Which makes sense in a way, since tmp1 is an input parameter of _mm_loadl_pi, and affects the result.

However, looking in details to what the code does shows that tmp1 needs no initialization. And initialization slightly slows down the code (it's measurable).

Do you have any idea on how to remove the warning, if possible in a portable way, without having to initialize tmp1?

2条回答
甜甜的少女心
2楼-- · 2019-02-27 21:00

This is what _mm_undefined_ps is for (but it only actually helps code-gen with Intel's compiler. Other compilers usually treat it similarly to _mm_setzero_ps).

And besides that, you want a movsd load of two floats that zero-extends and breaks the false dependency on the old value of a register, not a movlps that merges. (Unless you're building for a crusty old 32-bit CPU that has SSE1 but not SSE2, like the PIII your code was originally written for.)

Cast to double * and use _mm_load_sd. You're not dereferencing it yourself, only via _mm_load_sd, so I think this is still 100% strict-aliasing safe. It works in practice on current compilers, though! If it turns out to be unsafe, _mm_loadl_epi64 (movq) takes a __m128i const* arg (weird because it only loads the low 64 bits, but it's a may_alias type you can definitely use safely to read any other type, like char*.)

static inline
__m128 stride_gather(float *src) {
    __m128 tmp1 = _mm_castpd_ps(_mm_load_sd((const double*)src));  // movsd
    tmp1 = _mm_loadh_pi(tmp1, (const __m64*)(src+4));              // movhps
    return tmp1;
}

gcc7 and later use movq instead of movsd, which is weird but I think it's fine. At worst an extra cycle of bypass-delay latency as an input to movhps on some old CPUs, but not a throughput penalty.

The other 3 major compilers (clang/ICC/MSVC) all compile this to the expected movsd / movhps with no false dependency on the old value of xmm0. (source+asm output on the Godbolt compiler explorer.)

查看更多
forever°为你锁心
3楼-- · 2019-02-27 21:19

I tried 3 compilers: MS Visual Studio 2012, gcc481, and Intel icl 13.1. They all warn as you point out. I found that both gcc and MS automatically generate initialization code for tmp1, even as they warn about lack of initialization. The MS compiler generates an undesirable memory access: movaps xmm0,xmmword ptr [rsp]. Gcc generates a more efficient xorps xmm0,xmm0. So in the case of gcc, adding tmp1=_mm_setzero_ps() eliminates the warning and produces exactly the same code as without. In the case of MS, adding tmp1=_mm_setzero_ps() makes the code shorter and probably faster. Only the Intel compiler is smart enough to avoid the unneeded initialization. Here is a possible workaround for MS and gcc compilers:

    __m128 tmp1 = _mm_loadh_pi(_mm_load_ps (src), (__m64*)(src + 4));

Code generation is:

movaps      xmm0,xmmword ptr [rcx]
movhps      xmm0,qword ptr [rcx+10h]

It looks shorter, but should be benchmarked to make sure it is faster.

09/12/2013: test code for different warning suppression ideas:

#include <xmmintrin.h>
#include <stdint.h>
#include <stdio.h>

//---------------------------------------------------------------------------
// original code from http://download.intel.com/design/PentiumIII/sml/24504301.pdf
__m128 func1 (float *src)
    {
    __m128 tmp1;
    tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src+ 4));
    return tmp1;
    }

//---------------------------------------------------------------------------
// original code plus tmp1 initialization
__m128 func2 (float *src)
    {
    __m128 tmp1 = _mm_loadh_pi(_mm_loadl_pi (_mm_setzero_ps (), (__m64*)(src)), (__m64*)(src + 4));
    return tmp1;
    }

//---------------------------------------------------------------------------
// use redundant load to eliminate warning 
__m128 func3 (float *src)
    {
    __m128 tmp1 = _mm_loadh_pi(_mm_load_ps (src), (__m64*)(src + 4));
    return tmp1;
    }

//---------------------------------------------------------------------------

static void dump (void *data)
    {
    float *f16 = data;
    int index;

    for (index = 0; index < 4; index++)
        printf ("%g ", f16 [index]);
    printf ("\n");
    }

//---------------------------------------------------------------------------

int main (void)
    {
    float f [8] = {1, 2, 3, 4, 5, 6, 7, 8};
    __m128 tmp;

    tmp = func1 (f);
    dump (&tmp);
    tmp = func2 (f);
    dump (&tmp);
    tmp = func3 (f);
    dump (&tmp);
    return 0;
    }

build commands:

gcc  -O3 -Wall -Wfatal-errors sample.c -osample.exe
objdump -Mintel --disassemble sample.exe > disasm.txt

cl -Ox -Zi -W4 sample.c
dumpbin -disasm -symbols sample.exe > disasm.txt

icl -Ox -Zi sample.c                                           
dumpbin -disasm -symbols sample.exe > disasm.txt                  
查看更多
登录 后发表回答