Fastest way to fill a vector (SSE2) with a certain value

Posted 2020-03-05 06:35

Question:

I have this template class:

template<size_t D>
struct A{
    double v_sse __attribute__ ((vector_size (8*D)));
    A(double val){
        //what here?
    }
};

What's the best way to fill the v_sse field with copies of val? Since I use vectors, I can use gcc SSE2 intrinsics.

Answer 1:

It would be nice if we could write code once, and compile it for wider vectors with just a small tweak, even in cases where auto-vectorization doesn't do the trick.

I got the same result as @hirschhornsalz: massive, inefficient code when instantiating this with vectors bigger than the HW-supported vector size. e.g. constructing A<8> without AVX512 produces a boatload of 64-bit mov and vmovsd instructions. It does one broadcast to a local on the stack, then reads all of those values back separately and writes them to the caller's struct-return buffer.

For x86, we can get gcc to emit optimal broadcasts for a function that takes a double arg (in xmm0) and returns a vector (in x/y/zmm0), per the standard calling conventions (see the intrinsics sketch after this list):

  • SSE2: unpcklpd xmm0, xmm0
  • SSE3: movddup xmm0, xmm0
  • AVX: vmovddup xmm0, xmm0 / vinsertf128 ymm0, ymm0, xmm0, 1
    (AVX1 only has the vbroadcastsd ymm, m64 form, which would presumably get used if this inlined at a call site where the value is in memory)
  • AVX2: vbroadcastsd ymm0, xmm0
  • AVX512: vbroadcastsd zmm0, xmm0. (Note that AVX512 can broadcast from mem on the fly:
    VADDPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
    {k1}{z} means it can use a mask register as a merge or zero mask into the result.
    m64bcst means a 64-bit memory source operand that gets broadcast to every vector element.
    {er} means the MXCSR rounding mode can be overridden for this one instruction.
    IDK if gcc will use this broadcast addressing mode to fold broadcast-loads into memory operands.
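
For comparison, the fixed-width Intel intrinsics produce the same broadcasts without gcc's generic vector syntax. A minimal sketch (the wrapper names are mine, not part of the question):

#include <immintrin.h>

// SSE2/SSE3: compiles to unpcklpd or movddup
__m128d bcast128(double v) { return _mm_set1_pd(v); }

#ifdef __AVX__
// AVX1: vmovddup + vinsertf128; AVX2: vbroadcastsd ymm0, xmm0
__m256d bcast256(double v) { return _mm256_set1_pd(v); }
#endif

#ifdef __AVX512F__
// AVX512F: vbroadcastsd zmm0, xmm0
__m512d bcast512(double v) { return _mm512_set1_pd(v); }
#endif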

However, gcc also understands shuffles, and has __builtin_shuffle for arbitrary vector sizes. With a compile-time constant mask of all-zeros, the shuffle becomes a broadcast, which gcc does using the best instruction for the job.

#include <cstdint>

typedef int64_t v4di __attribute__ ((vector_size (32)));
typedef double  v4df __attribute__ ((vector_size (32)));

v4df vecinit4(double v) {
    v4df v_sse;
    typeof (v_sse) v_low = {v};   // v in element 0, rest zero-initialized
    v4di shufmask = {0};          // all-zero mask: broadcast element 0
    v_sse = __builtin_shuffle (v_low, shufmask);
    return v_sse;
}

Inside a template, though, gcc 4.9.2 appears to have trouble recognizing that both vectors have the same width and element count, and that the mask is an integer vector. It errors out even when the template is never instantiated, so the dependent vector types are presumably what trips it up. Everything works perfectly if I copy the class and un-template it to one specific vector size.

template<int D> struct A{
    typedef double  dvec __attribute__ ((vector_size (8*D)));
    typedef int64_t ivec __attribute__ ((vector_size (8*D)));
    dvec v_sse;  // typeof(v_sse) is buggy without this typedef, in a template class
    A(double v) {
#ifdef SHUFFLE_BROADCAST  // broken on gcc 4.9.2
    typeof(v_sse)  v_low = {v};
    //int64_t __attribute__ ((vector_size (8*D))) shufmask = {0};
    ivec shufmask = {0, 0};
    v_sse = __builtin_shuffle (v_low, shufmask);  // no idea why this doesn't compile
#else
    typeof (v_sse) zero = {0, 0};
    v_sse = zero + v;  // doesn't optimize away without -ffast-math
#endif
    }
};

/*  doesn't work:
double vec2val  __attribute__ ((vector_size (16))) = {v, v};
double vec4val  __attribute__ ((vector_size (32))) = {v, v, v, v};
v_sse = __builtin_choose_expr (D == 2, vec2val, vec4val);
*/

I also managed to get gcc to hit an internal compiler error when compiling with -O0. Vectors + templates appear to need some work. (At least they did back in gcc 4.9.2, which Ubuntu is currently shipping. Upstream may have improved.)

The first idea I had, which I left in as a fallback because shuffle doesn't compile, is that gcc implicitly broadcasts when you use an operator with a vector and a scalar. So for example, adding a scalar to a vector of all-zeroes will do the trick.

The problem is that the actual add won't be optimized away unless you use -ffast-math. (Adding +0.0 is not an identity operation under IEEE semantics: it turns -0.0 into +0.0, so gcc can't legally drop it.) -funsafe-math-optimizations is unfortunately required, not just -fno-signaling-nans. I tried alternatives to + that can't cause FPU exceptions, such as ^ (xor) and | (or), but gcc won't do those on doubles. The comma operator (scalar, vector) doesn't produce a vector result either.
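
A minimal demo of this implicit broadcast (my own example, not from the original code):

typedef double v2df __attribute__ ((vector_size (16)));

v2df add_bcast(double v) {
    v2df zero = {0.0, 0.0};
    return zero + v;      // gcc broadcasts v across the vector; the add only
                          // optimizes away with -ffast-math
    // return zero | v;   // error: gcc rejects bitwise ops on double vectors
}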

This can be worked around by specializing the template with straightforward initializer lists. If you can't get a good generic constructor to work, I suggest leaving out the definition so you get a compile error when there isn't a specialization.

#ifndef NO_BROADCAST_SPECIALIZE
// specialized versions with initializer lists to work efficiently even without -ffast-math
// the inline keyword keeps these header-friendly: no stand-alone definition
// is emitted, and no multiple-definition errors across translation units.
template<> inline A<2>::A (double v) {
    typeof (v_sse) val = {v, v};
    v_sse = val;
}
template<> inline A<4>::A (double v) {
    typeof (v_sse) val = {v, v, v, v};
    v_sse = val;
}
template<> inline A<8>::A (double v) {
    typeof (v_sse) val = {v, v, v, v, v, v, v, v};
    v_sse = val;
}
template<> inline A<16>::A (double v) { // AVX1024 or something may exist someday
    typeof (v_sse) val = {v, v, v, v, v, v, v, v, v, v, v, v, v, v, v, v};
    v_sse = val;
}
#endif

Now, to test the results:

// vecinit4 (from above) included in the asm output too.
// instantiate the templates
A<2> broadcast2(double val) { return A<2>(val); }
A<4> broadcast4(double val) { return A<4>(val); }
A<8> broadcast8(double val) { return A<8>(val); }

Compiler output (assembler directives stripped out):

g++ -DNO_BROADCAST_SPECIALIZE  -O3 -Wall -mavx512f -march=native vec-gcc.cc -S -masm=intel -o-

_Z8vecinit4d:
    vbroadcastsd    ymm0, xmm0
    ret
_Z10broadcast2d:
    vmovddup        xmm1, xmm0
    vxorpd  xmm0, xmm0, xmm0
    vaddpd  xmm0, xmm1, xmm0
    ret
_Z10broadcast4d:
    vbroadcastsd    ymm1, xmm0
    vxorpd  xmm0, xmm0, xmm0
    vaddpd  ymm0, ymm1, ymm0
    ret
_Z10broadcast8d:
    vbroadcastsd    zmm0, xmm0
    vpxorq  zmm1, zmm1, zmm1
    vaddpd  zmm0, zmm0, zmm1
    ret


g++ -O3 -Wall -mavx512f -march=native vec-gcc.cc -S -masm=intel -o-
# or   g++ -ffast-math -DNO_BROADCAST_SPECIALIZE blah blah.

_Z8vecinit4d:
    vbroadcastsd    ymm0, xmm0
    ret
_Z10broadcast2d:
    vmovddup        xmm0, xmm0
    ret
_Z10broadcast4d:
    vbroadcastsd    ymm0, xmm0
    ret
_Z10broadcast8d:
    vbroadcastsd    zmm0, xmm0
    ret

Note that the shuffle method works fine if you don't template this and instead use only one vector size in your code: changing from SSE to AVX is then as easy as changing 16 to 32 in one place. But you'd need to compile the same file multiple times to generate an SSE version and an AVX version that you could dispatch to at runtime. (You might need that anyway, to have a 128-bit SSE version that doesn't use the VEX instruction encoding.) A sketch of that approach follows.
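
Here's a sketch of that one-vector-size-per-file idea (VEC_BYTES is a hypothetical macro of my own, set per compilation, e.g. g++ -DVEC_BYTES=16 -msse2 vs. g++ -DVEC_BYTES=32 -mavx):

#include <cstdint>

#ifndef VEC_BYTES
#define VEC_BYTES 16   // bytes per vector: 16 = SSE, 32 = AVX, 64 = AVX512
#endif

typedef double  dvec __attribute__ ((vector_size (VEC_BYTES)));
typedef int64_t ivec __attribute__ ((vector_size (VEC_BYTES)));

dvec broadcast(double v) {
    dvec v_low = {v};       // v in element 0, rest zeroed
    ivec shufmask = {0};    // all-zero mask: broadcast element 0
    return __builtin_shuffle(v_low, shufmask);
}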