I have this template class:
template<size_t D>
struct A{
double v_sse __attribute__ ((vector_size (8*D)));
A(double val){
//what here?
}
};
What's the best way to fill the v_sse field with copies of val? Since I use vectors, I can use gcc SSE2 intrinsics.
It would be nice if we could write code once, and compile it for wider vectors with just a small tweak, even in cases where auto-vectorization doesn't do the trick.
I got the same result as @hirschhornsalz: massive, inefficient code when instantiating this with vectors bigger than the HW-supported vector size. For example, constructing A<8> without AVX512 produces a boatload of 64-bit mov and vmovsd instructions. It does one broadcast to a local on the stack, then reads back all of those values separately and writes them to the caller's struct-return buffer.
For x86, we can get gcc to emit optimal broadcasts for a function that takes a double arg (in xmm0) and returns a vector (in x/y/zmm0), per the standard calling conventions:
- SSE2: unpcklpd xmm0, xmm0
- SSE3: movddup xmm0, xmm0
- AVX: vmovddup xmm0, xmm0 / vinsertf128 ymm0, ymm0, xmm0, 1 (AVX1 only includes the vbroadcastsd ymm, m64 form, which would presumably get used if inlined at a call site on data in memory)
- AVX2: vbroadcastsd ymm0, xmm0
- AVX512: vbroadcastsd zmm0, xmm0

Note that AVX512 can also broadcast from memory on the fly, as the source operand of another instruction:
VADDPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
{k1}{z} means it can use a mask register as a merge or zero mask into the result. m64bcst means a 64-bit value in memory to be broadcast. {er} means the MXCSR rounding mode can be overridden for this one instruction. IDK if gcc will use this broadcast addressing mode to fold broadcast-loads into memory operands.
However, gcc also understands shuffles, and has __builtin_shuffle for arbitrary vector sizes. With a compile-time constant mask of all zeros, the shuffle becomes a broadcast, which gcc implements using the best instruction for the job.
#include <stdint.h>  // for int64_t
typedef int64_t v4di __attribute__ ((vector_size (32)));
typedef double v4df __attribute__ ((vector_size (32)));
v4df vecinit4(double v) {
v4df v_sse;
typeof (v_sse) v_low = {v};
v4di shufmask = {0};
v_sse = __builtin_shuffle (v_low, shufmask );
return v_sse;
}
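To make the mask semantics concrete, here is a minimal sketch using 16-byte vectors. It is GCC-specific, since __builtin_shuffle is a GCC builtin, and the function names swap_lanes and broadcast_lane0 are made up for illustration. Lane i of the result is src[mask[i]], so an all-zero mask copies lane 0 into every lane:

```cpp
#include <stdint.h>

typedef double  v2df __attribute__ ((vector_size (16)));
typedef int64_t v2di __attribute__ ((vector_size (16)));

// Hypothetical helper: result[0] = src[1], result[1] = src[0].
v2df swap_lanes(v2df src) {
    v2di mask = {1, 0};
    return __builtin_shuffle(src, mask);
}

// Hypothetical helper: every result lane selects src[0], i.e. a broadcast.
v2df broadcast_lane0(v2df src) {
    v2di mask = {0, 0};
    return __builtin_shuffle(src, mask);
}
```

The mask must be an integer vector with the same element count and element width as the data vector, which is why the double vector is paired with an int64_t mask.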
In template functions, gcc 4.9.2 appears to have a problem recognizing that both vectors are the same width and number of elements, and that the mask is an int vector. It errors even without instantiating the template, so maybe that's why it has a problem with the types. Everything works perfectly if I copy the class and un-template it to a specific vector size.
template<int D> struct A{
typedef double dvec __attribute__ ((vector_size (8*D)));
typedef int64_t ivec __attribute__ ((vector_size (8*D)));
dvec v_sse; // typeof(v_sse) is buggy without this typedef, in a template class
A(double v) {
#ifdef SHUFFLE_BROADCAST // broken on gcc 4.9.2
typeof(v_sse) v_low = {v};
//int64_t __attribute__ ((vector_size (8*D))) shufmask = {0};
ivec shufmask = {0, 0};
v_sse = __builtin_shuffle (v_low, shufmask); // no idea why this doesn't compile
#else
typeof (v_sse) zero = {0, 0};
v_sse = zero + v; // doesn't optimize away without -ffast-math
#endif
}
};
/* doesn't work:
double vec2val __attribute__ ((vector_size (16))) = {v, v};
double vec4val __attribute__ ((vector_size (32))) = {v, v, v, v};
v_sse = __builtin_choose_expr (D == 2, vec2val, vec4val);
*/
I managed to get gcc to internal-compiler-error when compiling with -O0. Vectors + templates appear to need some work. (At least, they did back in gcc 4.9.2, which Ubuntu is currently shipping. Upstream may have improved.)
The first idea I had, which I left in as a fallback because the shuffle version doesn't compile, is that gcc implicitly broadcasts when you use an operator with a vector and a scalar. So, for example, adding a scalar to a vector of all zeros will do the trick.
The problem is that the actual add won't be optimized away unless you use -ffast-math. -funsafe-math-optimizations is unfortunately required, not just -fno-signaling-nans. I tried alternatives to + that can't cause FPU exceptions, such as ^ (xor) and | (or), but gcc won't do those on doubles. The , operator doesn't produce a vector result for scalar , vector.
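One plausible reason the add can't simply be dropped under strict IEEE rules is signed zero: -0.0 + 0.0 is +0.0 in the default rounding mode, so zero + v is not an exact copy of v for every input. A small sketch to illustrate this, assuming you compile without -ffast-math (the helper names are hypothetical, not part of the class above):

```cpp
#include <cmath>

typedef double v2df __attribute__ ((vector_size (16)));

// Broadcast via "zero + v", like the fallback constructor.
v2df broadcast_via_add(double v) {
    v2df zero = {0.0, 0.0};
    return zero + v;   // the scalar is implicitly broadcast, then added
}

// Broadcast via a plain initializer list.
v2df broadcast_via_init(double v) {
    v2df r = {v, v};
    return r;
}

// True when the add-based broadcast turns -0.0 into +0.0
// while the initializer-list broadcast keeps the sign bit.
bool sign_lost_by_add() {
    v2df a = broadcast_via_add(-0.0);
    v2df b = broadcast_via_init(-0.0);
    return !std::signbit(a[0]) && std::signbit(b[0]);
}
```

For every input other than -0.0 (and signaling NaNs) the two versions produce the same lanes, so the difference only matters to the optimizer, not to most callers.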
This can be worked around by specializing the template with straightforward initializer lists. If you can't get a good generic constructor to work, I suggest declaring it but leaving out the definition, so the build fails (with an undefined reference at link time) when there isn't a specialization.
#ifndef NO_BROADCAST_SPECIALIZE
// specialized versions with initializer lists to work efficiently even without -ffast-math
// inline keyword prevents an actual definition from being emitted.
template<> inline A<2>::A (double v) {
typeof (v_sse) val = {v, v};
v_sse = val;
}
template<> inline A<4>::A (double v) {
typeof (v_sse) val = {v, v, v, v};
v_sse = val;
}
template<> inline A<8>::A (double v) {
typeof (v_sse) val = {v, v, v, v, v, v, v, v};
v_sse = val;
}
template<> inline A<16>::A (double v) { // AVX1024 or something may exist someday
typeof (v_sse) val = {v, v, v, v, v, v, v, v, v, v, v, v, v, v, v, v};
v_sse = val;
}
#endif
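A constructor that is declared but never defined typically only shows up as an undefined reference at link time. If you would rather have the compiler reject an unsupported width, one option is a static_assert in the generic body. This is a sketch, assuming C++11 or later. The struct name A_checked is hypothetical, chosen so it doesn't clash with A above:

```cpp
template<int D> struct A_checked {
    typedef double dvec __attribute__ ((vector_size (8*D)));
    dvec v_sse;
    A_checked(double) {
        // The condition depends on D, so it is only checked when this
        // generic body is actually instantiated for some width.
        static_assert(D < 0, "no broadcast specialization for this vector width");
    }
};

// Explicit specializations replace the generic body for supported widths.
template<> inline A_checked<2>::A_checked(double v) {
    dvec val = {v, v};
    v_sse = val;
}
template<> inline A_checked<4>::A_checked(double v) {
    dvec val = {v, v, v, v};
    v_sse = val;
}
```

With this, A_checked<2>(x) and A_checked<4>(x) compile as before, while constructing something like A_checked<8> should stop with the static_assert message instead of a linker error.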
Now, to test the results:
// vecinit4 (from above) included in the asm output too.
// instantiate the templates
A<2> broadcast2(double val) { return A<2>(val); }
A<4> broadcast4(double val) { return A<4>(val); }
A<8> broadcast8(double val) { return A<8>(val); }
Compiler output (assembler directives stripped out):
g++ -DNO_BROADCAST_SPECIALIZE -O3 -Wall -mavx512f -march=native vec-gcc.cc -S -masm=intel -o-
_Z8vecinit4d:
vbroadcastsd ymm0, xmm0
ret
_Z10broadcast2d:
vmovddup xmm1, xmm0
vxorpd xmm0, xmm0, xmm0
vaddpd xmm0, xmm1, xmm0
ret
_Z10broadcast4d:
vbroadcastsd ymm1, xmm0
vxorpd xmm0, xmm0, xmm0
vaddpd ymm0, ymm1, ymm0
ret
_Z10broadcast8d:
vbroadcastsd zmm0, xmm0
vpxorq zmm1, zmm1, zmm1
vaddpd zmm0, zmm0, zmm1
ret
g++ -O3 -Wall -mavx512f -march=native vec-gcc.cc -S -masm=intel -o-
# or g++ -ffast-math -DNO_BROADCAST_SPECIALIZE blah blah.
_Z8vecinit4d:
vbroadcastsd ymm0, xmm0
ret
_Z10broadcast2d:
vmovddup xmm0, xmm0
ret
_Z10broadcast4d:
vbroadcastsd ymm0, xmm0
ret
_Z10broadcast8d:
vbroadcastsd zmm0, xmm0
ret
Note that the shuffle method should work fine if you don't template this, but instead only use one vector size in your code. So changing from SSE to AVX is as easy as changing 16 to 32 in one place. But then you'd need to compile the same file multiple times to generate an SSE version and an AVX version which you could dispatch to at runtime. (You might need that anyway, though, to have a 128bit SSE version that didn't use VEX instruction encoding.)
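A minimal sketch of that non-template approach, with the vector width controlled from one place. The macro VEC_BYTES, the typedef names, and the function name broadcast are assumptions for illustration. It relies on GCC's __builtin_shuffle, so it is GCC-specific:

```cpp
#include <stdint.h>

// Single knob (hypothetical name): 16 for SSE2, 32 for AVX, 64 for AVX512.
// Can also be set on the command line, e.g. -DVEC_BYTES=32.
#ifndef VEC_BYTES
#define VEC_BYTES 16
#endif

typedef double  dvec __attribute__ ((vector_size (VEC_BYTES)));
typedef int64_t ivec __attribute__ ((vector_size (VEC_BYTES)));

// Number of doubles per vector, derived from the knob above.
enum { VEC_ELEMS = VEC_BYTES / sizeof(double) };

dvec broadcast(double v) {
    dvec low  = {v};   // lane 0 = v, remaining lanes zero-initialized
    ivec mask = {0};   // every lane selects element 0
    return __builtin_shuffle(low, mask);
}
```

Building the same file once per width (with matching -m options) would give the separate SSE and AVX versions to dispatch between at runtime.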