In section 2.5.3 "Broadcasts" of the Intel Architecture Instruction Set Extensions Programming Reference the we learn than
AVX512 (and Knights Corner) has
a bit-field to encode data broadcast for some load-op instructions, i.e. instructions that
load data from memory and perform some computational
or data movement operation.
For example using Intel assembly syntax we can broadcast the scalar at the address stored in rax
and then multiplying with the 16 floats in zmm2
and write the result to zmm1
like this
vmulps zmm1, zmm2, [rax] {1to16}
However, there are no intrinsics which can do this. Therefore, with intrinsics the compiler should be able to fold
__m512 bb = _mm512_set1_ps(b);
__m512 ab = _mm512_mul_ps(a,bb);
to a single instruction
vmulps zmm1, zmm2, [rax] {1to16}
but I have not observed GCC doing this. I found a GCC bug report about this.
I have observed something similar with FMA with GCC. e.g. GCC 4.9 will not collapse _mm256_add_ps(_mm256_mul_ps(areg0,breg0)
to a single fma instruction with -Ofast
. However, GCC 5.1 does collapse it to a single fma now. At least there are intrinsics to do this with FMA e.g. _mm256_fmadd_ps
. But there is no e.g. _mm512_mulbroad_ps(vector,scalar)
intrinsic.
GCC may fix this at some point but until then assembly is the only solution.
So my question is how to do this with inline assembly in GCC?
I think I may have come up with the correct syntax (but I am not sure) for GCC inline assembly for the example above.
"vmulps (%%rax)%{1to16}, %%zmm1, %%zmm2\n\t"
I am really looking for a function like this
static inline __m512 mul_broad(__m512 a, float b) {
return a*b;
}
where if b
is in memory point to in rax
it produces
vmulps (%rax){1to16}, %zmm0, %zmm0
ret
and if b
is in xmm1
it produces
vbroadcastss %xmm1, %zmm1
vmulps %zmm1, %zmm0, %zmm0
ret
GCC will already do the vbroadcastss
-from-register case with intrinsics, but if b
is in memory, compiles this to a vbroadcastss
from memory.
__m512 mul_broad(__m512 a, float b) {
__m512 bb = _mm512_set1_ps(b);
__m512 ab = _mm512_mul_ps(a,bb);
return ab;
}
clang will use a broadcast memory operand if b
is in memory.
As Peter Cordes notes GCC doesn't let you specify a different template for different constraint alternatives. So instead my solution has the assembler choose the correct instruction according to the operands chosen.
I don't have a version of GCC that supports the ZMM registers, so this following example uses XMM registers and a couple of nonexistent instructions to demonstrate how you can achieve what you're looking for.
typedef __attribute__((vector_size(16))) float v4sf;
v4sf
foo(v4sf a, float b) {
v4sf ret;
asm(".ifndef isxmm\n\t"
".altmacro\n\t"
".macro ifxmm operand, rnum\n\t"
".ifc \"\\operand\",\"%%xmm\\rnum\"\n\t"
".set isxmm, 1\n\t"
".endif\n\t"
".endm\n\t"
".endif\n\t"
".set isxmm, 0\n\t"
".set regnum, 0\n\t"
".rept 8\n\t"
"ifxmm <%2>, %%regnum\n\t"
".set regnum, regnum + 1\n\t"
".endr\n\t"
".if isxmm\n\t"
"alt-1 %1, %2, %0\n\t"
".else\n\t"
"alt-2 %1, %2, %0\n\t"
".endif\n\t"
: "=x,x" (ret)
: "x,x" (a), "x,m" (b));
return ret;
}
v4sf
bar(v4sf a, v4sf b) {
return foo(a, b[0]);
}
This example should be compiled with gcc -m32 -msse -O3
and should generate two assembler error messages similar to the following:
t103.c: Assembler messages:
t103.c:24: Error: no such instruction: `alt-2 %xmm0,4(%esp),%xmm0'
t103.c:22: Error: no such instruction: `alt-1 %xmm0,%xmm1,%xmm0'
The basic idea here is the assembler checks to see whether the second operand (%2
) is an XMM register or something else, presumably a memory location. Since the GNU assembler doesn't support much in the way of operations on strings, the second operand is compared to every possible XMM register one at a time in a .rept
loop. The isxmm
macro is used to paste %xmm
and a register number together.
For your specific problem you'd probably need to rewrite it something like this:
__m512
mul_broad(__m512 a, float b) {
__m512 ret;
__m512 dummy;
asm(".ifndef isxmm\n\t"
".altmacro\n\t"
".macro ifxmm operand, rnum\n\t"
".ifc \"\\operand\",\"%%zmm\\rnum\"\n\t"
".set isxmm, 1\n\t"
".endif\n\t"
".endm\n\t"
".endif\n\t"
".set isxmm, 0\n\t"
".set regnum, 0\n\t"
".rept 32\n\t"
"ifxmm <%[b]>, %%regnum\n\t"
".set regnum, regnum + 1\n\t"
".endr\n\t"
".if isxmm\n\t"
"vbroadcastss %x[b], %[b]\n\t"
"vmulps %[a], %[b], %[ret]\n\t"
".else\n\t"
"vmulps %[b] %{1to16%}, %[a], %[ret]\n\t"
"# dummy = %[dummy]\n\t"
".endif\n\t"
: [ret] "=x,x" (ret), [dummy] "=xm,x" (dummy)
: [a] "x,xm" (a), [b] "m,[dummy]" (b));
return ret;
}