Can I tell gcc-style inline assembly to put my __m512i
variable into a specific zmm
register, like zmm31
?
问题:
回答1:
Like on targets where the are no specific-register constraints at all (like ARM), use local register variables to get broad constraints to pick a specific register for asm
statements. The compiler can still optimize otherwise, because the only documented guaranteed effect of a register-local is for asm
inputs/outputs.
The compiler will prefer the specified register even if there's no asm
, though. (So you can write code that appears to work but isn't safe in general with stuff like register int ebx asm("ebx"); return ebx;
. GCC documentation is what makes a behaviour guaranteed / future-proof, even if current gcc prefers using the specified register strongly enough to waste instructions when the constraint isn't compatible with the specified register, see below.)
Anyway, this use of register-asm local variables is the only thing they're guaranteed to work for:
#include <immintrin.h>
__m512i foo() {
register __m512i z31 asm("zmm31") = _mm512_set1_epi32(123);
register __m512i z30 asm("zmm30");
asm("vmovdqa64 %1, %0 # from inline asm"
: "=v"(z30)
: "v"(z31)
);
return z30;
}
On the Godbolt compiler explorer, compiles to this with clang6.0:
# clang -O3 -march=skylake-avx512
vbroadcastss .LCPI0_0(%rip), %zmm31 # zmm31 = [1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43]
vmovdqa64 %zmm31, %zmm30 # from inline asm
vmovaps %zmm30, %zmm0
retq
and gcc8.2:
# gcc -O3 -march=skylake-avx512
foo():
movl $123, %eax
vpbroadcastd %eax, %zmm31
vmovdqa64 %zmm31, %zmm30 # from inline asm
vmovdqa64 %zmm30, %zmm0
ret
Note the "v"
constraints which allow any EVEX vector register (0..31), unlike "x"
which only allows the first 16. "x"
is documented as "any SSE register", but also applies to AVX YMM registers. https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html.
Using "x"
for this didn't result in any warnings, but with gcc "x"
won vs. the register-variable declaration, so it chose %zmm2 and %zmm1 (strangely not zmm0
so an extra move was required). The register-asm declaration thus did cost us efficiency.
With clang it still used zmm31 and zmm30, apparently violating the "x"
constraint, so it would have failed to assemble if you'd used an instruction with no EVEX version on the XMM or YMM part of the register operand, like AVX2 vpcmpeqd ymm,ymm,ymm
(compare into vector, not compare into mask). (In GNU C inline asm, what're the modifiers for xmm/ymm/zmm for a single operand?).
//#ifndef __clang__
__m512i broken_with_clang() {
register __m512i z31 asm("zmm31") = _mm512_set1_epi32(123);
register __m512i z30 asm("zmm30") = _mm512_setzero_si512();
// notice that gcc still inits these in zmm31 and 30, *then* copies
// so register asm costs us efficiency.
// AVX512 only has compares into k registers, not into YMM registers.
asm("vpcmpeqd %t1, %t0, %t0 # from inline asm. input was %0"
: "+x"(z30)
: "x"(z31)
);
return z30;
}
//#endif
With clang we get an error for each operand; I guess clang doesn't support t
modifiers to get the YMM name of the register (because it fails with clang6.0 even if I remove the register ... asm()
stuff entirely.)
<source>:21:9: error: invalid operand in inline asm: 'vpcmpeqd ${1:t}, ${0:t}, ${0:t} # from inline asm. input was $0'
asm("vpcmpeqd %t1, %t0, %t0 # from inline asm. input was %0"
^
...
<source>:21:9: error: unknown token in expression
<inline asm>:1:11: note: instantiated into assembly here
vpcmpeqd , , # from inline asm. input was %zmm30
But gcc compiles it just fine:
broken_with_clang():
movl $123, %eax
vpbroadcastd %eax, %zmm31
vpxord %xmm30, %xmm30, %xmm30
vmovdqa64 %zmm30, %zmm1 # extra overhead because of register asm
vmovdqa64 %zmm31, %zmm2 # which didn't match the constraints
vpcmpeqd %ymm2, %ymm1, %ymm1 # from inline asm. input was %zmm1
vmovdqa64 %zmm1, %zmm0 # extra overhead because gcc didn't pick zmm0
ret