Using a specific zmm register in inline asm

2019-07-29 03:59发布

问题:

Can I tell gcc-style inline assembly to put my __m512i variable into a specific zmm register, like zmm31?

回答1:

Like on targets where the are no specific-register constraints at all (like ARM), use local register variables to get broad constraints to pick a specific register for asm statements. The compiler can still optimize otherwise, because the only documented guaranteed effect of a register-local is for asm inputs/outputs.

The compiler will prefer the specified register even if there's no asm, though. (So you can write code that appears to work but isn't safe in general with stuff like register int ebx asm("ebx"); return ebx;. GCC documentation is what makes a behaviour guaranteed / future-proof, even if current gcc prefers using the specified register strongly enough to waste instructions when the constraint isn't compatible with the specified register, see below.)

Anyway, this use of register-asm local variables is the only thing they're guaranteed to work for:

#include <immintrin.h>
__m512i foo() {
    register __m512i z31 asm("zmm31") = _mm512_set1_epi32(123);
    register __m512i z30 asm("zmm30");

    asm("vmovdqa64 %1, %0  # from inline asm"
        : "=v"(z30)
        : "v"(z31)
       );
    return z30;
}

On the Godbolt compiler explorer, compiles to this with clang6.0:

    # clang -O3 -march=skylake-avx512
    vbroadcastss    .LCPI0_0(%rip), %zmm31 # zmm31 = [1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43,1.72359711E-43]
    vmovdqa64       %zmm31, %zmm30        # from inline asm
    vmovaps %zmm30, %zmm0
    retq

and gcc8.2:

# gcc -O3 -march=skylake-avx512
foo():
    movl    $123, %eax
    vpbroadcastd    %eax, %zmm31
    vmovdqa64 %zmm31, %zmm30  # from inline asm
    vmovdqa64       %zmm30, %zmm0
    ret

Note the "v" constraints which allow any EVEX vector register (0..31), unlike "x" which only allows the first 16. "x" is documented as "any SSE register", but also applies to AVX YMM registers. https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html.

Using "x" for this didn't result in any warnings, but with gcc "x" won vs. the register-variable declaration, so it chose %zmm2 and %zmm1 (strangely not zmm0 so an extra move was required). The register-asm declaration thus did cost us efficiency.

With clang it still used zmm31 and zmm30, apparently violating the "x" constraint, so it would have failed to assemble if you'd used an instruction with no EVEX version on the XMM or YMM part of the register operand, like AVX2 vpcmpeqd ymm,ymm,ymm (compare into vector, not compare into mask). (In GNU C inline asm, what're the modifiers for xmm/ymm/zmm for a single operand?).

//#ifndef __clang__
__m512i broken_with_clang() {
    register __m512i z31 asm("zmm31") = _mm512_set1_epi32(123);
    register __m512i z30 asm("zmm30") = _mm512_setzero_si512();
    // notice that gcc still inits these in zmm31 and 30, *then* copies
    // so register asm costs us efficiency.

    // AVX512 only has compares into k registers, not into YMM registers.
    asm("vpcmpeqd %t1, %t0, %t0  # from inline asm. input was %0"
        : "+x"(z30)
        : "x"(z31)
       );
    return z30;
}
//#endif

With clang we get an error for each operand; I guess clang doesn't support t modifiers to get the YMM name of the register (because it fails with clang6.0 even if I remove the register ... asm() stuff entirely.)

<source>:21:9: error: invalid operand in inline asm: 'vpcmpeqd ${1:t}, ${0:t}, ${0:t}  # from inline asm. input was $0'
    asm("vpcmpeqd %t1, %t0, %t0  # from inline asm. input was %0"
        ^
...
<source>:21:9: error: unknown token in expression
<inline asm>:1:11: note: instantiated into assembly here
        vpcmpeqd , ,   # from inline asm. input was %zmm30

But gcc compiles it just fine:

broken_with_clang():
    movl    $123, %eax
    vpbroadcastd    %eax, %zmm31
    vpxord  %xmm30, %xmm30, %xmm30

    vmovdqa64       %zmm30, %zmm1    # extra overhead because of register asm
    vmovdqa64       %zmm31, %zmm2    # which didn't match the constraints

    vpcmpeqd %ymm2, %ymm1, %ymm1  # from inline asm. input was %zmm1

    vmovdqa64       %zmm1, %zmm0     # extra overhead because gcc didn't pick zmm0
    ret