Work around lack of Yz machine constraint under Cl

We use inline assembly to make SHA instructions available if __SHA__ is not defined. Under GCC we use:

GCC_INLINE __m128i GCC_INLINE_ATTRIB
MM_SHA256RNDS2_EPU32(__m128i a, const __m128i b, const __m128i c)
{
    asm ("sha256rnds2 %2, %1, %0" : "+x"(a) : "xm"(b), "Yz" (c));
    return a;
}

Clang does not consume GCC's Yz constraint (see Clang 3.2 Issue 13199 and Clang 3.9 Issue 32727), which is required by the sha256rnds2 instruction:

Yz

    First SSE register (%xmm0).

We added a mov for Clang:

asm ("mov %2, %%xmm0; sha256rnds2 %%xmm0, %1, %0" : "+x"(a) : "xm"(b), "x" (c) : "xmm0");

Performance is off by about 3 cycles per byte. On my 2.2 GHz Celeron J3455 test machine (Goldmont with SHA extensions), that's about 230 MiB/s. Its non-trivial.

Looking at the disassembly, Clang is not optimizing around SHA's k when two rounds are performed:

Breakpoint 2, SHA256_SSE_SHA_HashBlocks (state=0xaaa3a0,
    data=0xaaa340, length=0x40) at sha.cpp:1101
1101        STATE1 = _mm_loadu_si128((__m128i*) &state[4]);
(gdb) disass
Dump of assembler code for function SHA256_SSE_SHA_HashBlocks(unsigned int*, unsigned int const*, unsigned long):
   0x000000000068cdd0 <+0>:     sub    $0x308,%rsp
   0x000000000068cdd7 <+7>:     movdqu (%rdi),%xmm0
   0x000000000068cddb <+11>:    movdqu 0x10(%rdi),%xmm1
   ...
   0x000000000068ce49 <+121>:   movq   %xmm2,%xmm0
   0x000000000068ce4d <+125>:   sha256rnds2 %xmm0,0x2f0(%rsp),%xmm1
   0x000000000068ce56 <+134>:   pshufd $0xe,%xmm2,%xmm3
   0x000000000068ce5b <+139>:   movdqa %xmm13,%xmm2
   0x000000000068ce60 <+144>:   movaps %xmm1,0x2e0(%rsp)
   0x000000000068ce68 <+152>:   movq   %xmm3,%xmm0
   0x000000000068ce6c <+156>:   sha256rnds2 %xmm0,0x2e0(%rsp),%xmm2
   0x000000000068ce75 <+165>:   movdqu 0x10(%rsi),%xmm3
   0x000000000068ce7a <+170>:   pshufb %xmm8,%xmm3
   0x000000000068ce80 <+176>:   movaps %xmm2,0x2d0(%rsp)
   0x000000000068ce88 <+184>:   movdqa %xmm3,%xmm4
   0x000000000068ce8c <+188>:   paddd  0x6729c(%rip),%xmm4        # 0x6f4130
   0x000000000068ce94 <+196>:   movq   %xmm4,%xmm0
   0x000000000068ce98 <+200>:   sha256rnds2 %xmm0,0x2d0(%rsp),%xmm1
   ...

For example, 0068ce8c though 0068ce98 should have been:

paddd  0x6729c(%rip),%xmm0        # 0x6f4130
sha256rnds2 %xmm0,0x2d0(%rsp),%xmm1

I'm guessing our choice of inline asm instructions are a bit off.

How do we work around the lack of Yz machine constraint under Clang? What pattern avoids the intermediate move in optimized code?

Attempting to use Explicit Register Variable:

const __m128i k asm("xmm0") = c;
asm ("sha256rnds2 %2, %1, %0" : "+x"(a) : "xm"(b), "x" (k));
return a;

Results in:

In file included from sha.cpp:24:
./cpu.h:831:22: warning: ignored asm label 'xmm0' on automatic variable
        const __m128i k asm("xmm0") = c;
                            ^
./cpu.h:833:7: error: invalid operand for instruction
        asm ("sha256rnds2 %2, %1, %0" : "+x"(a) : "xm"(b), "x" (k));
             ^
<inline asm>:1:21: note: instantiated into assembly here
        sha256rnds2 %xmm1, 752(%rsp), %xmm0
                           ^~~~~~~~~~
In file included from sha.cpp:24:
./cpu.h:833:7: error: invalid operand for instruction
        asm ("sha256rnds2 %2, %1, %0" : "+x"(a) : "xm"(b), "x" (k));
             ^
<inline asm>:1:21: note: instantiated into assembly here
        sha256rnds2 %xmm3, 736(%rsp), %xmm1
                           ^~~~~~~~~~
...

I created this answer based on the tag inline assembly with no specific language mentioned. Extended assembly templates already assume use of extensions to the languages.

If the Yz constraint isn't available you can attempt to create a temporary variable to tell CLANG what register to use rather than a constraint. You can do this through what is called an Explicit Register Variable:

You can define a local register variable and associate it with a specified register like this:
 register int *foo asm ("r12");
Here r12 is the name of the register that should be used. Note that this is the same syntax used for defining global register variables, but for a local variable the declaration appears within a function. The register keyword is required, and cannot be combined with static. The register name must be a valid register name for the target platform.

In your case you wish to force usage of xmm0 register. You could assign the c parameter to a temporary variable using an explicit register and use that temporary as a parameter to the Extended Inline Assembly. This is the primary purpose of explicit registers in GCC/CLANG.

GCC_INLINE __m128i GCC_INLINE_ATTRIB
MM_SHA256RNDS2_EPU32(__m128i a, const __m128i b, const __m128i c)
{
   register const __m128i tmpc asm("xmm0") = c;
   __asm__("sha256rnds2 %2, %1, %0" : "+x"(a) : "x"(b), "x" (tmpc));
    return a;
}

The compiler should be able to provide some optimizations now since it has more knowledge as to how the xmm0 register is to be used.

When you placed mov %2, %%xmm0; into the template CLANG (and GCC) do not do any optimizations on the instructions. Basic Assembly and Extended Assembly templates are a black box that it only knows how to do basic substitution based on the constraints.

Here's a disassembly using the method above. It was compiled with clang++ and -std=c++03. The extra moves are no longer present:

Breakpoint 1, SHA256_SSE_SHA_HashBlocks (state=0x7fffffffae60,
    data=0x7fffffffae00, length=0x40) at sha.cpp:1101
1101        STATE1 = _mm_loadu_si128((__m128i*) &state[4]);
(gdb) disass
Dump of assembler code for function SHA256_SSE_SHA_HashBlocks(unsigned int*, unsigned int const*, unsigned long):
   0x000000000068cf60 <+0>:     sub    $0x308,%rsp
   0x000000000068cf67 <+7>:     movdqu (%rdi),%xmm0
   0x000000000068cf6b <+11>:    movdqu 0x10(%rdi),%xmm1
...
   0x000000000068cfe6 <+134>:   paddd  0x670e2(%rip),%xmm0        # 0x6f40d0
   0x000000000068cfee <+142>:   sha256rnds2 %xmm0,0x2f0(%rsp),%xmm2
   0x000000000068cff7 <+151>:   pshufd $0xe,%xmm0,%xmm1
   0x000000000068cffc <+156>:   movdqa %xmm1,%xmm0
   0x000000000068d000 <+160>:   movaps %xmm2,0x2e0(%rsp)
   0x000000000068d008 <+168>:   sha256rnds2 %xmm0,0x2e0(%rsp),%xmm3
   0x000000000068d011 <+177>:   movdqu 0x10(%rsi),%xmm5
   0x000000000068d016 <+182>:   pshufb %xmm9,%xmm5
   0x000000000068d01c <+188>:   movaps %xmm3,0x2d0(%rsp)
   0x000000000068d024 <+196>:   movdqa %xmm5,%xmm0
   0x000000000068d028 <+200>:   paddd  0x670b0(%rip),%xmm0        # 0x6f40e0
   0x000000000068d030 <+208>:   sha256rnds2 %xmm0,0x2d0(%rsp),%xmm2
...