While trying to answer Embedded broadcasts with intrinsics and assembly, I was trying to do something like this:
__m512 mul_broad(__m512 a, float b) {
int scratch = 0;
asm(
"vbroadcastss %k[scalar], %q[scalar]\n\t" // want vbr.. %xmm0, %zmm0
"vmulps %q[scalar], %[vec], %[vec]\n\t"
// how it's done for integer registers
"movw symbol(%q[inttmp]), %w[inttmp]\n\t" // movw symbol(%rax), %ax
"movsbl %h[inttmp], %k[inttmp]\n\t" // movsx %ah, %eax
: [vec] "+x" (a), [scalar] "+x" (b), [inttmp] "=r" (scratch)
:
:
);
return a;
}
The GNU C x86 Operand Modifiers doc only specifies modifiers up to q
(DI (DoubleInt) size, 64bits). Using q
on a vector register will always bring it down to xmm
(from ymm
or zmm
).
The question:
What are the modifiers to change between sizes of vector register?
Also, are there any specific-size constraints for use with input or output operands? Something other than the generic x
which can end up being xmm, ymm, or zmm depending on the type of the expression you put in the parentheses.
Off-topic:
clang appears to have some Yi
/ Yt
constraints (not modifiers), but I can't find docs on that either. clang won't even compile this, even with the vector instructions commented out, because it doesn't like +x
as a constraint for an __m512
vector.
Background / motivation
I can get the result I want by passing in the scalar as an input operand, constrained to be in the same register as a wider output operand, but it's clumsier. (The biggest downside for this use-case is that AFAIK it has to use an operand-number, rather than the [symbolic_name]
, so it's susceptible to breakage when adding/removing output constraints.)
// does what I want, by using a paired output and input constraint
__m512 mul_broad(__m512 a, float b) {
__m512 tmpvec;
asm(
"vbroadcastss %[scalar], %[tmpvec]\n\t"
"vmulps %[tmpvec], %[vec], %[vec]\n\t"
: [vec] "+x" (a), [tmpvec] "=x" (tmpvec)
: [scalar] "1" (b)
:
);
return a;
}
godbolt link
Also, I think this whole approach to the problem I was trying to solve is going to be a dead end because Multi-Alternative constraints don't let you give different asm for the different constraint patterns. I was hoping to have x
and r
constraints end up emitting a vbroadcastss
from a register, while m
constraints end up emitting vmulps (mem_src){1to16}, %zmm_src2, %zmm_dst
(a folded broadcast-load). The purpose of doing this with inline asm is that gcc doesn't yet know how to fold set1()
memory operands into broadcast-loads (but clang does).
Anyway, this specific question is about operand modifiers and constraints for vector registers. Please focus on that, but comments and asides in answers are welcome on the other issue. (Or better, just comment / answer on Z Boson's question about embedded broadcasts.)