The idea is that I'd like to collect returned `double` values into a vector register, a machine vector width at a time, for processing without storing them back into memory first. The particular processing is a `vfma` whose other two operands are both `constexpr`, so they can simply be materialized with `_mm256_setr_pd`, or by an aligned/unaligned memory load from a `constexpr` array.

Is there a way to store a `double` into `%ymm` at a particular position, directly from a value in `%rax`, for this collecting purpose?

The target machine is Kaby Lake. More efficient future vector instructions are also welcome.
Inline assembly is usually a bad idea: modern compilers do a good job with x86 intrinsics.
Putting the bit-pattern for a `double` into RAX is usually also not useful, and smells like you've probably already gone down the wrong path into sub-optimal territory. Vector shuffle instructions are usually better: element-wise insert/extract instructions already cost a shuffle uop on Intel hardware, except for `vmovq %xmm0, %rax` to get the low element.

Also, if you're going to insert it into another vector, you should shuffle / immediate-blend (`vpermpd` / `vblendpd`).

The L1d cache and store forwarding are fast, and even store-forwarding stalls are not a disaster. Choose wisely between ALU vs. memory for gathering or scattering data into / out of SIMD vectors. Also remember that insert/extract instructions need an immediate index, so if you have a runtime index into a vector, you definitely want to store the vector and index the memory. (See https://agner.org/optimize/ and other performance links in https://stackoverflow.com/tags/x86/info.)
Lots of inserts/extracts can quickly bottleneck on port 5 on Haswell and later. See Loading an xmm from GP regs for more details, and some links to gcc bug reports where I went into more detail about strategies for different element widths on different uarches, with SSE4.1 vs. without SSE4.1, etc.
There's no PD version of `extractps r/m32, xmm, imm`, and `insertps` is a shuffle between XMM vectors.

To read/write the low lane of a YMM, you'll have to use integer `vpextrq $1, %xmm0, %rax` / `pinsrq $1, %rax, %xmm0`. Those aren't available in YMM width, so you need multiple instructions to read/write the high lane.

The VEX version `vpinsrq $1, %rax, %xmm0, %xmm0` will zero the high lane(s) of the destination vector's full YMM or ZMM width, which is why I suggested `pinsrq`. On Skylake and later, it won't cause an SSE/AVX transition stall. See Using ymm registers as a "memory-like" storage location for an example (NASM syntax), and also Loading an xmm from GP regs.

For the low element, use `vmovq %xmm0, %rax` to extract; it's cheaper than `vpextrq` (1 uop instead of 2).

For ZMM, my answer on the linked Loading an xmm from GP regs question shows that you can use AVX512F to merge-mask an integer register into a vector, given a mask register with a single bit set.
Similarly, `vpcompressq` can move an element selected by a single-bit mask down to the bottom, for `vmovq`.

But for extract, if you have an index instead of `1<<index` to start with, you may be better off with `vmovd %ecx, %xmm1` / `vpermd %zmm0, %zmm1, %zmm2` / `vmovq %xmm2, %rax`. This trick even works with `vpshufb` for byte elements (within a lane, at least). For lane-crossing, maybe shuffle + `vmovd` with the high bits of the byte index, then a scalar right shift using the low bits of the index as the byte-within-dword offset. See also How to use _mm_extract_epi8 function? for intrinsics for a variable-index emulation of `pextrb`.
High lane of a YMM, with AVX2
I think your best bet to write an element in the high lane of a YMM with AVX2 needs a scratch register:

- `vmovq %rax, %xmm0` (copy to a scratch vector)
- `vinsertf128` (AVX1), or `vpbroadcastq` / `vbroadcastsd`: faster than `vpermq` / `vpermpd` on AMD (but the reg-reg broadcast version is still AVX2-only)
- `vblendpd` (FP) or `vpblendd` (integer) into the target YMM reg. Immediate blends with dword or larger element width are very cheap (1 uop for any vector ALU port on Intel).

This is only 3 uops, but 2 of them need port 5 on Intel CPUs. (So it costs the same as a `vpinsrq` + a blend.) Only the blend is on the critical path from vector input to vector output; setting up `ymm0` from `rax` is independent.

To read the highest element, use `vpermpd` or `vpermq $3, %ymm1, %ymm0` (AVX2), then `vmovq` from `%xmm0`.

To read the 2nd-highest element, use `vextractf128 $1, %ymm1, %xmm0` (AVX1) and `vmovq`. `vextractf128` is faster than `vpermq/pd` on AMD CPUs.

A bad alternative that avoids a scratch reg for insert would be `vpermq` or `vperm2i128` to shuffle the qword you want to replace into the low lane, `pinsrq` (not `vpinsrq`), then `vpermq` to put it back in the right order. That's all shuffle uops, and `pinsrq` is 2 uops. (It causes an SSE/AVX transition stall on Haswell, but not on Skylake.) Plus all those operations are part of a dependency chain for the register you're modifying, unlike setting up a value in another register and blending.