How to move double in %rax into particular qword p

2020-02-07 13:31发布

The idea is that I'd like to collect returned values of double into a vector register for processing for machine imm width at a time without storing back into memory first.

The particular processing is a vfma with other two operands that are all constexpr, so that they can simply be summoned by _mm256_setr_pd or aligned/unaligned memory load from constexpr array.

Is there a way to store double in %ymm at particular position directly from value in %rax for collecting purpose?

The target machine is Kaby Lake. More efficient of future vector instructions are welcome also.

1条回答
Summer. ? 凉城
2楼-- · 2020-02-07 13:56

Inline-assembly is usually a bad idea: modern compilers do a good job with x86 intrinsics.

Putting the bit-pattern for a double into RAX is usually also not useful, and smells like you've probably already gone down the wrong path into sub-optimal territory. Vector shuffle instructions are usually better: element-wise insert/extract instructions already cost a shuffle uop on Intel hardware, except for vmovq %xmm0, %rax to get the low element.

Also, if you're going to insert it into another vector, you should shuffle/immediate-blend. (vpermpd / vblendpd).

L1d and store-forwarding cache is fast, and even store-forwarding stalls are not a disaster. Choose wisely between ALU vs. memory to gather or scatter of data into / from SIMD vectors. Also remember that insert/extract instructions need an immediate index, so if you have a runtime index for a vector, you definitely want to store it and index. (See https://agner.org/optimize/ and other performance links in https://stackoverflow.com/tags/x86/info)

Lots of insert/extract can quickly bottleneck on port 5 on Haswell and later. See Loading an xmm from GP regs for more details, and some links to gcc bug reports where I went into more detail about strategies for different element widths on different uarches and with SSE4.1 vs. without SSE4.1, etc.


There's no PD version of extractps r/m32, xmm, imm, and insertps is a shuffle between XMM vectors.

To read/write the low lane of a YMM, you'll have to use integer vpextrq $1, %xmm0, %rax / pinsrq $1, %rax, %xmm0. Those aren't available in YMM width, so you need multiple instructions to read/write the high lane.

The VEX version vpinsrq $1, %rax, %xmm0 will zero the high lane(s) of the destination vector's full YMM or ZMM width, which is why I suggested pinsrq. On Skylake and later, it won't cause an SSE/AVX transition stall. See Using ymm registers as a "memory-like" storage location for an example (NASM syntax), and also Loading an xmm from GP regs

For the low element, use vmovq %xmm0, %rax to extract, it's cheaper than vpextrq (1 uop instead of 2).


For ZMM, my answer on that linked XMM from GP regs question shows that you can use AVX512F to merge-mask an integer register into a vector, given a mask register with a single bit set.

vpbroadcastq %rax, %zmm0{%k1}

Similarly, vpcompressq can move an element selected by a single-bit mask to the bottom for vmovq.

But for extract, if have an index instead of 1<<index to start with, you may be better off with vmovd %ecx, %zmm1 / vpermd %zmm0, %zmm1, %zmm2 / vmovq %zmm2, %rax. This trick even works with vpshufb for byte elements (with a lane at least). For lane crossing, maybe shuffle + vmovd with the high bits of the byte index, then scalar right-shift using the low bits of the index as the byte-within-word offset. See also How to use _mm_extract_epi8 function? for intrinsics for a variable-index emulation of pextrb.


High lane of a YMM, with AVX2

I think your best bet to write an element in the high lane of a YMM with AVX2 needs a scratch register:

  • vmovq %rax, %xmm0 (copy to a scratch vector)
  • shuffle into position with vinsertf128 (AVX1) or vpbroadcastq/vbroadcastsd. it's faster than vpermq/vpermpd on AMD. (But the reg-reg version is still AVX2-only)
  • vblendpd (FP) or vpblendd (integer) into the target YMM reg. Immediate blends with dword or larger element width are very cheap (1 uop for any vector ALU port on Intel).

This is only 3 uops, but 2 of them need port 5 on Intel CPUs. (So it costs the same as a vpinsrq + a blend). Only the blend is on the critical path from vector input to vector output, setting up ymm0 from rax is independent.

To read the highest element, vpermpd or vpermq $3, %ymm1, %ymm0 (AVX2), then vmovq from xmm0.

To read the 2nd-highest element, vextractf128 $1, %ymm1, %xmm0 (AVX1) and vmovq. vextractf128 is faster than vpermq/pd on AMD CPUs.


A bad alternative to avoid a scratch reg for insert would be vpermq or vperm2i128 to shuffle the qword you want to replace into the low lane, pinsrq (not vpinsrq), then vpermq to put it back in the right order. That's all shuffle uops, and pinsrq is 2 uops. (And causes an SSE/AVX transition stall on Haswell, but not Skylake). Plus all those operations are part of a dependency chain for the register you're modifying, unlike setting up a value in another register and blending.

查看更多
登录 后发表回答