I had been using _mm256_lddqu_si256 based on an example I found online. Later I discovered _mm256_loadu_si256. The Intel Intrinsics Guide only states that the lddqu version may perform better when crossing a cache line boundary. What might be the advantages of loadu? In general, how are these functions different?
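For reference, a minimal sketch of the two intrinsics side by side (the function name and the offset are invented for illustration); both read 32 bytes from a possibly unaligned address:

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical example: load 32 bytes starting at a deliberately odd offset. */
void load_both_ways(const uint8_t *buf, __m256i *out_a, __m256i *out_b)
{
    const __m256i *p = (const __m256i *)(buf + 1);   /* misaligned on purpose */
    *out_a = _mm256_lddqu_si256(p);   /* compiles to vlddqu  ymm, m256 */
    *out_b = _mm256_loadu_si256(p);   /* compiles to vmovdqu ymm, m256 */
    /* On every AVX CPU the two results are identical. */
}
```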
There's no reason to ever use _mm256_lddqu_si256; consider it a synonym for _mm256_loadu_si256. lddqu only exists for historical reasons as x86 evolved towards having better unaligned vector load support, and CPUs that support the AVX version run them identically. There's no AVX512 version.

Compilers do still respect the lddqu intrinsic and emit that instruction, so you could use it if you want your code to run identically but have a different checksum or different machine-code bytes.

No x86 microarchitecture runs vlddqu any differently from vmovdqu. I.e. the two opcodes probably decode to the same internal uop on all AVX CPUs. They probably always will, unless some very-low-power or specialized microarchitecture comes along without efficient unaligned vector loads (which have been a thing since Nehalem). Compilers never use vlddqu when auto-vectorizing.

lddqu was different from movdqu on Pentium 4. See History of … one CPU instructions: Part 1. LDDQU/movdqu explained.

lddqu is allowed to (and on P4 does) do two aligned 16B loads and take a window of that data. movdqu architecturally only ever loads from the expected 16 bytes. This has implications for store-forwarding: if you're loading data that was just stored with an unaligned store, use movdqu, because store-forwarding only works for loads that are fully contained within a previous store. But otherwise you generally always wanted to use lddqu. (This is why they didn't just make movdqu always use "the good way", and instead introduced a new instruction for programmers to worry about. But luckily for us, they changed the design so we don't have to worry about which unaligned load instruction to use anymore.)

It also has implications for correctness of observable behaviour on UnCacheable (UC) or Uncacheable Speculative Write-Combining (USWC, aka WC) memory types (which might have MMIO registers behind them).
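A minimal sketch of that store-forwarding case (the function, buffer, and offset are made up for illustration): the reload covers exactly the bytes just written, so a movdqu-style load can be forwarded from the store buffer, while an lddqu-style pair of wider aligned loads on P4 would not be fully contained in the store:

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical round-trip through memory at an unaligned address. */
void roundtrip(uint8_t *buf, __m128i v)
{
    _mm_storeu_si128((__m128i *)(buf + 3), v);   /* unaligned 16-byte store */

    /* Reload exactly the stored bytes: this load is fully contained in the
       preceding store, so a movdqu-style load can be store-forwarded. */
    __m128i r = _mm_loadu_si128((const __m128i *)(buf + 3));
    (void)r;
}
```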
There's no code-size difference between the two asm instructions: lddqu (F2 0F F0 /r) and movdqu (F3 0F 6F /r) encode to the same number of bytes, and so do their VEX forms vlddqu and vmovdqu.
On Core2 and later, there's no reason to use lddqu, but also no downside vs. movdqu. Intel dropped the special lddqu stuff for Core2, so both options suck equally.

On Core2 specifically, avoiding cache-line splits in software with two aligned loads and SSSE3 palignr is sometimes a win vs. movdqu, especially on 2nd-gen Core2 (Penryn), where palignr is only one shuffle uop instead of 2 on Merom/Conroe. (Penryn widened the shuffle execution unit to 128b.)

See Dark Shikari's 2009 Diary Of An x264 Developer blog post, Cacheline splits, take two, for more about unaligned-load strategies back in the bad old days.
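A minimal sketch of that two-aligned-loads + palignr strategy, assuming the base pointer is 16-byte aligned and the misalignment (4 bytes here) is known at compile time; the helper name is made up:

```c
#include <tmmintrin.h>   /* SSSE3 */

/* Hypothetical helper: emulate an unaligned 16-byte load at base+4 with two
   aligned loads and a byte-wise shift, instead of a potentially
   cache-line-splitting movdqu.  palignr takes an immediate, so the offset
   must be a compile-time constant. */
static __m128i load_offset4(const __m128i *aligned_base)
{
    __m128i lo = _mm_load_si128(aligned_base);       /* bytes [0, 16)  */
    __m128i hi = _mm_load_si128(aligned_base + 1);   /* bytes [16, 32) */
    return _mm_alignr_epi8(hi, lo, 4);               /* bytes [4, 20)  */
}
```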
The generation after Core2 is Nehalem, where movdqu is a single-uop instruction with dedicated hardware support in the load ports. It's still useful to tell compilers when pointers are aligned (especially for auto-vectorization, and especially without AVX), but it's not a performance disaster for them to just use movdqu everywhere, especially if the data is in fact aligned at run-time.
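For example, a sketch of one way to tell GCC/Clang about alignment, using the __builtin_assume_aligned builtin (the function itself is invented for illustration):

```c
#include <stddef.h>

/* Hypothetical example: promise the compiler that both pointers are 32-byte
   aligned, so the auto-vectorizer can use aligned loads/stores (or fold a
   load into an ALU instruction's memory operand) instead of emitting extra
   alignment-handling code.  Matters most on pre-Nehalem CPUs and without AVX. */
void add_arrays(float *dst, const float *src, size_t n)
{
    float *d = (float *)__builtin_assume_aligned(dst, 32);
    const float *s = (const float *)__builtin_assume_aligned(src, 32);
    for (size_t i = 0; i < n; i++)
        d[i] += s[i];
}
```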
I don't know why Intel even made an AVX version of lddqu at all. I guess it's simpler for the decoders to just treat that opcode as an alias for movdqu / vmovdqu in all modes (with legacy SSE prefixes, or with AVX128 / AVX256), instead of having that opcode decode to something else with VEX prefixes.

All current AVX-supporting CPUs have efficient hardware unaligned-load / store support that handles it as optimally as possible. E.g. when the data is aligned at runtime, there's exactly zero performance difference vs. vmovdqa.

This was not the case before Nehalem; movdqu and lddqu used to decode to multiple uops to handle potentially-misaligned addresses, instead of putting hardware support for that right in the load ports, where a single uop can activate it instead of faulting on unaligned addresses.

However, Intel's ISA ref manual entry for lddqu says the 256b version can load up to 64 bytes (implementation dependent). IDK how much of that was written deliberately, and how much of it just came from prepending (V) when updating the entry for AVX. I don't think Intel's optimization manual recommends really using vlddqu anywhere, but I didn't check.

There is no AVX512 version of vlddqu, so I think that means Intel has decided that an alternate-strategy unaligned load instruction is no longer useful, and isn't even worth keeping their options open.