I would like to know more about the _mm_lddqu_si128 intrinsic (the lddqu instruction, since SSE3), particularly compared with the _mm_loadu_si128 intrinsic (the movdqu instruction, since SSE2).
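For reference, the two intrinsics are drop-in replacements for each other at the source level; here's a minimal sketch (the wrapper function names are mine, just for illustration):

```c
#include <emmintrin.h>   // SSE2: _mm_loadu_si128 -> movdqu
#include <pmmintrin.h>   // SSE3: _mm_lddqu_si128 -> lddqu
#include <stdint.h>

// Both do a 16-byte load with no alignment requirement;
// only the instruction the compiler emits differs.
__m128i load_movdqu(const uint8_t *p) { return _mm_loadu_si128((const __m128i *)p); }
__m128i load_lddqu (const uint8_t *p) { return _mm_lddqu_si128((const __m128i *)p); }
```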
I only discovered _mm_lddqu_si128 today. The Intel intrinsic guide says "this intrinsic may perform better than _mm_loadu_si128 when the data crosses a cache line boundary", and a comment says it "will perform better under certain circumstances, but never perform worse."
So why is it not used more (SSE3 is a pretty low bar, since all Core2 processors have it)? Why might it perform better when data crosses a cache line? Is lddqu only possibly better on a certain subset of processors, e.g. before Nehalem?
I realize I could probably find the answer by reading through an Intel manual, but I think this question may be interesting to other people.
lddqu used a different strategy than movdqu on P4, but runs identically on all other CPUs that support it. There's no particular downside (since SSE3 instructions don't take any extra bytes of machine code, and are fairly widely supported even by AMD at this point), but no upside at all unless you care about P4.

Dark Shikari (one of the x264 video encoder lead developers, responsible for a lot of SSE speedups) went into detail about it in a blog post in 2008. This is an archive.org link since the original is offline, but there's a lot of good stuff in his blog.
The most interesting point he makes is that Core2 still has slow unaligned loads: manually doing two aligned loads and a palignr can be faster, but palignr is only available with an immediate shift count. Since Core2 runs lddqu the same as movdqu, it doesn't help.
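To illustrate the palignr trick: a minimal sketch (the helper name and the MISALIGN constant are mine for illustration) that replaces one unaligned 16-byte load with two aligned loads plus _mm_alignr_epi8. The catch is exactly the one mentioned above: the byte offset has to be a compile-time constant.

```c
#include <tmmintrin.h>   // SSSE3: _mm_alignr_epi8 (palignr)
#include <stdint.h>

#define MISALIGN 4       // must be an immediate / compile-time constant

// Load 16 bytes from (aligned_base + MISALIGN) without a misaligned access:
// two aligned loads, then palignr to pick out the window we actually want.
static inline __m128i load_via_palignr(const uint8_t *aligned_base)
{
    __m128i lo = _mm_load_si128((const __m128i *)aligned_base);        // bytes 0..15
    __m128i hi = _mm_load_si128((const __m128i *)(aligned_base + 16)); // bytes 16..31
    // Concatenate hi:lo and shift right by MISALIGN bytes,
    // yielding bytes MISALIGN .. MISALIGN+15.
    return _mm_alignr_epi8(hi, lo, MISALIGN);
}
```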
Apparently Core1 does implement lddqu specially, so it's not just P4 after all.

This Intel blog post about the history of lddqu/movdqu (which I found in 2 seconds with a google search for lddqu vs movdqu, /scold @Zboson) explains:

So I guess this explains why they didn't just use that strategy to implement movdqu all the time. I guess the decoders don't have the memory-type information available, and that's when the decision has to be made on which uops to decode the instruction to. So trying to be "smart" about using the better strategy opportunistically on WB memory probably wasn't possible, even if it was desirable. (Which it isn't, because of store-forwarding.)
The summary from that blog post: