A faster integer SSE unaligned load that's rarely used

Published 2019-02-17 23:04


I would like to know more about the _mm_lddqu_si128 intrinsic (lddqu instruction, since SSE3), particularly compared with the _mm_loadu_si128 intrinsic (movdqu instruction, since SSE2).

I only discovered _mm_lddqu_si128 today. The Intel intrinsics guide says

this intrinsic may perform better than _mm_loadu_si128 when the data crosses a cache line boundary

and a comment says it

will perform better under certain circumstances, but never perform worse.

So why is it not used more (SSE3 is a pretty low bar, since all Core2 processors have it)? Why may it perform better when data crosses a cache line boundary? Is lddqu only possibly better on a certain subset of processors, e.g. before Nehalem?
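To make the cache-line-splitting case concrete, here is a minimal sketch (my own illustration, not from the intrinsics guide; the buffer name and the offset of 60 are made up):

```c
#include <pmmintrin.h>   /* SSE3: _mm_lddqu_si128 (pulls in the SSE2 headers too) */
#include <stdalign.h>
#include <stdint.h>

static alignas(64) uint8_t buf[128];   /* buf[0] starts a 64-byte cache line */

__m128i load_straddling_a_line(void)
{
    /* Bytes 60..75 span the boundary between the first and second cache line:
       this is the case where the guide says lddqu "may perform better". */
    const __m128i *p = (const __m128i *)(buf + 60);
    return _mm_lddqu_si128(p);
    /* _mm_loadu_si128(p) is a drop-in replacement; the prototypes are identical. */
}
```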

I realize I could probably find the answer by reading through an Intel manual, but I think this question may be interesting to other people.

1 Answer
Explosion°爆炸
Answered 2019-02-17 23:15

lddqu used a different strategy than movdqu on P4, but it runs identically to movdqu on all other CPUs that support it. There's no particular downside (since SSE3 instructions don't take any extra bytes of machine code, and are fairly widely supported even by AMD at this point), but no upside at all unless you care about P4.

Dark Shikari (one of the x264 video encoder lead developers, responsible for a lot of SSE speedups) went into detail about it in a blog post in 2008. This is an archive.org link since the original is offline, but there's a lot of good stuff in his blog.

The most interesting point he makes is that Core2 still has slow unaligned loads, where manually doing two aligned loads and combining them with palignr can be faster; but palignr only takes an immediate shift count, so that trick only works when the misalignment is known at compile time. Since Core2 runs lddqu the same as movdqu, lddqu doesn't help there.
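A minimal sketch of that two-aligned-loads + palignr idea (my own illustration; the function name and the hard-coded 4-byte offset are made up, and it needs SSSE3 for _mm_alignr_epi8):

```c
#include <tmmintrin.h>   /* SSSE3: _mm_alignr_epi8 (palignr) */

/* Emulate an unaligned 16-byte load at (aligned_base + 4 bytes) with two
 * aligned loads plus palignr.  The byte offset must be a compile-time
 * constant, because palignr takes only an immediate shift count. */
static inline __m128i load_offset4(const __m128i *aligned_base)
{
    __m128i lo = _mm_load_si128(aligned_base);       /* bytes 0..15  */
    __m128i hi = _mm_load_si128(aligned_base + 1);   /* bytes 16..31 */
    /* Concatenate hi:lo and shift right by 4 bytes -> bytes 4..19. */
    return _mm_alignr_epi8(hi, lo, 4);
}
```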

Apparently Core1 does implement lddqu specially, so it's not just P4 after all.


This Intel blog post about the history of lddqu/movdqu (which I found in 2 seconds with google for lddqu vs movdqu, /scold @Zboson) explains:

(on P4 only): The instruction works by loading a 32-byte block aligned on a 16-byte boundary, extracting the 16 bytes corresponding to the unaligned access.

Because the instruction loads more bytes than requested, some usage restrictions apply. Lddqu should be avoided on Uncached (UC) and Write-Combining (USWC) memory regions. Also, by its implementation, lddqu should be avoided in situations where store-load forwarding is expected.

So I guess this explains why they didn't just use that strategy to implement movdqu all the time.

I guess the decoders don't have the memory-type information available, and that's when the decision has to be made about which uops to decode the instruction into. So trying to be "smart" about opportunistically using the better strategy on WB memory probably wasn't possible, even if it was desirable. (Which it isn't, because of the store-forwarding problem.)


The summary from that blog post:

starting from the Intel Core 2 brand (Core microarchitecture, from mid 2006, Merom CPUs and later) onward: lddqu does the same thing as movdqu

In other words:
* if the CPU supports Supplemental Streaming SIMD Extensions 3 (SSSE3) -> lddqu does the same thing as movdqu,
* if the CPU doesn't support SSSE3 but supports SSE3 -> go for lddqu (and note the story about memory types); see the usage sketch after this list.
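In practice that boils down to: if you're compiling with SSE3 enabled anyway, you can just use _mm_lddqu_si128 for your unaligned loads and let it be movdqu-equivalent on everything modern. A minimal sketch, assuming the element count is a multiple of 4 (the function name and loop are mine, not from the blog post):

```c
#include <pmmintrin.h>   /* SSE3: _mm_lddqu_si128 */
#include <stddef.h>
#include <stdint.h>

/* Sum 32-bit ints from a possibly-unaligned buffer.  Swapping
 * _mm_lddqu_si128 for _mm_loadu_si128 would emit movdqu instead of lddqu,
 * with identical behaviour on anything newer than P4/Core1. */
int32_t sum_i32(const int32_t *p, size_t n)   /* n assumed to be a multiple of 4 */
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_epi32(acc, _mm_lddqu_si128((const __m128i *)(p + i)));
    /* Horizontal sum of the four lanes. */
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(acc);
}
```

This needs SSE3 enabled at compile time (e.g. gcc -O2 -msse3); without that, GCC and Clang will reject _mm_lddqu_si128.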
