I have code with a lot of punpckl, pextrd and pinsrd that rotates an 8x8 byte matrix as part of a larger routine that rotates a B/W image with loop tiling.
I profiled it with IACA to see whether an AVX2 routine would be worth writing, and surprisingly the code is almost twice as slow on Haswell/Skylake as on IVB (IVB: 19.8 cycles, HSW/SKL: 36 cycles). (IVB and HSW measured with IACA 2.1, SKL with 3.0; HSW gives the same number with 3.0.)
From the IACA output I guess the difference is that IVB uses ports 1 and 5 for the above instructions, while Haswell only uses port 5.
I googled a bit, but couldn't find an explanation. Is Haswell really slower with legacy SSE, or did I just hit some extreme corner case? Any suggestions to dodge this bullet (other than AVX2, which is a known option but postponed for now because it requires a toolchain update)?
General remarks or suggested improvements are also welcome.
// r8 and r9 are #bytes to go to the next line in resp. src and dest
// r12=3*r8 r13=3*r9
// load 8x8 bytes into 4 registers, bytes interleaved.
movq xmm1,[rcx]
movq xmm4,[rcx+2*r8]
PUNPCKLBW xmm1,xmm4 // 0 2 0 2 0 2
movq xmm7,[rcx+r8]
movq xmm6,[rcx+r12]
PUNPCKLBW xmm7,xmm6 // 1 3 1 3 1 3
movdqa xmm2,xmm1
punpcklbw xmm1,xmm7 // 0 1 2 3 0 1 2 3 in xmm1:xmm2
punpckhbw xmm2,xmm7
lea rcx,[rcx+4*r8]
// same for 4..7
movq xmm3,[rcx]
movq xmm5,[rcx+2*r8]
PUNPCKLBW xmm3,xmm5
movq xmm7,[rcx+r8]
movq xmm8,[rcx+r12]
PUNPCKLBW xmm7,xmm8
movdqa xmm4,xmm3
punpcklbw xmm3,xmm7
punpckhbw xmm4,xmm7
// now we join one "low" dword from XMM1:xmm2 with one "high" dword
// from XMM3:xmm4
movdqa xmm5,xmm1
pextrd eax,xmm3,0
pinsrd xmm5,eax,1
movq [rdx],xmm5
movdqa xmm5,xmm3
pextrd eax,xmm1,1
pinsrd xmm5,eax,0
movq [rdx+r9],xmm5
movdqa xmm5,xmm1
pextrd eax,xmm3,2
pinsrd xmm5,eax,3
MOVHLPS xmm6,xmm5
movq [rdx+2*r9],xmm6
movdqa xmm5,xmm3
pextrd eax,xmm1,3
pinsrd xmm5,eax,2
MOVHLPS xmm6,xmm5
movq [rdx+r13],xmm6
lea rdx,[rdx+4*r9]
movdqa xmm5,xmm2
pextrd eax,xmm4,0
pinsrd xmm5,eax,1
movq [rdx],xmm5
movdqa xmm5,xmm4
pextrd eax,xmm2,1
pinsrd xmm5,eax,0
movq [rdx+r9],xmm5
movdqa xmm5,xmm2
pextrd eax,xmm4,2
pinsrd xmm5,eax,3
MOVHLPS xmm6,xmm5
movq [rdx+2*r9],xmm6
movdqa xmm5,xmm4
pextrd eax,xmm2,3
pinsrd xmm5,eax,2
MOVHLPS xmm6,xmm5
movq [rdx+r13],xmm6
lea rdx,[rdx+4*r9]
purpose:
It is really rotating images from a camera for computer-vision purposes. In some (heavier) apps the rotation is postponed and done only at display time (OpenGL); in others it is easier to rotate the input than to adapt the algorithms.
Updated code: I posted some final code here. The speedup was very dependent on the size of the input: large on small images, but still a factor of two on larger ones compared to loop-tiling HLL code with a 32x32 tile (same algorithm as the linked asm code).
TL:DR: use punpckldq / punpckhdq to save a massive number of shuffles in the dword-rearranging step, exactly like the transpose code in A better 8x8 bytes matrix transpose with SSE?
Your memory layout requires storing the low / high 8 bytes of each vector result separately, which you can do efficiently with movq [rdx], xmm / movhps [rdx+r9], xmm.
The code is almost twice as slow on Haswell/Skylake as on IVB
Your code bottlenecks heavily on shuffle throughput.
Haswell only has one shuffle execution unit, on port 5. SnB/IvB have two integer shuffle units (but still only one FP shuffle unit). See Agner Fog's instruction tables and optimization guide / microarch guide. Counting the shuffles in your block: the 8 punpck instructions, the shuffle half of each of the 8 pextrd and 8 pinsrd, and the 4 movhlps add up to roughly 28 shuffle uops per 8x8 tile, all of which have to go through port 5 on Haswell, while IvB can spread them over ports 1 and 5.
I see you already found David Kanter's excellent Haswell microarch write-up.
It's very easy to bottleneck on shuffle (or port5 in general) throughput for code like this, and it often gets worse with AVX / AVX2 because many shuffles are in-lane only. AVX for 128-bit ops might help, but I don't think you'll gain anything from shuffling into 256b vectors and then shuffling them apart again into 64-bit chunks. If you could load or store contiguous 256b chunks, it would be worth trying.
You have some simple missed-optimizations, even before we think about major changes:
MOVHLPS xmm6,xmm5
movq [rdx+r13],xmm6
should be movhps [rdx+r13],xmm5, storing the high half of xmm5 directly. On Sandybridge and Haswell, movhps is a pure store uop, with no shuffle uop required.
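Applied to the first occurrence in your code, a sketch of what that replacement looks like:
movdqa xmm5, xmm1
pextrd eax, xmm3, 2
pinsrd xmm5, eax, 3
movhps [rdx+2*r9], xmm5   ; replaces MOVHLPS xmm6,xmm5 / movq [rdx+2*r9],xmm6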
pextrd eax,xmm3,0 is always worse than movd eax, xmm3; never use pextrd with an immediate 0. (Also, pextrd directly to memory might be a win. You could do a 64-bit movq store and then overwrite half of it with a 32-bit pextrd store. You might then bottleneck on store throughput. Also, on Sandybridge, indexed store addressing modes don't stay micro-fused, so more stores would hurt your total fused-domain uop throughput. But Haswell doesn't have that problem for stores, only for some indexed loads, depending on the instruction.) If you end up using more stores in some places and more shuffles in others, use the extra stores where you have single-register addressing modes.
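For example, a sketch of the pextrd-to-memory idea for the first two output rows (producing the same [ a i ] / [ b j ] layout discussed below); whether this wins depends on how close you already are to the store-throughput limit:
movq   [rdx], xmm1          ; store dwords 0-1 of xmm1
movd   [rdx+4], xmm3        ; overwrite the second dword with xmm3[0] (movd, not pextrd, for element 0)
pextrd [rdx+r9], xmm1, 1    ; next row: xmm1[1]
pextrd [rdx+r9+4], xmm3, 1  ; and xmm3[1] next to it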
Source and destination format is not a degree of freedom in image manipulation.
Depends what you're doing. x264 (the open-source h.264 video encoder) copies 8x8 blocks into contiguous buffers before working with them repeatedly, so the stride between rows is an assemble-time constant. This saves passing a stride in a register and doing stuff like you're doing with [rcx+2*r8] / [rcx+r8]. It also lets you load two rows with one movdqa. And it gives you good memory locality for accessing 8x8 blocks.
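A minimal sketch of what the loads look like once a block lives in a contiguous buffer (block is a hypothetical 16-byte-aligned, 64-byte scratch buffer, so the row stride is a constant 8):
movdqa xmm1, [block]      ; rows 0 and 1 in one 16-byte load
movdqa xmm2, [block+16]   ; rows 2 and 3
movdqa xmm3, [block+32]   ; rows 4 and 5
movdqa xmm4, [block+48]   ; rows 6 and 7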
Of course it's probably not a win to spend time copying in/out of this format if this rotation is all you're doing with an 8x8 block of pixels. FFmpeg's h.264 decoder (which uses many of the same asm primitives as x264) doesn't use this, but IDK if that's because nobody ever bothered to port the updated x264 asm or if it's just not worth it.
// now we join one "low" dword from XMM1:xmm2 with one "high" dword
// from XMM3:xmm4
Extract/insert to/from integer registers is not very efficient; pinsrd and pextrd are 2 uops each, and one of those uops is a shuffle. You might even come out ahead of your current code just using pextrd to memory in 32-bit chunks.
Also consider using SSSE3 pshufb, which can put your data anywhere it needs to be and zero the other elements. This can set you up for merging with por. (You might use pshufb instead of punpcklbw.)
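A minimal sketch of the pshufb + por merging idea; shuf_lo and shuf_hi are hypothetical 16-byte shuffle-control constants (any control byte with its top bit set zeroes that destination byte):
movdqa xmm5, xmm1
pshufb xmm5, [shuf_lo]    ; keep bytes 0-3 (dword 0) in place, zero bytes 4-15
movdqa xmm6, xmm3
pshufb xmm6, [shuf_hi]    ; move bytes 0-3 up into bytes 4-7, zero the rest
por    xmm5, xmm6         ; merged: xmm1's dword 0 next to xmm3's dword 0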
Another option is to use shufps to combine data from two sources (you might need another shuffle after that; see the sketch below), or simply use punpckldq.
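A sketch of the shufps route. It costs two shuffles, so the punpckldq version further down is still cheaper; the immediates are spelled out in the comments:
movaps xmm5, xmm1
shufps xmm5, xmm3, 0x44   ; 0b01000100: xmm5 = [ xmm1[0] xmm1[1] xmm3[0] xmm3[1] ]
pshufd xmm5, xmm5, 0xD8   ; 0b11011000: reorder to [ xmm1[0] xmm3[0] xmm1[1] xmm3[1] ]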
// "low" dwords from XMM1:xmm2
// high dwords from XMM3:xmm4
; xmm1: [ a b c d ] xmm2: [ e f g h ]
; xmm3: [ i j k l ] xmm4: [ m n o p ]
; want: [ a i b j ] / [ c k d l ] / ... I think.
;; original code, to be replaced:
; movdqa xmm5,xmm1 ; xmm5 = [ a b c d ]
; pextrd eax,xmm3,0 ; eax = i
; pinsrd xmm5,eax,1 ; xmm5 = [ a i ... ]
; movq [rdx],xmm5
; movdqa xmm5,xmm3 ; xmm5 = [ i j k l ]
; pextrd eax,xmm1,1 ; eax = b
; pinsrd xmm5,eax,0 ; xmm5 = [ b j ... ]
; movq [rdx+r9],xmm5
Replace with this:
movdqa xmm5, xmm1
punpckldq xmm5, xmm3 ; xmm5 = [ a i b j ]
movq [rdx], xmm5
movhps [rdx+r9], xmm5 ; still a pure store, doesn't cost a shuffle
So we've replaced 4 shuffle uops with 1, and lowered the total uop count from 12 fused-domain uops (Haswell) to 4. (Or on Sandybridge, from 13 to 5 because the indexed store doesn't stay micro-fused).
Use punpckhdq for [ c k d l ], where it's even better because we're replacing movhlps as well. A sketch of the replacement follows the quoted original below.
; movdqa xmm5,xmm1 ; xmm5 = [ a b c d ]
; pextrd eax,xmm3,2 ; eax = k
; pinsrd xmm5,eax,3 ; xmm5 = [ a b c k ]
; MOVHLPS xmm6,xmm5 ; xmm6 = [ c k ? ? ] (false dependency on old xmm6)
; movq [rdx+2*r9],xmm6
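A minimal sketch of the punpckhdq version, covering both of the corresponding output rows:
movdqa    xmm5, xmm1
punpckhdq xmm5, xmm3         ; xmm5 = [ c k d l ]
movq      [rdx+2*r9], xmm5   ; store [ c k ]
movhps    [rdx+r13], xmm5    ; store [ d l ], again a pure store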
Then unpack lo/hi for xmm2 and xmm4.
Using AVX or AVX2 would let you skip the movdqa, because you can unpack into a new destination register instead of copy + destroy.
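For instance, with VEX-encoded 128-bit instructions (AVX1 is enough for this), a sketch:
vpunpckldq xmm5, xmm1, xmm3  ; xmm5 = [ a i b j ], xmm1 left intact: no movdqa needed
vmovq      [rdx], xmm5
vmovhps    [rdx+r9], xmm5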
Inserting a dword is more (if not most) efficiently done with a combination of pshufd and an immediate blend.
pshufd  xmm5, xmm3, 0x55 * slot          ; broadcast the desired source dword to all four positions
pblendw xmm1, xmm5, 0b11 << (2*dst_slot) ; pblendw's immediate selects 16-bit words, two per dword
pblendw is SSE4.1, but is of course available on Haswell. Unfortunately it only runs on port 5 on Haswell/Skylake, so it still competes with shuffles.
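For example, to insert xmm3's dword 0 into slot 1 of a copy of xmm1 (giving [ a i c d ]), a sketch:
movdqa  xmm5, xmm1
pshufd  xmm6, xmm3, 0x00     ; 0x55 * 0: broadcast xmm3[0] (= i) to all four dwords
pblendw xmm5, xmm6, 0x0C     ; 0b00001100 = words 2-3 = dword slot 1: xmm5 = [ a i c d ]
movq    [rdx], xmm5          ; store [ a i ]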
AVX2 vpblendd runs on any vector-ALU port (p0/p1/p5) on Haswell/Skylake, so it is much more efficient than word-granularity pblendw / vpblendw.
If you need to avoid AVX2, consider using SSE4.1 blendps to blend 32-bit elements with an immediate control. It runs on any port on Haswell (or p0/p5 on Sandybridge, vs. p1/p5 for shuffles), and the latency penalty for using an FP blend on integer data should be irrelevant in your case.
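A sketch of the same dword insert done with blendps, so that only the pshufd (not the blend) needs the shuffle port on Haswell:
pshufd  xmm6, xmm3, 0x00     ; broadcast the dword to insert
blendps xmm1, xmm6, 2        ; imm 0b0010: take element 1 from xmm6, the rest from xmm1 -> [ a i c d ]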