Transposing an 8x8 matrix can be achieved by making four 4x4 matrices and transposing each of them. This is not what I'm going for.
In another question, one answer gave a solution that would only require 24 instructions for an 8x8 matrix. However, this does not apply to floats.
Since AVX2 has 256-bit registers, each register can hold eight 32-bit integers (or floats). But the question is:
How do you transpose an 8x8 float matrix, using AVX/AVX2, with the fewest instructions possible?
Further to previous answers, the use of `shufps` is pretty much overkill in this scenario, since we can just `unpacklo`/`unpackhi` our way to the result. Shuffle and unpack instructions have the same latency/throughput; however, shuffle generates an additional byte in the machine-code op because of its immediate operand (i.e. 5 bytes for a shuffle, 4 for an unpack).
At some point, we will require 8 permutes across lanes. This is a slower operation (3 cycles latency), so we want to kick off those ops earlier if possible. Assuming the `transpose8f` method gets inlined (it should do!), any loads required for the `a`..`h` arguments should be fused into the unpack instructions.
The only minor issue you may face is that because you are using more than 8 registers here, you may spill into YMM8 and up. That can cause VEX2 ops to be generated as VEX3, which will add a byte per op.
As a result, with a bit of jiggling around, you'll end up with this:
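A sketch of `transpose8f` along these lines (the exact signature and output ordering are assumptions): eight 32-bit unpacks, eight 64-bit unpacks via `pd` casts, then the eight cross-lane permutes, for 24 shuffle-type ops in total.

```cpp
#include <immintrin.h>

// Rows a..h in; transposed rows come out through the reference parameters.
inline void transpose8f(__m256 a, __m256 b, __m256 c, __m256 d,
                        __m256 e, __m256 f, __m256 g, __m256 h,
                        __m256& s, __m256& t, __m256& u, __m256& v,
                        __m256& w, __m256& x, __m256& y, __m256& z)
{
    // 32-bit interleaves: t0 = a0 b0 a1 b1 | a4 b4 a5 b5, etc.
    const __m256 t0 = _mm256_unpacklo_ps(a, b);
    const __m256 t1 = _mm256_unpackhi_ps(a, b);
    const __m256 t2 = _mm256_unpacklo_ps(c, d);
    const __m256 t3 = _mm256_unpackhi_ps(c, d);
    const __m256 t4 = _mm256_unpacklo_ps(e, f);
    const __m256 t5 = _mm256_unpackhi_ps(e, f);
    const __m256 t6 = _mm256_unpacklo_ps(g, h);
    const __m256 t7 = _mm256_unpackhi_ps(g, h);

    // 64-bit interleaves via pd casts: u0 = a0 b0 c0 d0 | a4 b4 c4 d4, etc.
    const __m256 u0 = _mm256_castpd_ps(_mm256_unpacklo_pd(_mm256_castps_pd(t0), _mm256_castps_pd(t2)));
    const __m256 u1 = _mm256_castpd_ps(_mm256_unpackhi_pd(_mm256_castps_pd(t0), _mm256_castps_pd(t2)));
    const __m256 u2 = _mm256_castpd_ps(_mm256_unpacklo_pd(_mm256_castps_pd(t1), _mm256_castps_pd(t3)));
    const __m256 u3 = _mm256_castpd_ps(_mm256_unpackhi_pd(_mm256_castps_pd(t1), _mm256_castps_pd(t3)));
    const __m256 u4 = _mm256_castpd_ps(_mm256_unpacklo_pd(_mm256_castps_pd(t4), _mm256_castps_pd(t6)));
    const __m256 u5 = _mm256_castpd_ps(_mm256_unpackhi_pd(_mm256_castps_pd(t4), _mm256_castps_pd(t6)));
    const __m256 u6 = _mm256_castpd_ps(_mm256_unpacklo_pd(_mm256_castps_pd(t5), _mm256_castps_pd(t7)));
    const __m256 u7 = _mm256_castpd_ps(_mm256_unpackhi_pd(_mm256_castps_pd(t5), _mm256_castps_pd(t7)));

    // The 8 cross-lane permutes: combine low/high 128-bit halves.
    s = _mm256_permute2f128_ps(u0, u4, 0x20);
    t = _mm256_permute2f128_ps(u1, u5, 0x20);
    u = _mm256_permute2f128_ps(u2, u6, 0x20);
    v = _mm256_permute2f128_ps(u3, u7, 0x20);
    w = _mm256_permute2f128_ps(u0, u4, 0x31);
    x = _mm256_permute2f128_ps(u1, u5, 0x31);
    y = _mm256_permute2f128_ps(u2, u6, 0x31);
    z = _mm256_permute2f128_ps(u3, u7, 0x31);
}
```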
You won't improve on this (you can do the 128-bit permutes first and the unpacks second, but they'll end up being identical).
This is my solution, with fewer instructions, and the performance is very good: about 8 times faster. I've tested it using ICC, GCC, and Clang on Fedora.
I already answered this question in Fast memory transpose with SSE, AVX, and OpenMP.
Let me repeat the solution for transposing an 8x8 float matrix with AVX. Let me know if this is any faster than using 4x4 blocks and
_MM_TRANSPOSE4_PS
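A sketch of that solution (the function name `tran` and 32-byte-aligned, row-major buffers are assumptions; the instruction mix of 8 unpacks, 8 shuffles, and 8 cross-lane permutes matches the counts discussed below):

```cpp
#include <immintrin.h>

void tran(float* mat, float* matT) {
    __m256 r0 = _mm256_load_ps(&mat[0*8]);
    __m256 r1 = _mm256_load_ps(&mat[1*8]);
    __m256 r2 = _mm256_load_ps(&mat[2*8]);
    __m256 r3 = _mm256_load_ps(&mat[3*8]);
    __m256 r4 = _mm256_load_ps(&mat[4*8]);
    __m256 r5 = _mm256_load_ps(&mat[5*8]);
    __m256 r6 = _mm256_load_ps(&mat[6*8]);
    __m256 r7 = _mm256_load_ps(&mat[7*8]);

    // 8 unpacks: interleave 32-bit elements of row pairs.
    __m256 t0 = _mm256_unpacklo_ps(r0, r1);
    __m256 t1 = _mm256_unpackhi_ps(r0, r1);
    __m256 t2 = _mm256_unpacklo_ps(r2, r3);
    __m256 t3 = _mm256_unpackhi_ps(r2, r3);
    __m256 t4 = _mm256_unpacklo_ps(r4, r5);
    __m256 t5 = _mm256_unpackhi_ps(r4, r5);
    __m256 t6 = _mm256_unpacklo_ps(r6, r7);
    __m256 t7 = _mm256_unpackhi_ps(r6, r7);

    // 8 shuffles: gather 4-element column groups within each 128-bit lane.
    __m256 tt0 = _mm256_shuffle_ps(t0, t2, _MM_SHUFFLE(1,0,1,0));
    __m256 tt1 = _mm256_shuffle_ps(t0, t2, _MM_SHUFFLE(3,2,3,2));
    __m256 tt2 = _mm256_shuffle_ps(t1, t3, _MM_SHUFFLE(1,0,1,0));
    __m256 tt3 = _mm256_shuffle_ps(t1, t3, _MM_SHUFFLE(3,2,3,2));
    __m256 tt4 = _mm256_shuffle_ps(t4, t6, _MM_SHUFFLE(1,0,1,0));
    __m256 tt5 = _mm256_shuffle_ps(t4, t6, _MM_SHUFFLE(3,2,3,2));
    __m256 tt6 = _mm256_shuffle_ps(t5, t7, _MM_SHUFFLE(1,0,1,0));
    __m256 tt7 = _mm256_shuffle_ps(t5, t7, _MM_SHUFFLE(3,2,3,2));

    // 8 cross-lane permutes: combine the 128-bit halves into full columns.
    _mm256_store_ps(&matT[0*8], _mm256_permute2f128_ps(tt0, tt4, 0x20));
    _mm256_store_ps(&matT[1*8], _mm256_permute2f128_ps(tt1, tt5, 0x20));
    _mm256_store_ps(&matT[2*8], _mm256_permute2f128_ps(tt2, tt6, 0x20));
    _mm256_store_ps(&matT[3*8], _mm256_permute2f128_ps(tt3, tt7, 0x20));
    _mm256_store_ps(&matT[4*8], _mm256_permute2f128_ps(tt0, tt4, 0x31));
    _mm256_store_ps(&matT[5*8], _mm256_permute2f128_ps(tt1, tt5, 0x31));
    _mm256_store_ps(&matT[6*8], _mm256_permute2f128_ps(tt2, tt6, 0x31));
    _mm256_store_ps(&matT[7*8], _mm256_permute2f128_ps(tt3, tt7, 0x31));
}
```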
I used it for a kernel in a larger matrix transpose which was memory bound, so that was probably not a fair test.

Based on this comment I learned that there are more efficient methods to do the 8x8 transpose. See Examples 11-19 and 11-20 in the Intel optimization manual under section "11.11 Handling Port 5 Pressure". Example 11-19 uses the same number of instructions but reduces the pressure on port 5 by using blends, which can also go to port 0. I may implement this with intrinsics at some point, but I don't have a need for it at this point.
I looked more carefully into Examples 11-19 and 11-20 in the Intel manual I mentioned above. It turns out that Example 11-19 uses 4 more shuffle-type operations than necessary: counting its blends as shuffles, it has 8 unpacks, 12 shuffles, and 8 128-bit permutes, so my method uses 4 fewer. What they do is replace 8 of the 12 shuffles with blends, ending up with 4 shuffles and 8 blends. I doubt that's better than my method with only eight shuffles.
Example 11-20 is, however, an improvement if you need to load the matrix from memory. It uses 8 unpacks, 8 inserts, 8 shuffles, 8 128-bit loads, and 8 stores. The 128-bit loads reduce the port pressure. I went ahead and implemented this using intrinsics; the load pattern looks like the sketch below.
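A minimal sketch of one of those insert-loads, assuming a row-major `mat`:

```cpp
// Combine the low half of row 0 and the low half of row 4 into one
// YMM register, using two 128-bit loads and one insert:
__m256 r0 = _mm256_insertf128_ps(
    _mm256_castps128_ps256(_mm_load_ps(&mat[0*8])), // lane 0: row 0, cols 0-3
    _mm_load_ps(&mat[4*8]), 1);                     // lane 1: row 4, cols 0-3
```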
So I looked into Example 11-19 again. The basic idea, as far as I can tell, is that two shuffle instructions (`shufps`) can be replaced by one shuffle and two blends. For example,
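this pair (a sketch reusing `t0`, `t2`, `tt0`, `tt1` from the code above):

```cpp
tt0 = _mm256_shuffle_ps(t0, t2, _MM_SHUFFLE(1,0,1,0)); // imm8 = 0x44
tt1 = _mm256_shuffle_ps(t0, t2, _MM_SHUFFLE(3,2,3,2)); // imm8 = 0xEE
```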
can be replaced with
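this sequence, with `v` as a scratch register:

```cpp
__m256 v = _mm256_shuffle_ps(t0, t2, 0x4E); // cross the middle element pairs
tt0 = _mm256_blend_ps(t0, v, 0xCC);         // take elements 2,3 and 6,7 from v
tt1 = _mm256_blend_ps(t2, v, 0x33);         // take elements 0,1 and 4,5 from v
```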
This explains why my original code used 8 shuffles and Example 11-19 uses 4 shuffles and 8 blends.
The blends are good for throughput because shuffles only go to one port (creating a bottleneck on the shuffle port), but blends can run on multiple ports and thus don't compete. But what is better: 8 shuffles or 4 shuffles and 8 blends?
This has to be tested, and can depend on surrounding code. If you mostly bottleneck on total uop throughput with a lot of other uops in the loop that don't need port 5, you might go for the pure shuffle version. Ideally you should do some computation on the transposed data before storing it, while it's already in registers. See https://agner.org/optimize/ and other performance links in the x86 tag wiki.
I don't, however, see a way to replace the unpack instructions with blends.
Here is the full code, which combines Example 11-19 (converting 2 shuffles to 1 shuffle and two blends) with Example 11-20 (using `vinsertf128` loads, which on Intel Haswell/Skylake CPUs are 2 uops: one ALU uop for any port, plus one memory uop; unfortunately they don't micro-fuse. `vinsertf128` with all register operands is 1 uop for the shuffle port on Intel, so it's good that the compiler folds the load into a memory operand for `vinsertf128`.)
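The sketch below reconstructs that combined routine from the description (the name `Transpose8x8Insert` comes from the benchmark discussion further down; `mat` needs only 16-byte alignment, while `matT` is assumed 32-byte aligned):

```cpp
#include <immintrin.h>

void Transpose8x8Insert(float* mat, float* matT) {
    __m256 r0, r1, r2, r3, r4, r5, r6, r7;

    // Example 11-20 load pattern: each YMM gets the same 4 columns of
    // row i (low lane) and row i+4 (high lane), so no 128-bit permutes
    // are needed at the end.
    r0 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[0*8+0])), _mm_load_ps(&mat[4*8+0]), 1);
    r1 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[1*8+0])), _mm_load_ps(&mat[5*8+0]), 1);
    r2 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[2*8+0])), _mm_load_ps(&mat[6*8+0]), 1);
    r3 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[3*8+0])), _mm_load_ps(&mat[7*8+0]), 1);
    r4 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[0*8+4])), _mm_load_ps(&mat[4*8+4]), 1);
    r5 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[1*8+4])), _mm_load_ps(&mat[5*8+4]), 1);
    r6 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[2*8+4])), _mm_load_ps(&mat[6*8+4]), 1);
    r7 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[3*8+4])), _mm_load_ps(&mat[7*8+4]), 1);

    __m256 t0 = _mm256_unpacklo_ps(r0, r1);
    __m256 t1 = _mm256_unpackhi_ps(r0, r1);
    __m256 t2 = _mm256_unpacklo_ps(r2, r3);
    __m256 t3 = _mm256_unpackhi_ps(r2, r3);
    __m256 t4 = _mm256_unpacklo_ps(r4, r5);
    __m256 t5 = _mm256_unpackhi_ps(r4, r5);
    __m256 t6 = _mm256_unpacklo_ps(r6, r7);
    __m256 t7 = _mm256_unpackhi_ps(r6, r7);

    // Example 11-19 substitution: 1 shuffle + 2 blends per shuffle pair.
    __m256 v;
    v  = _mm256_shuffle_ps(t0, t2, 0x4E);
    r0 = _mm256_blend_ps(t0, v, 0xCC);
    r1 = _mm256_blend_ps(t2, v, 0x33);
    v  = _mm256_shuffle_ps(t1, t3, 0x4E);
    r2 = _mm256_blend_ps(t1, v, 0xCC);
    r3 = _mm256_blend_ps(t3, v, 0x33);
    v  = _mm256_shuffle_ps(t4, t6, 0x4E);
    r4 = _mm256_blend_ps(t4, v, 0xCC);
    r5 = _mm256_blend_ps(t6, v, 0x33);
    v  = _mm256_shuffle_ps(t5, t7, 0x4E);
    r6 = _mm256_blend_ps(t5, v, 0xCC);
    r7 = _mm256_blend_ps(t7, v, 0x33);

    _mm256_store_ps(&matT[0*8], r0);
    _mm256_store_ps(&matT[1*8], r1);
    _mm256_store_ps(&matT[2*8], r2);
    _mm256_store_ps(&matT[3*8], r3);
    _mm256_store_ps(&matT[4*8], r4);
    _mm256_store_ps(&matT[5*8], r5);
    _mm256_store_ps(&matT[6*8], r6);
    _mm256_store_ps(&matT[7*8], r7);
}
```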
This has the advantage of only needing the source data 16-byte aligned for maximum performance, avoiding any cache-line splits.

I decided to do a full test of 3 different routines in an apples-to-apples comparison.
This is benchmarking latency, not throughput (because the output for one transpose is the input for the next), but it probably bottlenecks on shuffle throughput anyway.
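A minimal sketch of such a latency-chained timing loop (it assumes the `tran` routine from the sketch above; `REPEAT` is an arbitrary repeat count):

```cpp
#include <stdio.h>
#include <time.h>

#define REPEAT 10000000

void tran(float* mat, float* matT); // the transpose routine sketched above

float a[64] __attribute__((aligned(32)));
float b[64] __attribute__((aligned(32)));

int main(void) {
    for (int i = 0; i < 64; i++) a[i] = (float)i;
    clock_t start = clock();
    for (int i = 0; i < REPEAT; i += 2) {
        tran(a, b); // the output of one transpose...
        tran(b, a); // ...is the input of the next: a latency-bound chain
    }
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    volatile float dummy = a[2]; // keep the transposes from optimizing away
    (void)dummy;
    printf("%.3f s\n", secs);
    return 0;
}
```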
Results on Skylake i7-6700k @ 3.9GHz for a modified version of the above code (see it on the Godbolt compiler explorer), fixing the following bugs:

- `printf` outside the timed regions, before starting the `clock()`
- a `volatile dummy = in[2]` at the end so all the transposes don't optimize away (which gcc actually does otherwise)
- `alignas(32)` instead of `__declspec`, and don't include `stdafx.h`

(I didn't fix the unnecessary mixing of `__m256i*` / `__m256*`, and I didn't check if that led to worse code-gen with gcc or clang. I also didn't use a `std::chrono` high-rez clock because `clock()` was accurate enough for this many repeats.)

g++ 7.3 `-O3 -march=native` on Arch Linux: Z Boson's version is fastest.

clang++ 5.0.1 `-O3 -march=native`: `8x8Permute` gets optimized to something even faster than anything gcc did, but `8x8Insert` is pessimized horribly.

The asm instructions generated from the source won't match the intrinsics exactly: clang especially has a shuffle optimizer that compiles your shuffles the same way it optimizes scalar code like `+` on integers. `Transpose8x8Insert` should not be that much slower, so clang must have chosen poorly.

Here's an AVX2 solution which works for 8 x 8 32-bit ints. You can of course cast float vectors to int and back if you want to transpose 8 x 8 floats. It might also be possible to do an AVX-only version (i.e. not requiring AVX2) just for floats, but I haven't tried that yet.
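A sketch of one way to do it, using AVX2 integer unpacks plus a final cross-lane permute (this is a reconstruction under an assumed row-major layout, not necessarily the original answer's exact instruction sequence), with a tiny test harness:

```cpp
#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>

// Transpose an 8x8 matrix of 32-bit ints; in/out are row-major and
// assumed 32-byte aligned. All the 256-bit integer ops here (the
// unpacks and the permute) require AVX2.
static void Transpose_8_8(const int32_t* in, int32_t* out)
{
    __m256i r0 = _mm256_load_si256((const __m256i*)&in[0*8]);
    __m256i r1 = _mm256_load_si256((const __m256i*)&in[1*8]);
    __m256i r2 = _mm256_load_si256((const __m256i*)&in[2*8]);
    __m256i r3 = _mm256_load_si256((const __m256i*)&in[3*8]);
    __m256i r4 = _mm256_load_si256((const __m256i*)&in[4*8]);
    __m256i r5 = _mm256_load_si256((const __m256i*)&in[5*8]);
    __m256i r6 = _mm256_load_si256((const __m256i*)&in[6*8]);
    __m256i r7 = _mm256_load_si256((const __m256i*)&in[7*8]);

    // 32-bit interleave of adjacent rows.
    __m256i t0 = _mm256_unpacklo_epi32(r0, r1);
    __m256i t1 = _mm256_unpackhi_epi32(r0, r1);
    __m256i t2 = _mm256_unpacklo_epi32(r2, r3);
    __m256i t3 = _mm256_unpackhi_epi32(r2, r3);
    __m256i t4 = _mm256_unpacklo_epi32(r4, r5);
    __m256i t5 = _mm256_unpackhi_epi32(r4, r5);
    __m256i t6 = _mm256_unpacklo_epi32(r6, r7);
    __m256i t7 = _mm256_unpackhi_epi32(r6, r7);

    // 64-bit interleave: four consecutive output elements per 128-bit lane.
    __m256i u0 = _mm256_unpacklo_epi64(t0, t2);
    __m256i u1 = _mm256_unpackhi_epi64(t0, t2);
    __m256i u2 = _mm256_unpacklo_epi64(t1, t3);
    __m256i u3 = _mm256_unpackhi_epi64(t1, t3);
    __m256i u4 = _mm256_unpacklo_epi64(t4, t6);
    __m256i u5 = _mm256_unpackhi_epi64(t4, t6);
    __m256i u6 = _mm256_unpacklo_epi64(t5, t7);
    __m256i u7 = _mm256_unpackhi_epi64(t5, t7);

    // Cross-lane 128-bit combines to finish each output row.
    _mm256_store_si256((__m256i*)&out[0*8], _mm256_permute2x128_si256(u0, u4, 0x20));
    _mm256_store_si256((__m256i*)&out[1*8], _mm256_permute2x128_si256(u1, u5, 0x20));
    _mm256_store_si256((__m256i*)&out[2*8], _mm256_permute2x128_si256(u2, u6, 0x20));
    _mm256_store_si256((__m256i*)&out[3*8], _mm256_permute2x128_si256(u3, u7, 0x20));
    _mm256_store_si256((__m256i*)&out[4*8], _mm256_permute2x128_si256(u0, u4, 0x31));
    _mm256_store_si256((__m256i*)&out[5*8], _mm256_permute2x128_si256(u1, u5, 0x31));
    _mm256_store_si256((__m256i*)&out[6*8], _mm256_permute2x128_si256(u2, u6, 0x31));
    _mm256_store_si256((__m256i*)&out[7*8], _mm256_permute2x128_si256(u3, u7, 0x31));
}

int main(void)
{
    int32_t in[64] __attribute__((aligned(32)));
    int32_t out[64] __attribute__((aligned(32)));
    for (int i = 0; i < 64; i++) in[i] = i;
    Transpose_8_8(in, out);
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) printf("%3d ", (int)out[r*8 + c]);
        printf("\n");
    }
    return 0;
}
```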
`Transpose_8_8` compiles to around 56 instructions with clang, including loads and stores; I think it should be possible to improve on this with some more effort.

Compile and test:
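For example, assuming the sketch above is saved as `transpose_8_8.c`:

```
$ gcc -O3 -mavx2 transpose_8_8.c -o transpose_8_8 && ./transpose_8_8
```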