Sorry this answer is messy; it didn't all get written at once and I'm lazy. There is some duplication.
I have 4 separate ideas:
- Normal sorting, but moving the struct as a 64bit unit
- Vectorized insertion-sort as a building block for qsort
Sorting networks, with a comparator implementation using cmpps
/ blendvpd
instead of minps
/maxps
. The extra overhead might kill the speedup, though.
Sorting networks: load some structs, then shuffle/blend to get some registers of just floats and some registers of just payload. Use Timothy Furtak's technique of doing a normal minps
/maxps
comparator and then cmpeqps min,orig
-> masked xor-swap on the payload. This sorts twice as much data per comparator, but does require matching shuffles on two registers between comparators. Also requires re-interleaving when you're done (but that's easy with unpcklps / unpckhps, if you arrange your comparators so those in-lane unpacks will put the final data in the right order).
This also avoids potential slowdowns that some CPUs may have when doing FP comparisons on bit patterns in the payload that represent denormals, NaNs, or infinities, without resorting to setting the denormals-are-zero bit in MXCSR.
Furtak's paper suggests doing a scalar cleanup after getting things mostly sorted with vectors, which would reduce the amount of shuffling a lot.
Normal sorting
There's at least a small speedup to be gained when using normal sorting algorithms, by moving the whole struct around with 64bit loads/stores, and doing a scalar FP compare on the FP element. For this idea to work as well as possible, order your struct with the float value first, then you could movq
a whole struct into an xmm reg, and the float value would be in the low32 for ucomiss
. Then you (or maybe a smart compiler) could store the struct with a movq
.
Looking at the asm output that Kerrek SB linked to, compilers seem to do a rather bad job of efficiently copying structs around:
icc
seems to movzx the two uint values separately, rather than scooping up the whole struct in a 64b load. Maybe it doesn't pack the struct? gcc
5.1 doesn't seem to have that problem most of the time.
Speeding up insertion-sort
Big sorts usually divide-and-conquer with insertion sort for small-enough problems. Insertion sort copies array elements over by one, stopping only when we find we've reached the spot where the current element belongs. So we need to compare one element to a sequence of packed elements, stopping if the comparison is true for any. Do you smell vectors? I smell vectors.
# RSI points to struct { float f; uint... payload; } buf[];
# RDI points to the next element to be inserted into the sorted portion
# [ rsi to rdi ) is sorted, the rest isn't.
##### PROOF OF CONCEPT: debug / finish writing before using! ######
.new_elem:
vbroadcastsd ymm0, [rdi] # broadcast the whole struct
mov rdx, rdi
.search_loop:
sub rdx, 32
vmovups ymm1, [rdx] # load some sorted data
vcmplt_oqps ymm2, ymm0, ymm1 # all-ones in any element where ymm0[i] < ymm1[i] (FP compare, false if either is NaN).
vmovups [rdx+8], ymm1 # shuffle it over to make space, usual insertion-sort style
cmp rdx, rsi
jbe .endsearch # below-or-equal (addresses are unsigned)
movmskps eax, ymm2
test al, 0b01010101 # test only the compare results for
jz .search_loop # [rdi] wasn't less than any of the 4 elements
.endsearch:
# TODO: scalar loop to find out where the new element goes.
# All we know is that it's less than one of the elements in ymm1, but not which
add rdi, 8
vmovsd [rdx], ymm0
cmp rdi, r8 # pointer to the end of the buf
jle .new_elem
# worse alternative to movmskps / test:
# vtestps ymm2, ymm7 # where ymm7 is loaded with 1s in the odd (float) elements, and 0s in the even (payload) elements.
# vtestps is like PTEST, but only tests the high bit. If the struct was in the other order, with the float high, vtestpd against a register of all-1s would work, as that's more convenient to generate.
This is certainly full of bugs, and I should have just written it in C with intrinsics.
This is an insertion sort with probably more overhead than most, that might lose to a scalar version for very small problem sizes, due to the extra complexity of handling the first few element (don't fill a vector), and of figuring out where to put the new element after breaking out of the vector search loop that checked multiple elements.
Probably pipelining the loop so we haven't stored ymm1
until the next iteration (or after breaking out) would save a redundant store. Doing the compares in registers by shifting / shuffling them, instead of literally doing scalar load/compares would probably be a win. This could end up with way too many unpredictable branches, and I'm not seeing a nice way to end up with the high 4 packed in a reg for vmovups
, and the low one in another reg for vmovsd
.
I may have invented an insertion sort that's the worst of both worlds: slow for small arrays because of more work after breaking out of the search loop, but it's still insertion sort: slow for large arrays because of O(n^2). However, if the code outside the searchloop can be made non-horrible, this could be a useful as the small-array endpoint for qsort / mergesort.
Anyway, if anyone does develop this idea into actual debugged and working code, let us know.
update: Timothy Furtak's paper describes an SSE implementation for sorting short arrays (for use as a building block for bigger sorts, like this insertion sort). He suggests producing a partially-ordered result with SSE, and then doing a cleanup with scalar ops. (insertion-sort on a mostly-sorted array is fast.)
Which leads us to:
Sorting Networks
There might not be any speedup here. Xiaochen, Rocki, and Suda only report a 3.7x speedup from scalar -> AVX-512 for 32bit (int) elements, for single-threaded mergesort, on a Xeon Phi card. With wider elements, fewer fit in a vector reg. (That's a factor of 4 for us: 64b elements in 256b, vs. 32b elements in 512b.) They also take advantage of AVX512 masks to only compare some lanes, a feature not available in AVX. Plus, with a slower comparator function that competes for the shuffle/blend unit, we're already in worse shape.
Sorting networks can be constructed using SSE/AVX packed-compare instructions. (More usually, with a pair of min/max instructions that effectively do a set of packed 2-element sorts.) Larger sorts can be built up out of an operation that does pairwise sorts. This paper by Tian Xiaochen, Kamil Rocki and Reiji Suda at U of Tokyo has some real AVX code for sorting (without payloads), and discussion of how it's tricky with vector registers because you can't compare two elements that are in the same register (so the sorting network has to be designed to not require that). They use pshufd
to line up elements for the next comparison, to build up a larger sort out of sorting just a few registers full of data.
Now, the trick is to do a sort of pairs of packed 64b elements, based on the comparison of only half an element. (i.e. Keeping the payload with the sort key.) We could potentially sort other things this way, by sorting an array of (key, payload)
pairs, where the payload can be an index or 32bit pointer (mmap(MAP_32bit)
, or x32 ABI).
So let's build ourselves a comparator. In sorting-network parlance, that's an operation that sorts a pair of inputs. So it either swaps an elements between registers, or not.
# AVX comparator for SnB/IvB
# struct { uint16_t a, b; float f; } inputs in ymm0, ymm1
# NOTE: struct order with f second saves a shuffle to extend the mask
vcmpps ymm7, ymm0, ymm1, _CMP_LT_OQ # imm8=17: less-than, ordered, quiet (non-signalling on NaN)
# ymm7 32bit elements = 0xFFFFFFFF if ymm0[i] < ymm1[i], else 0
# vblendvpd checks the high bit of the 64b element, so mask *doesn't* need to be extended to the low32
vblendvpd ymm2, ymm1, ymm0, ymm7
vblendvpd ymm3, ymm0, ymm1, ymm7
# result: !(ymm2[i] > ymm3[i]) (i.e. ymm2[i] < ymm3[i], or they're equal or unordered (NaN).)
# UNTESTED
You might need to set the MXCSR to make sure that int bits don't slow down your FP ops if they happen to represent a denormal or NaN float. I'm not sure if that happens only for mul/div, or if it would affect compare.
- Intel Haswell: Latency: 5 cycles for
ymm2
to be ready, 7 cycles for ymm3
. Throughput: one per 4 cycles. (p5 bottleneck).
- Intel Sandybridge/Ivybridge: Latency: 5 cycles for
ymm2
to be ready, 6 cycles for ymm3
. Throughput: one per 2 cycles. (p0/p5 bottleneck).
- AMD Bulldozer/Piledriver: (
vblendvpd ymm
: 2c lat, 2c recip tput): lat: 4c for ymm2
, 6c for ymm3
. Or worse, with bypass delays between cmpps and blend. tput: one per 4c. (bottleneck on vector P1)
- AMD Steamroller: (
vblendvpd ymm
: 2c lat, 1c recip tput): lat: 4c for ymm2
, 5c for ymm3
. or maybe 1 higher because of bypass delays. tput: one per 3c (bottleneck on vector ports P0/1, for cmp and blend).
VBLENDVPD
is 2 uops. (It has 3 reg inputs, so it can't be 1 uop :/). Both uops can only run on shuffle ports. On Haswell, that's only port5. On SnB, that's p0/p5. (IDK why Haswell halved the shuffle / blend throughput compared to SnB/IvB.)
If AMD designs had 256b-wide vector units, their lower-latency FP compare and single-macro-op decoding of 3-input instructions would put them ahead.
The usual minps/maxps pair is 3 and 4 cycles latency (ymm2/3
), and one per 2 cycles throughput (Intel). (p1 bottleneck on the FP add/sub/compare unit). The most fair comparison is probably to sorting 64bit doubles. The extra latency, may hurt if there aren't multiple pairs of independent registers to be compared. The halved throughput on Haswell will cut into any speedups pretty heavily.
Also keep in mind that shuffles are needed between comparator operations to get the right elements lined up for comparison. min/maxps leave the shuffle ports unused, but my cmpps/blendv version saturates them, meaning the shuffling can't overlap with comparing, except as something to fill gaps left by data dependencies.
With hyperthreading, another thread that can keep the other ports busy (e.g. port 0/1 fp mul/add units, or integer code) would share a core quite nicely with this blend-bottlenecked version.
I attempted another version for Haswell, which does the blends "manually" using bitwise AND/OR operations. It ended up slower, though, because both sources have to get masked both ways before combining.
# AVX2 comparator for Haswell
# struct { float f; uint16_t a, b; } inputs in ymm0, ymm1
#
vcmpps ymm7, ymm0, ymm1, _CMP_LT_OQ # imm8=17: less-than, ordered, quiet (non-signalling on NaN)
# ymm7 32bit elements = 0xFFFFFFFF if ymm0[i] < ymm1[i], else 0
vshufps ymm7, ymm7, ymm7, mask(0, 0, 2, 2) # extend the mask to the payload part. There's no mask function, I just don't want to work out the result in my head.
vpand ymm10, ymm7, ymm0 # ymm10 = ymm0 keeping elements where ymm0[i] < ymm1[i]
vpandn ymm11, ymm7, ymm1 # ymm11 = ymm1 keeping elements where !(ymm0[i] < ymm1[i])
vpor ymm2, ymm10, ymm11 # ymm2 = min_packed_mystruct(ymm0, ymm1)
vpandn ymm10, ymm7, ymm0 # ymm10 = ymm0 keeping elements where !(ymm0[i] < ymm1[i])
vpand ymm11, ymm7, ymm1 # ymm11 = ymm1 keeping elements where ymm0[i] < ymm1[i]
vpor ymm3, ymm10, ymm11 # ymm2 = max_packed_mystruct(ymm0, ymm1)
# result: !(ymm2[i] > ymm3[i])
# UNTESTED
This is 8 uops, compared to 5 for the blendv version. There's a lot of parallelism in the last 6 and/andn/or instructions. cmpps
has 3 cycle latency, though. I think ymm2
will be ready in 6 cycles, while ymm3
is ready in 7. (And can overlap with operations on ymm2
). The insns following a comparator op will probably be shuffles, to put the data in the right elements for the next compare. There's no forwarding delay to/from the shuffle unit for integer-domain logicals, even for a vshufps
, but the result should come out in the FP domain, ready for a vcmpps
. Using vpand
instead of vandps
is essential for throughput.
Timothy Furtak's paper suggests an approach for sorting keys with a payload: don't pack the payload pointers with the keys, but instead generate a mask from the compare, and use it on both the keys and the payload the same way. This means you have to separate the payload from the keys either in your data structure, or every time you load a struct.
See the appendix of his paper (Fig. 12). He uses the standard min/max on the keys, and then uses cmpps
to see which elements CHANGED. Then he ANDs that mask in the middle of an xor-swap to end up only swapping the payloads for the keys that swapped.
Unfortunately, original AVX has very limited shuffling across its 128-bit halves (i.e. lanes), so it is hard to sort contents of a full 256-bit register. However, AVX2 has shuffling operations without such limitations, so we can perform a sort of 4 structs in vectorized way.
I'll use the idea of this solution. In order to sort an array we have to do enough element comparisons to surely determine the permutation we need to apply. Given that no element is NaN, it is enough to check for each pair of different elements a and b whether a < b and whether a > b. Having this information, we can fully compare any two elements, which must be enough to determine final sorting order. This is 6 pairs of 32-bit elements and two comparison modes, so we can end up doing two shuffles and two comparisons in AVX. If you are absolutely sure that all the elements are distinct, then you can avoid a > b comparisons and reduce size of LUT.
For repacking of elements within register we can use _mm256_permutevar8x32_ps
. One instruction allows to do arbitrary shuffle on 32-bit granularity. Note that in the code I assume that sorting key f
is the first member of your struct (just as @PeterCordes proposed), but you can trivially use this solution for you current struct if you change shuffling mask accordingly.
After we perform the comparisons, we have a two AVX registers containing boolean results as 32-bit masks. The first six masks in each register are important, the last two are not. Then we want to convert these masks to a small integer in general-purpose register to be used as index in a lookup table. In general case we may have to create perfect hashing for it, but it is not necessary here. We can use _mm256_movemask_ps
to get a 8-bit integer mask in general purpose register from AVX register. Since the last two masks per register are not important, we can ensure that they are always zero. Then the resulting index would be in range [0..2^12).
Finally, we load a shuffling mask from precomputed LUT with 4096 elements and pass it to _mm256_permutevar8x32_ps
. As a result we obtain an AVX register with 4 properly sorted structs of your type. Precomputing the LUT is your home assignment =)
Here is the final code:
__m256i lut[4096]; //LUT of 128Kb size must be precomputed
__m256 Sort4(__m256 val) {
__m256 aaabbcaa = _mm256_permutevar8x32_ps(val, _mm256_setr_epi32(0, 0, 0, 2, 2, 4, 0, 0));
__m256 bcdcddaa = _mm256_permutevar8x32_ps(val, _mm256_setr_epi32(2, 4, 6, 4, 6, 6, 0, 0));
__m256 cmpLt = _mm256_cmp_ps(aaabbcaa, bcdcddaa, _CMP_LT_OQ);
__m256 cmpGt = _mm256_cmp_ps(aaabbcaa, bcdcddaa, _CMP_GT_OQ);
int idxLt = _mm256_movemask_ps(cmpLt);
int idxGt = _mm256_movemask_ps(cmpGt);
__m256i shuf = lut[idxGt * 64 + idxLt];
__m256 res = _mm256_permutevar8x32_ps(val, shuf);
return res;
}
Here you can see generated assembly. There are 14 instructions in total, 2 of them are for loading constant shuffling masks, and one of them is due to useless 32-bit->64-bit conversion of movemask
results. So in a tight loop it would be 11-12 instructions. IACA says that four calls in a loop have 16.40 cycles throughput on Haswell, so it seems to achieve throughput 4.1 cycles per call.
Of course 128 Kb lookup table is too much unless you are going to process even more input data in one batch. It may be possible to reduce LUT size with adding perfect hashing (sacrificing speed of course). It is hard to say how much orderings are possible on four elements, but clearly less than 4! * 2^3 = 192. I think 256-element LUT is possible, maybe even 128-element LUT. With perfect hashing it may be faster to combine two AVX registers into one with shift and xor, then do _mm256_movemask_epi8
once (instead of doing two _mm256_movemask_ps
and combining them afterwards).