Find 4 minimal values in 4 __m256d registers

Posted 2019-07-14 02:11

Question:

I cannot figure out how to implement:

__m256d min(__m256d A, __m256d B, __m256d C, __m256d D)
{
    __m256d result;

    // result should contain 4 minimal values out of 16 : A[0], A[1], A[2], A[3], B[0], ... , D[3]
    // moreover it should be result[0] <= result[1] <= result[2] <= result[3]

     return result;
}

Any ideas on how to use _mm256_min_pd, _mm256_max_pd and shuffles/permutes in a smart way?

==================================================

This is where I got so far; after:

    // odd-even transposition sort across the four registers: after these
    // six compare-exchanges, the registers are sorted element-wise
    __m256d T = _mm256_min_pd(A, B);
    __m256d Q = _mm256_max_pd(A, B);
    A = T; B = Q;
    T = _mm256_min_pd(C, D);
    Q = _mm256_max_pd(C, D);
    C = T; D = Q;
    T = _mm256_min_pd(B, C);
    Q = _mm256_max_pd(B, C);
    B = T; C = Q;
    T = _mm256_min_pd(A, B);
    Q = _mm256_max_pd(A, B);
    A = T; B = Q;
    T = _mm256_min_pd(C, D);
    Q = _mm256_max_pd(C, D);
    C = T; D = Q;
    T = _mm256_min_pd(B, C);
    Q = _mm256_max_pd(B, C);
    B = T; C = Q;

we have: A[0] <= B[0] <= C[0] <= D[0], A[1] <= B[1] <= C[1] <= D[1], A[2] <= B[2] <= C[2] <= D[2], A[3] <= B[3] <= C[3] <= D[3],

so the minimal value is among A's, second minimal is among A's or B's, ... Not sure where to go from there ...

========================================================

My second idea is that the problem is reducible to itself, but with two input __m256d vectors. If this can be done, then just do min4(A,B) --> P, min4(C,D) --> Q, min4(P,Q) --> return value.

No idea how to do that for two vectors though :)
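
Spelled out, that reduction would be (min4_2 is the hypothetical two-vector helper I don't know how to write yet -- it should return the 4 smallest of the 8 values in its two arguments):

__m256d min(__m256d A, __m256d B, __m256d C, __m256d D)
{
    __m256d P = min4_2(A, B);   // 4 smallest of A's and B's (hypothetical helper)
    __m256d Q = min4_2(C, D);   // 4 smallest of C's and D's
    return min4_2(P, Q);        // 4 smallest of all 16
}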

=======================================================================

Update 2: problem almost solved -- the following function computes the 4 minimal values.

__m256d min4(__m256d A, __m256d B, __m256d C, __m256d D)
{
    __m256d T;

    // Reduce A,B to the 4 smallest of their 8 values.
    T = _mm256_min_pd(A, B);                // element-wise smaller of each pair
    B = _mm256_max_pd(A, B);                // element-wise larger of each pair
    B = _mm256_permute_pd(B, 0x5);          // swap neighbours within each 128-bit lane
    A = _mm256_min_pd(T, B);
    B = _mm256_max_pd(T, B);
    B = _mm256_permute2f128_pd(B, B, 0x1);  // swap the two 128-bit lanes
    T = _mm256_min_pd(A, B);
    B = _mm256_max_pd(A, B);
    B = _mm256_permute_pd(B, 0x5);
    A = _mm256_min_pd(T, B);                // A now holds the 4 smallest of A,B

    // Same reduction for C,D.
    T = _mm256_min_pd(C, D);
    D = _mm256_max_pd(C, D);
    D = _mm256_permute_pd(D, 0x5);
    C = _mm256_min_pd(T, D);
    D = _mm256_max_pd(T, D);
    D = _mm256_permute2f128_pd(D, D, 0x1);
    T = _mm256_min_pd(C, D);
    D = _mm256_max_pd(C, D);
    D = _mm256_permute_pd(D, 0x5);
    C = _mm256_min_pd(T, D);                // C now holds the 4 smallest of C,D

    // Final reduction: 4 smallest of the 8 surviving candidates.
    T = _mm256_min_pd(A, C);
    C = _mm256_max_pd(A, C);
    C = _mm256_permute_pd(C, 0x5);
    A = _mm256_min_pd(T, C);
    C = _mm256_max_pd(T, C);
    C = _mm256_permute2f128_pd(C, C, 0x1);
    T = _mm256_min_pd(A, C);
    C = _mm256_max_pd(A, C);
    C = _mm256_permute_pd(C, 0x5);
    A = _mm256_min_pd(T, C);

    return A;
}

All that remains is to sort the values in increasing order inside A before returning.
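
Something like this should handle that last step (untested sketch; sort4 is a made-up name) -- a 3-stage bitonic network: sort the two in-lane pairs in opposite directions, then merge with min/max plus blends:

static inline __m256d sort4(__m256d V)
{
    __m256d S, mn, mx;

    // Stage 1: sort pair (0,1) ascending and pair (2,3) descending,
    // leaving a bitonic sequence in V.
    S  = _mm256_permute_pd(V, 0x5);          // [v1, v0, v3, v2]
    mn = _mm256_min_pd(V, S);
    mx = _mm256_max_pd(V, S);
    V  = _mm256_blend_pd(mn, mx, 0x6);       // [min01, max01, max23, min23]

    // Stage 2: bitonic merge, compare-exchange (0,2) and (1,3).
    S  = _mm256_permute2f128_pd(V, V, 0x1);  // swap the 128-bit halves
    mn = _mm256_min_pd(V, S);
    mx = _mm256_max_pd(V, S);
    V  = _mm256_blend_pd(mn, mx, 0xC);       // mins in low half, maxes in high

    // Stage 3: compare-exchange (0,1) and (2,3).
    S  = _mm256_permute_pd(V, 0x5);
    mn = _mm256_min_pd(V, S);
    mx = _mm256_max_pd(V, S);
    return _mm256_blend_pd(mn, mx, 0xA);     // fully ascending
}

Each stage is one shuffle, one min, one max and one blend; only stage 2 crosses the 128-bit lanes.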

Answer 1:

It might be best to do some SIMD compares to reduce to 8 or 4 (like you have now) candidates, and then unpack to scalar doubles in vector registers. This doesn't have to involve a memory round-trip: vextractf128 the high half (_mm256_extractf128_pd), and cast the low half. Maybe use movhlps (_mm_movehl_ps) to get the high half of a __m128 down to the low half (although on CPUs with AVX, you only save a code byte or two from using that instead of a shuffle with an immediate; it's not faster like it is on some old CPUs).
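
For example, a shuffle-only unpack might look like this (just a sketch; unpack4 is a name I made up):

#include <immintrin.h>

// Get all 4 doubles of a __m256d into scalars without touching memory.
static inline void unpack4(__m256d v, double *d0, double *d1, double *d2, double *d3)
{
    __m128d lo = _mm256_castpd256_pd128(v);        // low half: free, just a cast
    __m128d hi = _mm256_extractf128_pd(v, 1);      // vextractf128
    *d0 = _mm_cvtsd_f64(lo);                       // element 0 is already in place
    *d1 = _mm_cvtsd_f64(_mm_unpackhi_pd(lo, lo));  // bring the high element down
    *d2 = _mm_cvtsd_f64(hi);
    *d3 = _mm_cvtsd_f64(_mm_unpackhi_pd(hi, hi));
}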

IDK whether unpacking with shuffles or simply storing is the way to go. Maybe a mix of both, to keep both the shuffle ports and the store/load ports busy, would do well. Obviously the low double in every vector is already in place as a scalar, so that's one you don't have to load. (And compilers are bad at seeing through a store-and-reload to take advantage of this, even to a local array.)

Even without narrowing down the set of candidates very much, some SIMD comparisons before unpacking could reduce the number of swaps/shuffles the branchy scalar code would otherwise need, reducing branch-mispredict penalties.


As I described in comments on Paul R's answer, in scalar code you'd probably do well with an insertion-sort type of algorithm. But it's more like a priority queue: only insert into the first 4 elements. If a new candidate is greater than the biggest existing candidate, just move on. Otherwise insertion-sort it into the list of 4 candidates which you maintain in sorted order.
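
The insert step might look something like this sketch (hypothetical helper; best[0..3] holds the current 4 smallest, sorted ascending):

// Maintain the 4 smallest values seen so far in best[0..3], sorted ascending.
static void insert_candidate(double best[4], double x)
{
    if (x >= best[3])
        return;                        // common case: not among the 4 smallest
    int i = 3;
    while (i > 0 && best[i-1] > x) {   // shift larger entries up one slot
        best[i] = best[i-1];
        --i;
    }
    best[i] = x;                       // drop it into its sorted position
}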


I found a really nice paper on SIMD sorting networks, with specific discussion of AVX. They go into detail about the required shuffles when using SIMD packed-min / packed-max instructions to sort a couple vector-registers of data. They even use intrinsics like _mm512_shuffle_epi32 in their examples. They say their results are applicable to AVX, even though they use AVX-512 mask registers in their examples.

It's only the last bit of the paper where they talk about merging to use the small sort as a building block for a big parallel sort. I can't find their actual code anywhere, so maybe they never published the full implementation they benchmarked to make their graphs. :(

BTW, I wrote a previous answer with some not-very-great ideas about sorting 64-bit structs by a float member, but that's not really applicable here since I was only addressing the complications of dealing with a payload (which you don't have).


I don't have time right now to finish this answer, so I'll just post a summary of my idea:

Adapt the 2-register method from that paper to AVX (or AVX2). I'm not sure how best to emulate their AVX512 masked min/max instructions. :/ I may update this later. You might want to email the authors and ask about the code they used to benchmark the desktop CPU.
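
One plausible way to emulate a masked min/max on plain AVX is min + max + an immediate blend, with the blend immediate standing in for the mask register (the direction pattern below is only an example):

#include <immintrin.h>

// Sketch: one compare-exchange step with min kept in slots 0,1 and max in
// slots 2,3. AVX-512 would encode that direction in a mask register; here
// it's baked into the blend immediate (bits set -> take the max in that slot).
static inline __m256d cmpexg_min_min_max_max(__m256d a, __m256d b)
{
    __m256d mn = _mm256_min_pd(a, b);
    __m256d mx = _mm256_max_pd(a, b);
    return _mm256_blend_pd(mn, mx, 0xC);
}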

Anyway, use the 2-register function on pairs of regs, to reduce from 4 to 2 regs, then again to reduce to 1 reg. Unlike your version, theirs produces a fully-sorted output register.

Trying to avoid cross-lane shuffles whenever possible may be tricky. I'm not sure if you can gain anything from using shufpd (__m256d _mm256_shuffle_pd (__m256d a, __m256d b, const int select);) to combine data from two source regs while shuffling. The 256b version can do a different shuffle on each lane, using 4 bits of the imm8 instead of just 2.
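
To make the imm8 semantics concrete (my reading of the intrinsic, with a made-up example):

// In-lane shufpd combining two sources: each imm8 bit picks the low (0) or
// high (1) element, alternating between a and b, independently per lane:
//   dst[0] = imm[0] ? a[1] : a[0];   dst[1] = imm[1] ? b[1] : b[0];
//   dst[2] = imm[2] ? a[3] : a[2];   dst[3] = imm[3] ? b[3] : b[2];
static inline __m256d shufpd_demo(__m256d a, __m256d b)
{
    return _mm256_shuffle_pd(a, b, 0x6);   // = { a[0], b[1], a[3], b[2] }
}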

It's an interesting problem, but I unfortunately shouldn't take the time to write up a full solution myself. If I had time, I'd like to compare an insertion-sort priority queue and a sorting-network fully-unrolled implementation of the same pqueue, on 4, 8, 12, and 16 elements each. (different levels of SIMD sorting network before going scalar).

Links I've found:

  • Another paper on SSE sorting, with some code using palignr to merge two individually sorted vectors into a sorted 8-element pair of vectors. Not directly applicable to 256b vectors of doubles.
  • Another paper on SSE sorting, with test results from Core2 CPUs. It uses an inefficient blend/blend/shuffle for shuffling between two vectors, because of the limitations of shufps. In-lane shufpd has slightly different restrictions. This paper is probably worth looking at carefully. They have algorithms that work on real SSE vectors, with the available shuffle operations.
  • Xiaochen, Rocki, and Suda's paper linked already
  • What looks like a chapter on sorting networks from the CLR algorithms textbook.
  • Sorting network generator, with a choice of algorithms. Not SIMD-specific
  • https://en.wikipedia.org/wiki/Bitonic_sorter: the example diagram has a 16-input sorting network. Bitonic sorts use patterns that can be vectorized with SIMD to some degree. The upper part of the end of the network can be omitted, since we only care about the order of the lowest 4.
  • Fastest sort of fixed length 6 int array. A popular question with many answers.


Answer 2:

This is a purely "horizontal" operation and not really suited to SIMD - I suspect it will be quicker to just stash the four vectors in memory, sort the 16 values, and then load the first four into your result vector:

#include <algorithm>    // std::partial_sort
#include <immintrin.h>

__m256d min(__m256d A, __m256d B, __m256d C, __m256d D)
{
    double buff[16] __attribute__ ((aligned(32)));

    // spill all 16 candidates to a scratch buffer
    _mm256_store_pd(&buff[0], A);
    _mm256_store_pd(&buff[4], B);
    _mm256_store_pd(&buff[8], C);
    _mm256_store_pd(&buff[12], D);

    // sort just the first four positions
    std::partial_sort(buff, buff + 4, buff + 16);

    return _mm256_load_pd(&buff[0]);
}

For improved performance you could implement an inline custom sort routine which is hard-coded for 16 elements.
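
For example, something along these lines (an untested sketch; the name is made up, and unlike std::partial_sort it discards displaced values rather than moving them to the tail, which is fine since only buff[0..3] gets loaded afterwards):

#include <utility>   // std::swap

// Put the 4 smallest of buff[0..15] into buff[0..3], sorted ascending.
static inline void partial_sort_4_of_16(double buff[16])
{
    auto cswap = [](double &x, double &y) { if (y < x) std::swap(x, y); };

    // Sort buff[0..3] with a 5-comparator sorting network.
    cswap(buff[0], buff[1]); cswap(buff[2], buff[3]);
    cswap(buff[0], buff[2]); cswap(buff[1], buff[3]);
    cswap(buff[1], buff[2]);

    // Scan the other 12 values, inserting any that beat the current 4th.
    for (int i = 4; i < 16; ++i) {
        double x = buff[i];
        if (x >= buff[3])
            continue;                  // common case: not a top-4 candidate
        int j = 3;
        while (j > 0 && buff[j-1] > x) {
            buff[j] = buff[j-1];       // shift larger entries up
            --j;
        }
        buff[j] = x;
    }
}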