Fastest sort of fixed length 6 int array

2019-01-01 04:22发布

Answering to another Stack Overflow question (this one) I stumbled upon an interesting sub-problem. What is the fastest way to sort an array of 6 ints?

As the question is very low level:

  • we can't assume libraries are available (and the call itself has its cost), only plain C
  • to avoid emptying instruction pipeline (that has a very high cost) we should probably minimize branches, jumps, and every other kind of control flow breaking (like those hidden behind sequence points in && or ||).
  • room is constrained and minimizing registers and memory use is an issue, ideally in place sort is probably best.

Really this question is a kind of Golf where the goal is not to minimize source length but execution time. I call it 'Zening' code as used in the title of the book Zen of Code optimization by Michael Abrash and its sequels.

As for why it is interesting, there is several layers:

  • the example is simple and easy to understand and measure, not much C skill involved
  • it shows effects of choice of a good algorithm for the problem, but also effects of the compiler and underlying hardware.

Here is my reference (naive, not optimized) implementation and my test set.

#include <stdio.h>

static __inline__ int sort6(int * d){

    char j, i, imin;
    int tmp;
    for (j = 0 ; j < 5 ; j++){
        imin = j;
        for (i = j + 1; i < 6 ; i++){
            if (d[i] < d[imin]){
                imin = i;
            }
        }
        tmp = d[j];
        d[j] = d[imin];
        d[imin] = tmp;
    }
}

static __inline__ unsigned long long rdtsc(void)
{
  unsigned long long int x;
     __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
     return x;
}

int main(int argc, char ** argv){
    int i;
    int d[6][5] = {
        {1, 2, 3, 4, 5, 6},
        {6, 5, 4, 3, 2, 1},
        {100, 2, 300, 4, 500, 6},
        {100, 2, 3, 4, 500, 6},
        {1, 200, 3, 4, 5, 600},
        {1, 1, 2, 1, 2, 1}
    };

    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6 ; i++){
        sort6(d[i]);
        /*
         * printf("d%d : %d %d %d %d %d %d\n", i,
         *  d[i][0], d[i][6], d[i][7],
         *  d[i][8], d[i][9], d[i][10]);
        */
    }
    cycles = rdtsc() - cycles;
    printf("Time is %d\n", (unsigned)cycles);
}

Raw results

As number of variants is becoming large, I gathered them all in a test suite that can be found here. The actual tests used are a bit less naive than those showed above, thanks to Kevin Stock. You can compile and execute it in your own environment. I'm quite interested by behavior on different target architecture/compilers. (OK guys, put it in answers, I will +1 every contributor of a new resultset).

I gave the answer to Daniel Stutzbach (for golfing) one year ago as he was at the source of the fastest solution at that time (sorting networks).

Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O2

  • Direct call to qsort library function : 689.38
  • Naive implementation (insertion sort) : 285.70
  • Insertion Sort (Daniel Stutzbach) : 142.12
  • Insertion Sort Unrolled : 125.47
  • Rank Order : 102.26
  • Rank Order with registers : 58.03
  • Sorting Networks (Daniel Stutzbach) : 111.68
  • Sorting Networks (Paul R) : 66.36
  • Sorting Networks 12 with Fast Swap : 58.86
  • Sorting Networks 12 reordered Swap : 53.74
  • Sorting Networks 12 reordered Simple Swap : 31.54
  • Reordered Sorting Network w/ fast swap : 31.54
  • Reordered Sorting Network w/ fast swap V2 : 33.63
  • Inlined Bubble Sort (Paolo Bonzini) : 48.85
  • Unrolled Insertion Sort (Paolo Bonzini) : 75.30

Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O1

  • Direct call to qsort library function : 705.93
  • Naive implementation (insertion sort) : 135.60
  • Insertion Sort (Daniel Stutzbach) : 142.11
  • Insertion Sort Unrolled : 126.75
  • Rank Order : 46.42
  • Rank Order with registers : 43.58
  • Sorting Networks (Daniel Stutzbach) : 115.57
  • Sorting Networks (Paul R) : 64.44
  • Sorting Networks 12 with Fast Swap : 61.98
  • Sorting Networks 12 reordered Swap : 54.67
  • Sorting Networks 12 reordered Simple Swap : 31.54
  • Reordered Sorting Network w/ fast swap : 31.24
  • Reordered Sorting Network w/ fast swap V2 : 33.07
  • Inlined Bubble Sort (Paolo Bonzini) : 45.79
  • Unrolled Insertion Sort (Paolo Bonzini) : 80.15

I included both -O1 and -O2 results because surprisingly for several programs O2 is less efficient than O1. I wonder what specific optimization has this effect ?

Comments on proposed solutions

Insertion Sort (Daniel Stutzbach)

As expected minimizing branches is indeed a good idea.

Sorting Networks (Daniel Stutzbach)

Better than insertion sort. I wondered if the main effect was not get from avoiding the external loop. I gave it a try by unrolled insertion sort to check and indeed we get roughly the same figures (code is here).

Sorting Networks (Paul R)

The best so far. The actual code I used to test is here. Don't know yet why it is nearly two times as fast as the other sorting network implementation. Parameter passing ? Fast max ?

Sorting Networks 12 SWAP with Fast Swap

As suggested by Daniel Stutzbach, I combined his 12 swap sorting network with branchless fast swap (code is here). It is indeed faster, the best so far with a small margin (roughly 5%) as could be expected using 1 less swap.

It is also interesting to notice that the branchless swap seems to be much (4 times) less efficient than the simple one using if on PPC architecture.

Calling Library qsort

To give another reference point I also tried as suggested to just call library qsort (code is here). As expected it is much slower : 10 to 30 times slower... as it became obvious with the new test suite, the main problem seems to be the initial load of the library after the first call, and it compares not so poorly with other version. It is just between 3 and 20 times slower on my Linux. On some architecture used for tests by others it seems even to be faster (I'm really surprised by that one, as library qsort use a more complex API).

Rank order

Rex Kerr proposed another completely different method : for each item of the array compute directly its final position. This is efficient because computing rank order do not need branch. The drawback of this method is that it takes three times the amount of memory of the array (one copy of array and variables to store rank orders). The performance results are very surprising (and interesting). On my reference architecture with 32 bits OS and Intel Core2 Quad E8300, cycle count was slightly below 1000 (like sorting networks with branching swap). But when compiled and executed on my 64 bits box (Intel Core2 Duo) it performed much better : it became the fastest so far. I finally found out the true reason. My 32bits box use gcc 4.4.1 and my 64bits box gcc 4.4.3 and the last one seems much better at optimising this particular code (there was very little difference for other proposals).

update:

As published figures above shows this effect was still enhanced by later versions of gcc and Rank Order became consistently twice as fast as any other alternative.

Sorting Networks 12 with reordered Swap

The amazing efficiency of the Rex Kerr proposal with gcc 4.4.3 made me wonder : how could a program with 3 times as much memory usage be faster than branchless sorting networks? My hypothesis was that it had less dependencies of the kind read after write, allowing for better use of the superscalar instruction scheduler of the x86. That gave me an idea: reorder swaps to minimize read after write dependencies. More simply put: when you do SWAP(1, 2); SWAP(0, 2); you have to wait for the first swap to be finished before performing the second one because both access to a common memory cell. When you do SWAP(1, 2); SWAP(4, 5);the processor can execute both in parallel. I tried it and it works as expected, the sorting networks is running about 10% faster.

Sorting Networks 12 with Simple Swap

One year after the original post Steinar H. Gunderson suggested, that we should not try to outsmart the compiler and keep the swap code simple. It's indeed a good idea as the resulting code is about 40% faster! He also proposed a swap optimized by hand using x86 inline assembly code that can still spare some more cycles. The most surprising (it says volumes on programmer's psychology) is that one year ago none of used tried that version of swap. Code I used to test is here. Others suggested other ways to write a C fast swap, but it yields the same performances as the simple one with a decent compiler.

The "best" code is now as follow:

static inline void sort6_sorting_network_simple_swap(int * d){
#define min(x, y) (x<y?x:y)
#define max(x, y) (x<y?y:x) 
#define SWAP(x,y) { const int a = min(d[x], d[y]); \
                    const int b = max(d[x], d[y]); \
                    d[x] = a; d[y] = b; }
    SWAP(1, 2);
    SWAP(4, 5);
    SWAP(0, 2);
    SWAP(3, 5);
    SWAP(0, 1);
    SWAP(3, 4);
    SWAP(1, 4);
    SWAP(0, 3);
    SWAP(2, 5);
    SWAP(1, 3);
    SWAP(2, 4);
    SWAP(2, 3);
#undef SWAP
#undef min
#undef max
}

If we believe our test set (and, yes it is quite poor, it's mere benefit is being short, simple and easy to understand what we are measuring), the average number of cycles of the resulting code for one sort is below 40 cycles (6 tests are executed). That put each swap at an average of 4 cycles. I call that amazingly fast. Any other improvements possible ?

23条回答
与君花间醉酒
2楼-- · 2019-01-01 04:57

I stumbled onto this question from Google a few days ago because I also had a need to quickly sort a fixed length array of 6 integers. In my case however, my integers are only 8 bits (instead of 32) and I do not have a strict requirement of only using C. I thought I would share my findings anyways, in case they might be helpful to someone...

I implemented a variant of a network sort in assembly that uses SSE to vectorize the compare and swap operations, to the extent possible. It takes six "passes" to completely sort the array. I used a novel mechanism to directly convert the results of PCMPGTB (vectorized compare) to shuffle parameters for PSHUFB (vectorized swap), using only a PADDB (vectorized add) and in some cases also a PAND (bitwise AND) instruction.

This approach also had the side effect of yielding a truly branchless function. There are no jump instructions whatsoever.

It appears that this implementation is about 38% faster than the implementation which is currently marked as the fastest option in the question ("Sorting Networks 12 with Simple Swap"). I modified that implementation to use char array elements during my testing, to make the comparison fair.

I should note that this approach can be applied to any array size up to 16 elements. I expect the relative speed advantage over the alternatives to grow larger for the bigger arrays.

The code is written in MASM for x86_64 processors with SSSE3. The function uses the "new" Windows x64 calling convention. Here it is...

PUBLIC simd_sort_6

.DATA

ALIGN 16

pass1_shuffle   OWORD   0F0E0D0C0B0A09080706040503010200h
pass1_add       OWORD   0F0E0D0C0B0A09080706050503020200h
pass2_shuffle   OWORD   0F0E0D0C0B0A09080706030405000102h
pass2_and       OWORD   00000000000000000000FE00FEFE00FEh
pass2_add       OWORD   0F0E0D0C0B0A09080706050405020102h
pass3_shuffle   OWORD   0F0E0D0C0B0A09080706020304050001h
pass3_and       OWORD   00000000000000000000FDFFFFFDFFFFh
pass3_add       OWORD   0F0E0D0C0B0A09080706050404050101h
pass4_shuffle   OWORD   0F0E0D0C0B0A09080706050100020403h
pass4_and       OWORD   0000000000000000000000FDFD00FDFDh
pass4_add       OWORD   0F0E0D0C0B0A09080706050403020403h
pass5_shuffle   OWORD   0F0E0D0C0B0A09080706050201040300h
pass5_and       OWORD 0000000000000000000000FEFEFEFE00h
pass5_add       OWORD   0F0E0D0C0B0A09080706050403040300h
pass6_shuffle   OWORD   0F0E0D0C0B0A09080706050402030100h
pass6_add       OWORD   0F0E0D0C0B0A09080706050403030100h

.CODE

simd_sort_6 PROC FRAME

    .endprolog

    ; pxor xmm4, xmm4
    ; pinsrd xmm4, dword ptr [rcx], 0
    ; pinsrb xmm4, byte ptr [rcx + 4], 4
    ; pinsrb xmm4, byte ptr [rcx + 5], 5
    ; The benchmarked 38% faster mentioned in the text was with the above slower sequence that tied up the shuffle port longer.  Same on extract
    ; avoiding pins/extrb also means we don't need SSE 4.1, but SSSE3 CPUs without SSE4.1 (e.g. Conroe/Merom) have slow pshufb.
    movd    xmm4, dword ptr [rcx]
    pinsrw  xmm4,  word ptr [rcx + 4], 2  ; word 2 = bytes 4 and 5


    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass1_shuffle]
    pcmpgtb xmm5, xmm4
    paddb xmm5, oword ptr [pass1_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass2_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass2_and]
    paddb xmm5, oword ptr [pass2_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass3_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass3_and]
    paddb xmm5, oword ptr [pass3_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass4_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass4_and]
    paddb xmm5, oword ptr [pass4_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass5_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass5_and]
    paddb xmm5, oword ptr [pass5_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass6_shuffle]
    pcmpgtb xmm5, xmm4
    paddb xmm5, oword ptr [pass6_add]
    pshufb xmm4, xmm5

    ;pextrd dword ptr [rcx], xmm4, 0    ; benchmarked with this
    ;pextrb byte ptr [rcx + 4], xmm4, 4 ; slower version
    ;pextrb byte ptr [rcx + 5], xmm4, 5
    movd   dword ptr [rcx], xmm4
    pextrw  word ptr [rcx + 4], xmm4, 2  ; x86 is little-endian, so this is the right order

    ret

simd_sort_6 ENDP

END

You can compile this to an executable object and link it into your C project. For instructions on how to do this in Visual Studio, you can read this article. You can use the following C prototype to call the function from your C code:

void simd_sort_6(char *values);
查看更多
看淡一切
3楼-- · 2019-01-01 04:59

An XOR swap may be useful in your swapping functions.

void xorSwap (int *x, int *y) {
     if (*x != *y) {
         *x ^= *y;
         *y ^= *x;
         *x ^= *y;
     }
 }

The if may cause too much divergence in your code, but if you have a guarantee that all your ints are unique this could be handy.

查看更多
十年一品温如言
4楼-- · 2019-01-01 04:59

Sort 4 items with usage cmp==0. Numbers of cmp is ~4.34 (FF native have ~4.52), but take 3x time than merging lists. But better less cmp operations, if you have big numbers or big text. Edit: repaired bug

Online test http://mlich.zam.slu.cz/js-sort/x-sort-x2.htm

function sort4DG(cmp,start,end,n) // sort 4
{
var n     = typeof(n)    !=='undefined' ? n   : 1;
var cmp   = typeof(cmp)  !=='undefined' ? cmp   : sortCompare2;
var start = typeof(start)!=='undefined' ? start : 0;
var end   = typeof(end)  !=='undefined' ? end   : arr[n].length;
var count = end - start;
var pos = -1;
var i = start;
var cc = [];
// stabilni?
cc[01] = cmp(arr[n][i+0],arr[n][i+1]);
cc[23] = cmp(arr[n][i+2],arr[n][i+3]);
if (cc[01]>0) {swap(n,i+0,i+1);}
if (cc[23]>0) {swap(n,i+2,i+3);}
cc[12] = cmp(arr[n][i+1],arr[n][i+2]);
if (!(cc[12]>0)) {return n;}
cc[02] = cc[01]==0 ? cc[12] : cmp(arr[n][i+0],arr[n][i+2]);
if (cc[02]>0)
    {
    swap(n,i+1,i+2); swap(n,i+0,i+1); // bubble last to top
    cc[13] = cc[23]==0 ? cc[12] : cmp(arr[n][i+1],arr[n][i+3]);
    if (cc[13]>0)
        {
        swap(n,i+2,i+3); swap(n,i+1,i+2); // bubble
        return n;
        }
    else    {
    cc[23] = cc[23]==0 ? cc[12] : (cc[01]==0 ? cc[30] : cmp(arr[n][i+2],arr[n][i+3]));  // new cc23 | c03 //repaired
        if (cc[23]>0)
            {
            swap(n,i+2,i+3);
            return n;
            }
        return n;
        }
    }
else    {
    if (cc[12]>0)
        {
        swap(n,i+1,i+2);
        cc[23] = cc[23]==0 ? cc[12] : cmp(arr[n][i+2],arr[n][i+3]); // new cc23
        if (cc[23]>0)
            {
            swap(n,i+2,i+3);
            return n;
            }
        return n;
        }
    else    {
        return n;
        }
    }
return n;
}
查看更多
明月照影归
5楼-- · 2019-01-01 05:00

Looks like I got to the party a year late, but here we go...

Looking at the assembly generated by gcc 4.5.2 I observed that loads and stores are being done for every swap, which really isn't needed. It would be better to load the 6 values into registers, sort those, and store them back into memory. I ordered the loads at stores to be as close as possible to there the registers are first needed and last used. I also used Steinar H. Gunderson's SWAP macro. Update: I switched to Paolo Bonzini's SWAP macro which gcc converts into something similar to Gunderson's, but gcc is able to better order the instructions since they aren't given as explicit assembly.

I used the same swap order as the reordered swap network given as the best performing, although there may be a better ordering. If I find some more time I'll generate and test a bunch of permutations.

I changed the testing code to consider over 4000 arrays and show the average number of cycles needed to sort each one. On an i5-650 I'm getting ~34.1 cycles/sort (using -O3), compared to the original reordered sorting network getting ~65.3 cycles/sort (using -O1, beats -O2 and -O3).

#include <stdio.h>

static inline void sort6_fast(int * d) {
#define SWAP(x,y) { int dx = x, dy = y, tmp; tmp = x = dx < dy ? dx : dy; y ^= dx ^ tmp; }
    register int x0,x1,x2,x3,x4,x5;
    x1 = d[1];
    x2 = d[2];
    SWAP(x1, x2);
    x4 = d[4];
    x5 = d[5];
    SWAP(x4, x5);
    x0 = d[0];
    SWAP(x0, x2);
    x3 = d[3];
    SWAP(x3, x5);
    SWAP(x0, x1);
    SWAP(x3, x4);
    SWAP(x1, x4);
    SWAP(x0, x3);
    d[0] = x0;
    SWAP(x2, x5);
    d[5] = x5;
    SWAP(x1, x3);
    d[1] = x1;
    SWAP(x2, x4);
    d[4] = x4;
    SWAP(x2, x3);
    d[2] = x2;
    d[3] = x3;

#undef SWAP
#undef min
#undef max
}

static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile ("rdtsc; shlq $32, %%rdx; orq %%rdx, %0" : "=a" (x) : : "rdx");
    return x;
}

void ran_fill(int n, int *a) {
    static int seed = 76521;
    while (n--) *a++ = (seed = seed *1812433253 + 12345);
}

#define NTESTS 4096
int main() {
    int i;
    int d[6*NTESTS];
    ran_fill(6*NTESTS, d);

    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6*NTESTS ; i+=6) {
        sort6_fast(d+i);
    }
    cycles = rdtsc() - cycles;
    printf("Time is %.2lf\n", (double)cycles/(double)NTESTS);

    for (i = 0; i < 6*NTESTS ; i+=6) {
        if (d[i+0] > d[i+1] || d[i+1] > d[i+2] || d[i+2] > d[i+3] || d[i+3] > d[i+4] || d[i+4] > d[i+5])
            printf("d%d : %d %d %d %d %d %d\n", i,
                    d[i+0], d[i+1], d[i+2],
                    d[i+3], d[i+4], d[i+5]);
    }
    return 0;
}

I changed modified the test suite to also report clocks per sort and run more tests (the cmp function was updated to handle integer overflow as well), here are the results on some different architectures. I attempted testing on an AMD cpu but rdtsc isn't reliable on the X6 1100T I have available.

Clarkdale (i5-650)
==================
Direct call to qsort library function      635.14   575.65   581.61   577.76   521.12
Naive implementation (insertion sort)      538.30   135.36   134.89   240.62   101.23
Insertion Sort (Daniel Stutzbach)          424.48   159.85   160.76   152.01   151.92
Insertion Sort Unrolled                    339.16   125.16   125.81   129.93   123.16
Rank Order                                 184.34   106.58   54.74    93.24    94.09
Rank Order with registers                  127.45   104.65   53.79    98.05    97.95
Sorting Networks (Daniel Stutzbach)        269.77   130.56   128.15   126.70   127.30
Sorting Networks (Paul R)                  551.64   103.20   64.57    73.68    73.51
Sorting Networks 12 with Fast Swap         321.74   61.61    63.90    67.92    67.76
Sorting Networks 12 reordered Swap         318.75   60.69    65.90    70.25    70.06
Reordered Sorting Network w/ fast swap     145.91   34.17    32.66    32.22    32.18

Kentsfield (Core 2 Quad)
========================
Direct call to qsort library function      870.01   736.39   723.39   725.48   721.85
Naive implementation (insertion sort)      503.67   174.09   182.13   284.41   191.10
Insertion Sort (Daniel Stutzbach)          345.32   152.84   157.67   151.23   150.96
Insertion Sort Unrolled                    316.20   133.03   129.86   118.96   105.06
Rank Order                                 164.37   138.32   46.29    99.87    99.81
Rank Order with registers                  115.44   116.02   44.04    116.04   116.03
Sorting Networks (Daniel Stutzbach)        230.35   114.31   119.15   110.51   111.45
Sorting Networks (Paul R)                  498.94   77.24    63.98    62.17    65.67
Sorting Networks 12 with Fast Swap         315.98   59.41    58.36    60.29    55.15
Sorting Networks 12 reordered Swap         307.67   55.78    51.48    51.67    50.74
Reordered Sorting Network w/ fast swap     149.68   31.46    30.91    31.54    31.58

Sandy Bridge (i7-2600k)
=======================
Direct call to qsort library function      559.97   451.88   464.84   491.35   458.11
Naive implementation (insertion sort)      341.15   160.26   160.45   154.40   106.54
Insertion Sort (Daniel Stutzbach)          284.17   136.74   132.69   123.85   121.77
Insertion Sort Unrolled                    239.40   110.49   114.81   110.79   117.30
Rank Order                                 114.24   76.42    45.31    36.96    36.73
Rank Order with registers                  105.09   32.31    48.54    32.51    33.29
Sorting Networks (Daniel Stutzbach)        210.56   115.68   116.69   107.05   124.08
Sorting Networks (Paul R)                  364.03   66.02    61.64    45.70    44.19
Sorting Networks 12 with Fast Swap         246.97   41.36    59.03    41.66    38.98
Sorting Networks 12 reordered Swap         235.39   38.84    47.36    38.61    37.29
Reordered Sorting Network w/ fast swap     115.58   27.23    27.75    27.25    26.54

Nehalem (Xeon E5640)
====================
Direct call to qsort library function      911.62   890.88   681.80   876.03   872.89
Naive implementation (insertion sort)      457.69   236.87   127.68   388.74   175.28
Insertion Sort (Daniel Stutzbach)          317.89   279.74   147.78   247.97   245.09
Insertion Sort Unrolled                    259.63   220.60   116.55   221.66   212.93
Rank Order                                 140.62   197.04   52.10    163.66   153.63
Rank Order with registers                  84.83    96.78    50.93    109.96   54.73
Sorting Networks (Daniel Stutzbach)        214.59   220.94   118.68   120.60   116.09
Sorting Networks (Paul R)                  459.17   163.76   56.40    61.83    58.69
Sorting Networks 12 with Fast Swap         284.58   95.01    50.66    53.19    55.47
Sorting Networks 12 reordered Swap         281.20   96.72    44.15    56.38    54.57
Reordered Sorting Network w/ fast swap     128.34   50.87    26.87    27.91    28.02
查看更多
临风纵饮
6楼-- · 2019-01-01 05:02

Since these are integers and compares are fast, why not compute the rank order of each directly:

inline void sort6(int *d) {
  int e[6];
  memcpy(e,d,6*sizeof(int));
  int o0 = (d[0]>d[1])+(d[0]>d[2])+(d[0]>d[3])+(d[0]>d[4])+(d[0]>d[5]);
  int o1 = (d[1]>=d[0])+(d[1]>d[2])+(d[1]>d[3])+(d[1]>d[4])+(d[1]>d[5]);
  int o2 = (d[2]>=d[0])+(d[2]>=d[1])+(d[2]>d[3])+(d[2]>d[4])+(d[2]>d[5]);
  int o3 = (d[3]>=d[0])+(d[3]>=d[1])+(d[3]>=d[2])+(d[3]>d[4])+(d[3]>d[5]);
  int o4 = (d[4]>=d[0])+(d[4]>=d[1])+(d[4]>=d[2])+(d[4]>=d[3])+(d[4]>d[5]);
  int o5 = 15-(o0+o1+o2+o3+o4);
  d[o0]=e[0]; d[o1]=e[1]; d[o2]=e[2]; d[o3]=e[3]; d[o4]=e[4]; d[o5]=e[5];
}
查看更多
低头抚发
7楼-- · 2019-01-01 05:03

I know this is an old question.

But I just wrote a different kind of solution I want to share.
Using nothing but nested MIN MAX,

It's not fast as it uses 114 of each,
could reduce it to 75 pretty simply like so -> pastebin

But then it's not purely min max anymore.

What might work is doing min/max on multiple integers at once with AVX

PMINSW reference

#include <stdio.h>

static __inline__ int MIN(int a, int b){
int result =a;
__asm__ ("pminsw %1, %0" : "+x" (result) : "x" (b));
return result;
}
static __inline__ int MAX(int a, int b){
int result = a;
__asm__ ("pmaxsw %1, %0" : "+x" (result) : "x" (b));
return result;
}
static __inline__ unsigned long long rdtsc(void){
  unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" :
  "=A" (x));
  return x;
}

#define MIN3(a, b, c) (MIN(MIN(a,b),c))
#define MIN4(a, b, c, d) (MIN(MIN(a,b),MIN(c,d)))

static __inline__ void sort6(int * in) {
  const int A=in[0], B=in[1], C=in[2], D=in[3], E=in[4], F=in[5];

  in[0] = MIN( MIN4(A,B,C,D),MIN(E,F) );

  const int
  AB = MAX(A, B),
  AC = MAX(A, C),
  AD = MAX(A, D),
  AE = MAX(A, E),
  AF = MAX(A, F),
  BC = MAX(B, C),
  BD = MAX(B, D),
  BE = MAX(B, E),
  BF = MAX(B, F),
  CD = MAX(C, D),
  CE = MAX(C, E),
  CF = MAX(C, F),
  DE = MAX(D, E),
  DF = MAX(D, F),
  EF = MAX(E, F);

  in[1] = MIN4 (
  MIN4( AB, AC, AD, AE ),
  MIN4( AF, BC, BD, BE ),
  MIN4( BF, CD, CE, CF ),
  MIN3( DE, DF, EF)
  );

  const int
  ABC = MAX(AB,C),
  ABD = MAX(AB,D),
  ABE = MAX(AB,E),
  ABF = MAX(AB,F),
  ACD = MAX(AC,D),
  ACE = MAX(AC,E),
  ACF = MAX(AC,F),
  ADE = MAX(AD,E),
  ADF = MAX(AD,F),
  AEF = MAX(AE,F),
  BCD = MAX(BC,D),
  BCE = MAX(BC,E),
  BCF = MAX(BC,F),
  BDE = MAX(BD,E),
  BDF = MAX(BD,F),
  BEF = MAX(BE,F),
  CDE = MAX(CD,E),
  CDF = MAX(CD,F),
  CEF = MAX(CE,F),
  DEF = MAX(DE,F);

  in[2] = MIN( MIN4 (
  MIN4( ABC, ABD, ABE, ABF ),
  MIN4( ACD, ACE, ACF, ADE ),
  MIN4( ADF, AEF, BCD, BCE ),
  MIN4( BCF, BDE, BDF, BEF )),
  MIN4( CDE, CDF, CEF, DEF )
  );


  const int
  ABCD = MAX(ABC,D),
  ABCE = MAX(ABC,E),
  ABCF = MAX(ABC,F),
  ABDE = MAX(ABD,E),
  ABDF = MAX(ABD,F),
  ABEF = MAX(ABE,F),
  ACDE = MAX(ACD,E),
  ACDF = MAX(ACD,F),
  ACEF = MAX(ACE,F),
  ADEF = MAX(ADE,F),
  BCDE = MAX(BCD,E),
  BCDF = MAX(BCD,F),
  BCEF = MAX(BCE,F),
  BDEF = MAX(BDE,F),
  CDEF = MAX(CDE,F);

  in[3] = MIN4 (
  MIN4( ABCD, ABCE, ABCF, ABDE ),
  MIN4( ABDF, ABEF, ACDE, ACDF ),
  MIN4( ACEF, ADEF, BCDE, BCDF ),
  MIN3( BCEF, BDEF, CDEF )
  );

  const int
  ABCDE= MAX(ABCD,E),
  ABCDF= MAX(ABCD,F),
  ABCEF= MAX(ABCE,F),
  ABDEF= MAX(ABDE,F),
  ACDEF= MAX(ACDE,F),
  BCDEF= MAX(BCDE,F);

  in[4]= MIN (
  MIN4( ABCDE, ABCDF, ABCEF, ABDEF ),
  MIN ( ACDEF, BCDEF )
  );

  in[5] = MAX(ABCDE,F);
}

int main(int argc, char ** argv) {
  int d[6][6] = {
    {1, 2, 3, 4, 5, 6},
    {6, 5, 4, 3, 2, 1},
    {100, 2, 300, 4, 500, 6},
    {100, 2, 3, 4, 500, 6},
    {1, 200, 3, 4, 5, 600},
    {1, 1, 2, 1, 2, 1}
  };

  unsigned long long cycles = rdtsc();
  for (int i = 0; i < 6; i++) {
    sort6(d[i]);
  }
  cycles = rdtsc() - cycles;
  printf("Time is %d\n", (unsigned)cycles);

  for (int i = 0; i < 6; i++) {
    printf("d%d : %d %d %d %d %d %d\n", i,
     d[i][0], d[i][1], d[i][2],
     d[i][3], d[i][4], d[i][5]);
  }
}

EDIT:
Rank order solution inspired by Rex Kerr's, Much faster than the mess above

static void sort6(int *o) {
const int 
A=o[0],B=o[1],C=o[2],D=o[3],E=o[4],F=o[5];
const unsigned char
AB = A>B, AC = A>C, AD = A>D, AE = A>E,
          BC = B>C, BD = B>D, BE = B>E,
                    CD = C>D, CE = C>E,
                              DE = D>E,
a =          AB + AC + AD + AE + (A>F),
b = 1 - AB      + BC + BD + BE + (B>F),
c = 2 - AC - BC      + CD + CE + (C>F),
d = 3 - AD - BD - CD      + DE + (D>F),
e = 4 - AE - BE - CE - DE      + (E>F);
o[a]=A; o[b]=B; o[c]=C; o[d]=D; o[e]=E;
o[15-a-b-c-d-e]=F;
}
查看更多
登录 后发表回答