Fastest sort of fixed length 6 int array

Answering to another Stack Overflow question (this one) I stumbled upon an interesting sub-problem. What is the fastest way to sort an array of 6 ints?

As the question is very low level:

we can't assume libraries are available (and the call itself has its cost), only plain C
to avoid emptying instruction pipeline (that has a very high cost) we should probably minimize branches, jumps, and every other kind of control flow breaking (like those hidden behind sequence points in && or ||).
room is constrained and minimizing registers and memory use is an issue, ideally in place sort is probably best.

Really this question is a kind of Golf where the goal is not to minimize source length but execution time. I call it 'Zening' code as used in the title of the book Zen of Code optimization by Michael Abrash and its sequels.

As for why it is interesting, there is several layers:

the example is simple and easy to understand and measure, not much C skill involved
it shows effects of choice of a good algorithm for the problem, but also effects of the compiler and underlying hardware.

Here is my reference (naive, not optimized) implementation and my test set.

#include <stdio.h>

static __inline__ int sort6(int * d){

    char j, i, imin;
    int tmp;
    for (j = 0 ; j < 5 ; j++){
        imin = j;
        for (i = j + 1; i < 6 ; i++){
            if (d[i] < d[imin]){
                imin = i;
            }
        }
        tmp = d[j];
        d[j] = d[imin];
        d[imin] = tmp;
    }
}

static __inline__ unsigned long long rdtsc(void)
{
  unsigned long long int x;
     __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
     return x;
}

int main(int argc, char ** argv){
    int i;
    int d[6][5] = {
        {1, 2, 3, 4, 5, 6},
        {6, 5, 4, 3, 2, 1},
        {100, 2, 300, 4, 500, 6},
        {100, 2, 3, 4, 500, 6},
        {1, 200, 3, 4, 5, 600},
        {1, 1, 2, 1, 2, 1}
    };

    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6 ; i++){
        sort6(d[i]);
        /*
         * printf("d%d : %d %d %d %d %d %d\n", i,
         *  d[i][0], d[i][6], d[i][7],
         *  d[i][8], d[i][9], d[i][10]);
        */
    }
    cycles = rdtsc() - cycles;
    printf("Time is %d\n", (unsigned)cycles);
}

Raw results

As number of variants is becoming large, I gathered them all in a test suite that can be found here. The actual tests used are a bit less naive than those showed above, thanks to Kevin Stock. You can compile and execute it in your own environment. I'm quite interested by behavior on different target architecture/compilers. (OK guys, put it in answers, I will +1 every contributor of a new resultset).

I gave the answer to Daniel Stutzbach (for golfing) one year ago as he was at the source of the fastest solution at that time (sorting networks).

Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O2

Direct call to qsort library function : 689.38
Naive implementation (insertion sort) : 285.70
Insertion Sort (Daniel Stutzbach) : 142.12
Insertion Sort Unrolled : 125.47
Rank Order : 102.26
Rank Order with registers : 58.03
Sorting Networks (Daniel Stutzbach) : 111.68
Sorting Networks (Paul R) : 66.36
Sorting Networks 12 with Fast Swap : 58.86
Sorting Networks 12 reordered Swap : 53.74
Sorting Networks 12 reordered Simple Swap : 31.54
Reordered Sorting Network w/ fast swap : 31.54
Reordered Sorting Network w/ fast swap V2 : 33.63
Inlined Bubble Sort (Paolo Bonzini) : 48.85
Unrolled Insertion Sort (Paolo Bonzini) : 75.30

Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O1

Direct call to qsort library function : 705.93
Naive implementation (insertion sort) : 135.60
Insertion Sort (Daniel Stutzbach) : 142.11
Insertion Sort Unrolled : 126.75
Rank Order : 46.42
Rank Order with registers : 43.58
Sorting Networks (Daniel Stutzbach) : 115.57
Sorting Networks (Paul R) : 64.44
Sorting Networks 12 with Fast Swap : 61.98
Sorting Networks 12 reordered Swap : 54.67
Sorting Networks 12 reordered Simple Swap : 31.54
Reordered Sorting Network w/ fast swap : 31.24
Reordered Sorting Network w/ fast swap V2 : 33.07
Inlined Bubble Sort (Paolo Bonzini) : 45.79
Unrolled Insertion Sort (Paolo Bonzini) : 80.15

I included both -O1 and -O2 results because surprisingly for several programs O2 is less efficient than O1. I wonder what specific optimization has this effect ?

Comments on proposed solutions

Insertion Sort (Daniel Stutzbach)

As expected minimizing branches is indeed a good idea.

Sorting Networks (Daniel Stutzbach)

Better than insertion sort. I wondered if the main effect was not get from avoiding the external loop. I gave it a try by unrolled insertion sort to check and indeed we get roughly the same figures (code is here).

Sorting Networks (Paul R)

The best so far. The actual code I used to test is here. Don't know yet why it is nearly two times as fast as the other sorting network implementation. Parameter passing ? Fast max ?

Sorting Networks 12 SWAP with Fast Swap

As suggested by Daniel Stutzbach, I combined his 12 swap sorting network with branchless fast swap (code is here). It is indeed faster, the best so far with a small margin (roughly 5%) as could be expected using 1 less swap.

It is also interesting to notice that the branchless swap seems to be much (4 times) less efficient than the simple one using if on PPC architecture.

Calling Library qsort

To give another reference point I also tried as suggested to just call library qsort (code is here). As expected it is much slower : 10 to 30 times slower... as it became obvious with the new test suite, the main problem seems to be the initial load of the library after the first call, and it compares not so poorly with other version. It is just between 3 and 20 times slower on my Linux. On some architecture used for tests by others it seems even to be faster (I'm really surprised by that one, as library qsort use a more complex API).

Rank order

Rex Kerr proposed another completely different method : for each item of the array compute directly its final position. This is efficient because computing rank order do not need branch. The drawback of this method is that it takes three times the amount of memory of the array (one copy of array and variables to store rank orders). The performance results are very surprising (and interesting). On my reference architecture with 32 bits OS and Intel Core2 Quad E8300, cycle count was slightly below 1000 (like sorting networks with branching swap). But when compiled and executed on my 64 bits box (Intel Core2 Duo) it performed much better : it became the fastest so far. I finally found out the true reason. My 32bits box use gcc 4.4.1 and my 64bits box gcc 4.4.3 and the last one seems much better at optimising this particular code (there was very little difference for other proposals).

update:

As published figures above shows this effect was still enhanced by later versions of gcc and Rank Order became consistently twice as fast as any other alternative.

Sorting Networks 12 with reordered Swap

The amazing efficiency of the Rex Kerr proposal with gcc 4.4.3 made me wonder : how could a program with 3 times as much memory usage be faster than branchless sorting networks? My hypothesis was that it had less dependencies of the kind read after write, allowing for better use of the superscalar instruction scheduler of the x86. That gave me an idea: reorder swaps to minimize read after write dependencies. More simply put: when you do SWAP(1, 2); SWAP(0, 2); you have to wait for the first swap to be finished before performing the second one because both access to a common memory cell. When you do SWAP(1, 2); SWAP(4, 5);the processor can execute both in parallel. I tried it and it works as expected, the sorting networks is running about 10% faster.

Sorting Networks 12 with Simple Swap

One year after the original post Steinar H. Gunderson suggested, that we should not try to outsmart the compiler and keep the swap code simple. It's indeed a good idea as the resulting code is about 40% faster! He also proposed a swap optimized by hand using x86 inline assembly code that can still spare some more cycles. The most surprising (it says volumes on programmer's psychology) is that one year ago none of used tried that version of swap. Code I used to test is here. Others suggested other ways to write a C fast swap, but it yields the same performances as the simple one with a decent compiler.

The "best" code is now as follow:

static inline void sort6_sorting_network_simple_swap(int * d){
#define min(x, y) (x<y?x:y)
#define max(x, y) (x<y?y:x) 
#define SWAP(x,y) { const int a = min(d[x], d[y]); \
                    const int b = max(d[x], d[y]); \
                    d[x] = a; d[y] = b; }
    SWAP(1, 2);
    SWAP(4, 5);
    SWAP(0, 2);
    SWAP(3, 5);
    SWAP(0, 1);
    SWAP(3, 4);
    SWAP(1, 4);
    SWAP(0, 3);
    SWAP(2, 5);
    SWAP(1, 3);
    SWAP(2, 4);
    SWAP(2, 3);
#undef SWAP
#undef min
#undef max
}

If we believe our test set (and, yes it is quite poor, it's mere benefit is being short, simple and easy to understand what we are measuring), the average number of cycles of the resulting code for one sort is below 40 cycles (6 tests are executed). That put each swap at an average of 4 cycles. I call that amazingly fast. Any other improvements possible ?

标签： c algorithm optimization sorting gpgpu

23条回答

十年一品温如言

2楼-- · 2019-01-01 04:42

Try 'merging sorted list' sort. :) Use two array. Fastest for small and big array.
If you concating, you only check where insert. Other bigger values you not need compare (cmp = a-b>0).
For 4 numbers, you can use system 4-5 cmp (~4.6) or 3-6 cmp (~4.9). Bubble sort use 6 cmp (6). Lots of cmp for big numbers slower code.
This code use 5 cmp (not MSL sort):
if (cmp(arr[n][i+0],arr[n][i+1])>0) {swap(n,i+0,i+1);} if (cmp(arr[n][i+2],arr[n][i+3])>0) {swap(n,i+2,i+3);} if (cmp(arr[n][i+0],arr[n][i+2])>0) {swap(n,i+0,i+2);} if (cmp(arr[n][i+1],arr[n][i+3])>0) {swap(n,i+1,i+3);} if (cmp(arr[n][i+1],arr[n][i+2])>0) {swap(n,i+1,i+2);}

Principial MSL9 8 7 6 5 4 3 2 1 0 89 67 45 23 01 ... concat two sorted lists, list length = 1 6789 2345 01 ... concat two sorted lists, list length = 2 23456789 01 ... concat two sorted lists, list length = 4 0123456789 ... concat two sorted lists, list length = 8

js code

function sortListMerge_2a(cmp)	
{
var step, stepmax, tmp, a,b,c, i,j,k, m,n, cycles;
var start = 0;
var end   = arr_count;
//var str = '';
cycles = 0;
if (end>3)
	{
	stepmax = ((end - start + 1) >> 1) << 1;
	m = 1;
	n = 2;
	for (step=1;step<stepmax;step<<=1)	//bounds 1-1, 2-2, 4-4, 8-8...
		{
		a = start;
		while (a<end)
			{
			b = a + step;
			c = a + step + step;
			b = b<end ? b : end;
			c = c<end ? c : end;
			i = a;
			j = b;
			k = i;
			while (i<b && j<c)
				{
				if (cmp(arr[m][i],arr[m][j])>0)
					{arr[n][k] = arr[m][j]; j++; k++;}
				else	{arr[n][k] = arr[m][i]; i++; k++;}
				}
			while (i<b)
				{arr[n][k] = arr[m][i]; i++; k++;
}
			while (j<c)
				{arr[n][k] = arr[m][j]; j++; k++;
}
			a = c;
			}
		tmp = m; m = n; n = tmp;
		}
	return m;
	}
else
	{
	// sort 3 items
	sort10(cmp);
	return m;
	}
}

0人赞添加讨论(0) 举报

怪性笑人.

3楼-- · 2019-01-01 04:43

While I really like the swap macro provided:

#define min(x, y) (y ^ ((x ^ y) & -(x < y)))
#define max(x, y) (x ^ ((x ^ y) & -(x < y)))
#define SWAP(x,y) { int tmp = min(d[x], d[y]); d[y] = max(d[x], d[y]); d[x] = tmp; }

I see an improvement (which a good compiler might make):

#define SWAP(x,y) { int tmp = ((x ^ y) & -(y < x)); y ^= tmp; x ^= tmp; }

We take note of how min and max work and pull the common sub-expression explicitly. This eliminates the min and max macros completely.

0人赞添加讨论(0) 举报

深知你不懂我心

4楼-- · 2019-01-01 04:43

If insertion sort is reasonably competitive here, I would recommend trying a shellsort. I'm afraid 6 elements is probably just too little for it to be among the best, but it might be worth a try.

Example code, untested, undebugged, etc. You want to tune the inc = 4 and inc -= 3 sequence to find the optimum (try inc = 2, inc -= 1 for example).

static __inline__ int sort6(int * d) {
    char j, i;
    int tmp;
    for (inc = 4; inc > 0; inc -= 3) {
        for (i = inc; i < 5; i++) {
            tmp = a[i];
            j = i;
            while (j >= inc && a[j - inc] > tmp) {
                a[j] = a[j - inc];
                j -= inc;
            }
            a[j] = tmp;
        }
    }
}

I don't think this will win, but if someone posts a question about sorting 10 elements, who knows...

According to Wikipedia this can even be combined with sorting networks: Pratt, V (1979). Shellsort and sorting networks (Outstanding dissertations in the computer sciences). Garland. ISBN 0-824-04406-1

0人赞添加讨论(0) 举报

看淡一切

5楼-- · 2019-01-01 04:43

Here are three typical sorting methods that represent three different classes of Sorting Algorithms:

Insertion Sort: Θ(n^2)

Heap Sort: Θ(n log n)

Count Sort: Θ(3n)

But check out Stefan Nelsson discussion on the fastest sorting algorithm? where he discuss a solution that goes down to O(n log log n) .. check out its implementation in C

This Semi-Linear Sorting algorithm was presented by a paper in 1995:

A. Andersson, T. Hagerup, S. Nilsson, and R. Raman. Sorting in linear time? In Proceedings of the 27th Annual ACM Symposium on the Theory of Computing, pages 427-436, 1995.

0人赞添加讨论(0) 举报

春风洒进眼中

6楼-- · 2019-01-01 04:46

I believe there are two parts to your question.

The first is to determine the optimal algorithm. This is done - at least in this case - by looping through every possible ordering (there aren't that many) which allows you to compute exact min, max, average and standard deviation of compares and swaps. Have a runner-up or two handy as well.
The second is to optimize the algorithm. A lot can be done to convert textbook code examples to mean and lean real-life algorithms. If you realize that an algorithm can't be optimized to the extent required, try a runner-up.

I wouldn't worry too much about emptying pipelines (assuming current x86): branch prediction has come a long way. What I would worry about is making sure that the code and data fit in one cache line each (maybe two for the code). Once there fetch latencies are refreshingly low which will compensate for any stall. It also means that your inner loop will be maybe ten instructions or so which is right where it should be (there are two different inner loops in my sorting algorithm, they are 10 instructions/22 bytes and 9/22 long respectively). Assuming the code doesn't contain any divs you can be sure it will be blindingly fast.

0人赞添加讨论(0) 举报

长期被迫恋爱

7楼-- · 2019-01-01 04:48

I ported the test suite to a PPC architecture machine I can not identify (didn't have to touch code, just increase the iterations of the test, use 8 test cases to avoid polluting results with mods and replace the x86 specific rdtsc):

Direct call to qsort library function : 101

Naive implementation (insertion sort) : 299

Insertion Sort (Daniel Stutzbach) : 108

Insertion Sort Unrolled : 51

Sorting Networks (Daniel Stutzbach) : 26

Sorting Networks (Paul R) : 85

Sorting Networks 12 with Fast Swap : 117

Sorting Networks 12 reordered Swap : 116

Rank Order : 56

0人赞添加讨论(0) 举报

1 2 3 4 下一页