Given a register of 4 bytes (or 16 for SIMD), there has to be an efficient way to sort the bytes in-register with a few instructions.
Thanks in advance.
Given a register of 4 bytes (or 16 for SIMD), there has to be an efficient way to sort the bytes in-register with a few instructions.
Thanks in advance.
Look up an efficient sorting network for N = the number of bytes you care about (4 or 16). Convert that to a sequence of compare and exchange instructions. (For N=16 that'll be more than 'a few', though.)
Found it! It's in the 2007 paper "Using SIMD Registers and Instructions to Enable Instruction-Level Parallelism in Sorting Algorithms" by Furtak, Amaral, and Niewiadomski. Section 4.
It uses 4 SSE registers, has 12 steps, and runs in 19 instructions including load and store.
The same paper has some excellent work on dynamically making sorting networks with SIMD.
To speed up sorting of strings, I ended up packing 7 bytes per double and sorting (ranking) an array of 16 doubles in SSE2, using bitonic sort to create two runs of 8, and a binary merge to merge the two runs. You can see the first part here http://mischasan.wordpress.com/2011/07/29/okay-one-more-poke-at-sse2-sorting-doubles/ (asm) and here http://mischasan.wordpress.com/2011/09/02/update-on-bitonic-sse2-sort-of-16-doubles/ (C), and the bitonic merge step (if you want to go SSE all the way) here: http://mischasan.wordpress.com/2012/11/04/sse2-odd-even-merge-the-last-step-in-sorting/ . I replaced the insertion sort at the bottom of qsort with this sort, and it's about 5 times as fast as straight qsort. HTH
I hadn't seen the UofA paper; the bitonic logic is from old school (CTM) GPGPU programming.
Sorry about the embedded link strings; I don't know how to add clickable links in comments stackoverflow.
All sorting algorithms require "swapping" values from one place to another. Since you're talking about a literal CPU register, that means any sort would need another register to use as a temporary place to hold the bytes being swapped.
I've never seen a chip with a built-in method for sorting bytes within a register. Not saying it hasn't been done, but I can't think of many uses for such an instruction.