Fast in-register sort of bytes?

2019-04-29 03:44发布

Given a register of 4 bytes (or 16 for SIMD), there has to be an efficient way to sort the bytes in-register with a few instructions.

Thanks in advance.

4条回答
该账号已被封号
2楼-- · 2019-04-29 04:08

Look up an efficient sorting network for N = the number of bytes you care about (4 or 16). Convert that to a sequence of compare and exchange instructions. (For N=16 that'll be more than 'a few', though.)

查看更多
Evening l夕情丶
3楼-- · 2019-04-29 04:16

Found it! It's in the 2007 paper "Using SIMD Registers and Instructions to Enable Instruction-Level Parallelism in Sorting Algorithms" by Furtak, Amaral, and Niewiadomski. Section 4.

It uses 4 SSE registers, has 12 steps, and runs in 19 instructions including load and store.

The same paper has some excellent work on dynamically making sorting networks with SIMD.

查看更多
仙女界的扛把子
4楼-- · 2019-04-29 04:16

To speed up sorting of strings, I ended up packing 7 bytes per double and sorting (ranking) an array of 16 doubles in SSE2, using bitonic sort to create two runs of 8, and a binary merge to merge the two runs. You can see the first part here http://mischasan.wordpress.com/2011/07/29/okay-one-more-poke-at-sse2-sorting-doubles/ (asm) and here http://mischasan.wordpress.com/2011/09/02/update-on-bitonic-sse2-sort-of-16-doubles/ (C), and the bitonic merge step (if you want to go SSE all the way) here: http://mischasan.wordpress.com/2012/11/04/sse2-odd-even-merge-the-last-step-in-sorting/ . I replaced the insertion sort at the bottom of qsort with this sort, and it's about 5 times as fast as straight qsort. HTH

I hadn't seen the UofA paper; the bitonic logic is from old school (CTM) GPGPU programming.

Sorry about the embedded link strings; I don't know how to add clickable links in comments stackoverflow.

查看更多
走好不送
5楼-- · 2019-04-29 04:25

All sorting algorithms require "swapping" values from one place to another. Since you're talking about a literal CPU register, that means any sort would need another register to use as a temporary place to hold the bytes being swapped.

I've never seen a chip with a built-in method for sorting bytes within a register. Not saying it hasn't been done, but I can't think of many uses for such an instruction.

查看更多
登录 后发表回答