可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

Is it possible to compare in SIMD instruction and swap the values if some condition happened. In other words, I have 4 integers:

(100 5) (1 42)

And I want to receive:

(5 100) (1 42)

i.e. I want to compare pairwise (first value with second and third with fourth) and in case left operand is greater - swap the values. Is it possible to do with only 1 SIMD?

P.S.: it's the first time I'm trying SIMD and probably I'm using wrong terminology - please fix me if I'm wrong.

回答1:

It seems that you want to sort pairs of 32-bit integers in a single XMM register. There is no ready instruction for it of course, but you can do it in a few instructions with SSE4.1 (beware: code not tested):

//input = [100, 5, 1, 42]
__m128i swapped = _mm_shuffle_epi32(input, _MM_SHUFFLE(2,3,0,1)); // [5, 100, 42, 1]
__m128i comp = _mm_cmplt_epi32(input, swapped);                   // [0, -1, -1, 0]
comp = _mm_xor_si128(comp, _mm_set_epi32(-1, 0, -1, 0));          // [0, 0, -1, -1]
input = _mm_blendv_epi8(swapped, input, comp);                    // [5, 100, 1, 42]

It seems to be 7 uops and take 2 CPU cycles (throughput) on Ivy Bridge.

It can be easily ported to AVX2 if wanted.

回答2:

For AVX2 capable systems there is a solution using min/max and blend with imm (which has 1 cycle latency vs. 2 cycles for variable one).

Following code has 3 cycles latency and should have less than 2cycles throughput on HSW+

__m128i tmp = _mm_shuffle_epi32(in, _MM_SHUFFLE(2,3,0,1));
__m128i min = _mm_min_epi32(in, tmp);
__m128i max = _mm_max_epi32(in, tmp);
// __m128i res = _mm_blend_epi32(min, max, 0xA); // AVX2 only
__m128i res = _mm_blend_epi16(min, max, 0xCC);   // SSE4.1

I've tested it on my HSW system (processing 20000 pairs 100K times) and it has ~26% better performance than the code by stgatilov

CMP + VARIABLE BLEND    1.18sec
MIN/MAX + BLEND_32      0.87sec // AVX2 only code
MIN/MAX + BLEND_PS      0.86sec // SSE
MIN/MAX + PLEND_16      0.88sec // Preferred for SSE

UPDATE: Per stgatilov' comment below. All MIN/MAX implementations have virtually the same performance (most likely just stuck into memory b/w)