Is it possible to compare in SIMD instruction and swap the values if some condition happened. In other words, I have 4 integers:
(100 5) (1 42)
And I want to receive:
(5 100) (1 42)
i.e. I want to compare pairwise (first value with second and third with fourth) and in case left operand is greater - swap the values. Is it possible to do with only 1 SIMD?
P.S.: it's the first time I'm trying SIMD and probably I'm using wrong terminology - please fix me if I'm wrong.
It seems that you want to sort pairs of 32-bit integers in a single XMM register. There is no ready instruction for it of course, but you can do it in a few instructions with SSE4.1 (beware: code not tested):
//input = [100, 5, 1, 42]
__m128i swapped = _mm_shuffle_epi32(input, _MM_SHUFFLE(2,3,0,1)); // [5, 100, 42, 1]
__m128i comp = _mm_cmplt_epi32(input, swapped); // [0, -1, -1, 0]
comp = _mm_xor_si128(comp, _mm_set_epi32(-1, 0, -1, 0)); // [0, 0, -1, -1]
input = _mm_blendv_epi8(swapped, input, comp); // [5, 100, 1, 42]
It seems to be 7 uops and take 2 CPU cycles (throughput) on Ivy Bridge.
It can be easily ported to AVX2 if wanted.
For AVX2 capable systems there is a solution using min/max and blend with imm
(which has 1 cycle latency vs. 2 cycles for variable one).
Following code has 3 cycles latency and should have less than 2cycles throughput on HSW+
__m128i tmp = _mm_shuffle_epi32(in, _MM_SHUFFLE(2,3,0,1));
__m128i min = _mm_min_epi32(in, tmp);
__m128i max = _mm_max_epi32(in, tmp);
// __m128i res = _mm_blend_epi32(min, max, 0xA); // AVX2 only
__m128i res = _mm_blend_epi16(min, max, 0xCC); // SSE4.1
I've tested it on my HSW system (processing 20000 pairs 100K times) and it has ~26% better performance than the code by stgatilov
CMP + VARIABLE BLEND 1.18sec
MIN/MAX + BLEND_32 0.87sec // AVX2 only code
MIN/MAX + BLEND_PS 0.86sec // SSE
MIN/MAX + PLEND_16 0.88sec // Preferred for SSE
UPDATE: Per stgatilov
' comment below. All MIN/MAX implementations have virtually the same performance (most likely just stuck into memory b/w)