SSE does not provide a way of shifting packed integers by a variable amount (I can use any instructions AVX and older). You can only do uniform shifts. The result I'm trying to achieve for each integer in the vector is this.
i[0] = i[0] & 0b111111;
i[1] = (i[1]>>6) & 0b111111;
i[2] = (i[2]>>12) & 0b111111;
i[3] = (i[3]>>18) & 0b111111;
Essentially trying to isolate a different group of 6 bits in each integer.
So what is the optimal solution?
Things I thought about:
You can simulate a variable right shift, with a variable left shift and a uniform right shift. I thought about multiplying the packed integers by a different amount each (therefore simulating a left shift). Then with the result, you can do a uniform right shift get the answer. The problem with this the specific op I would use for the multiplication would be _mm_mullo_epi32
, which has disappointing latency (10 cycles for haswell), and given my program it would have to wait for the result cause this particular result is a dependency for the next instructions. Overall that method I think would only be a little faster than the brute force method, which is unpack, shift using scalar instructions, and then repack the vector, which would take about 20 cycles I think.
If AVX2 is available, this only takes a single efficient instruction. e.g. __m128i _mm_srlv_epi32 (__m128i a, __m128i count)
(vpsrlvd
), and the 256-bit versions. Variable-shifts of 32-bit and 64-bit elements by corresponding count elements are available in left, right-arithmetic, and right-logical. (Arithmetic right-shift isn't available with 64-bit element size.)
AVX512BW adds 16-bit variable-shifts.
Emulating it without AVX2:
What kind of dependency chain is this part of? Can you unroll and interleave so two vectors are in flight at once? Two long dep chains in parallel is much better than one long dep chain, if it's so long that the out-of-order window can't see next dep chain in the next loop iteration.
It could be worth making a separate AVX2 version of your function for use on Haswell and later CPUs (where you can use a variable-shift). If you do that, your function will only use pmulld
(mullo_epi32
) on CPUs where it's most efficient. (i.e. you avoid SSE4.1 mullo_epi32
on AVX2 CPUs, because it turns out those CPUs made that instruction slower.)
pmulld
looks like the best we can do for throughput and fused-domain uop count even on Haswell.
On SnB/IvB where it's a single uop for the vector-integer multiply unit, the whole function is only 2 uops / 6 cycle latency / one per 1c throughput. (Which is worse than I managed with shift/blend, so you'd only want to use pmulld
if throughput/code-size matters at all, and you aren't bottlenecked purely on latency. e.g. after unrolling.)
If your shift-counts are constants, and you have spare bits at the top of your register, you can multiply by powers of 2 and then use a fixed right-shift. Shift right every DW in a __m128i by a different amount. Knocking off high bits isn't a problem for your bit-field extract because you're ANDing to keep only the low few bits anyway.
// See the godbolt link below for a version of this with more comments
// SnB/IvB: 6c latency, 2 fused-domain uops.
__m128i isolate_successive_6bits_mul (__m128i input)
{
// We can avoid the AND if we shift the elements all the way to the left to knock off the high garbage bits.
// 32 - 6 - 18 = 8 extra bits to left shift
__m128i mul_constant = _mm_set_epi32(1<<(0+8), 1<<(6+8), 1<<(12+8), 1<<(18+8));
__m128i left_vshift = _mm_mullo_epi32(input, mul_constant);
__m128i rightshifted = _mm_srli_epi32(left_vshift, (18+8));
return rightshifted;
}
A smart way with blends:
(Unfortunately we don't have AVX2 vpblendd
for efficient dword blends that can run on any port. pblendw
is limited to port 5 on Intel CPUs. blendps
could be good for throughput (runs on any port) but can introduce bypass delays between integer instructions.)
Shift and blend so each element ends up with the right total shift count.
AND-mask the low 6 bits after combining everything into one vector.
Same latency as the brute-force way (see below) on Intel CPUs, and better throughput (because of fewer uops). Only two immediate-blends tying up port5 is nice. (AVX2 vpblendd
can run on any port, but then we'd just use vpsrlvd
.)
// seems to be optimal for Intel CPUs.
__m128i isolate_successive_6bits (__m128i input)
{ // input = [ D C B A ]
// output = [ D>>18 C>>12 B>>6 A ] & set1(0b111111)
__m128i right12 = _mm_srli_epi32(input, 12);
__m128i merged = _mm_blend_epi16(input, right12, 0xF0); // copy upper half, like `movhps` (but don't use that because of extra bypass delay)
// merged = [ D>>12 C>>12 B>>0 A>>0 ]
__m128i right6 = _mm_srli_epi32(merged, 6);
merged = _mm_blend_epi16(merged, right6, 0b11001100); // blend in the odd elements
// merged = [ D>>(12+6) C>>12 B>>(0+6) A>>0 ]
return _mm_and_si128(merged, _mm_set1_epi32(0b111111)); // keep only the low 6 bits
}
I put both versions on the Godbolt compiler explorer.
This version is only 5 uops, compiled with gcc 5.3 -O3 -march=ivybridge
:
# input in xmm0, result in xmm0
isolate_successive_6bits:
vpsrld xmm1, xmm0, 12 # starts on cycle 0, result ready for the start of cycle 1
vpblendw xmm0, xmm0, xmm1, 240 # cycle 1
vpsrld xmm1, xmm0, 6 # cycle 2
vpblendw xmm0, xmm0, xmm1, 204 # cycle 3
vpand xmm0, xmm0, XMMWORD PTR .LC0[rip] # cycle 4, result ready on cycle 5
ret
Every instruction is dependent on the previous, so it has 5c latency. SnB/IvB/HSW/BDW CPUs only have one shift port, so they can't take advantage of the parallelism available in the more brute-force version (which does three shifts with different shift counts). Skylake can, but then the two cycles of blending afterwards eat up the improvement.
The "brute force" way:
Do three shifts the three different shift counts, and use three immediate blends (pblendw
) to combine the four vectors into one that has each desired element.
// same latency as the previous version on Skylake
// slower on previous Intel SnB-family CPUs.
isolate_successive_6bits_parallel:
vpsrld xmm1, xmm0, 6 # cycle 0. SKL: c0
vpsrld xmm2, xmm0, 12 # cycle 1 (resource conflict on pre-Skylake). SKL: c0
vpblendw xmm1, xmm0, xmm1, 12 # cycle 2 (input dep). SKL: c1
vpsrld xmm3, xmm0, 18 # cycle 2. SKL: c1
vpblendw xmm0, xmm2, xmm3, 192 # cycle 3 (input dep). SKL: c2
vpblendw xmm0, xmm1, xmm0, 240 # cycle 4 (input dep). SKL: c3
vpand xmm0, xmm0, XMMWORD PTR .LC0[rip] # cycle 5 (input dep). SKL: c4.
ret
Doing the merging with a linear dependency chain instead of a tree means merging can finish sooner after the last shift result is ready:
isolate_successive_6bits_parallel2:
vpsrld xmm1, xmm0, 6 # c0. SKL:c0
vpsrld xmm2, xmm0, 12 # c1. SKL:c0
vpblendw xmm1, xmm0, xmm1, 12 # c2. SKL:c1
vpblendw xmm1, xmm1, xmm2, 48 # c3. SKL:c2
vpsrld xmm0, xmm0, 18 # c2. SKL:c1
vpblendw xmm0, xmm1, xmm0, 192 # c4. SKL:c3 (dep on xmm1)
vpand xmm0, xmm0, XMMWORD PTR .LC0[rip] # c5. SKL:c4
ret
Hmm, nope, doesn't help. No gain in latency for SnB to BDW, or for SKL. The first merge can happen after only one shift because the un-shifted input is what we need for one element. If element 0 needed a non-zero shift count, this way would have an advantage for pre-SKL, and maybe a disadvantage for SKL.