I am migrating vectorized code written using SSE2 intrinsics to AVX2 intrinsics.
Much to my disappointment, I discovered that the shift intrinsics _mm256_slli_si256 and _mm256_srli_si256 operate on the two 128-bit halves of an AVX register separately, so zeroes are shifted in at the lane boundary. (This is in contrast with _mm_slli_si128 and _mm_srli_si128, which handle the whole SSE register.)
Can you recommend a short substitute?
UPDATE:
A full-register equivalent of _mm256_slli_si256(A, N), i.e. shifting the whole 32 bytes left by N bytes, is efficiently achieved with

    _mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 3, 0)), 16 - N)

for 0 < N < 16, or with

    _mm256_slli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 3, 0)), N - 16)

for shifts larger than 16 bytes. But the question remains open for _mm256_srli_si256.
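As a concrete instance of the first form, here is a minimal sketch (the helper name is mine, not part of the question) of a full 32-byte left shift by N = 4 bytes:

    #include <immintrin.h>

    /* Sketch only: full 32-byte left shift of a ymm register by 4 bytes,
       i.e. the 0 < N < 16 case with N = 4. */
    static inline __m256i slli_si256_full_by4(__m256i a)
    {
        /* tmp: high lane = low lane of a, low lane = zeroed (bit 3 of the imm zeroes it) */
        __m256i tmp = _mm256_permute2x128_si256(a, a, _MM_SHUFFLE(0, 0, 3, 0));
        /* per-lane concatenation shift: the high lane pulls its missing bottom bytes
           from the low lane of a (via tmp), while the low lane shifts in zeros */
        return _mm256_alignr_epi8(a, tmp, 16 - 4);
    }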
Here is a function to bit-shift a ymm register left using AVX2. I use it to shift left by one bit, though it looks like it works for shift counts of up to 63 bits.
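A minimal sketch of such a routine (the function name is mine, and I assume a runtime shift count between 1 and 63 bits): it shifts each 64-bit element, then carries the spilled bits into the next element with a lane-crossing permute.

    #include <immintrin.h>

    /* Sketch only: shift the whole 256-bit value left by `count` bits, 1 <= count <= 63. */
    static inline __m256i shift_left_bits(__m256i a, int count)
    {
        __m128i cnt  = _mm_cvtsi32_si128(count);
        __m128i icnt = _mm_cvtsi32_si128(64 - count);
        /* bits that fall out of the top of each 64-bit element */
        __m256i carry = _mm256_srl_epi64(a, icnt);
        /* move each element's carry up to the next 64-bit element (lane-crossing) */
        carry = _mm256_permute4x64_epi64(carry, _MM_SHUFFLE(2, 1, 0, 3));
        /* the bottom element must receive zeros, not the carry wrapped around from the top */
        carry = _mm256_blend_epi32(carry, _mm256_setzero_si256(), 0x03);
        /* combine the per-element shift with the carried-in bits */
        return _mm256_or_si256(_mm256_sll_epi64(a, cnt), carry);
    }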
If the shift count is a multiple of 4 bytes, vpermd (_mm256_permutevar8x32_epi32) with the right shuffle mask will do the trick with one instruction (or more, if you actually need to zero the shifted-in bytes instead of copying a different element over them). To support variable (multiple-of-4-byte) shift counts, you could load the control mask from a window into an array of

    0 0 0 0 0 0 0 1 2 3 4 5 6 7 0 0 0 0 0 0 0

or something like that, except that 0 is just the bottom element and doesn't zero things out. For more on this idea of generating a mask from a sliding window, see my answer on another question.
This answer is pretty minimal, since vpermd doesn't directly solve the problem. I point it out as an alternative that might work in some cases where you're looking for a full-vector shift.
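To make the sliding-window idea concrete, here is a minimal sketch (the names and the exact tables are mine, not code from the answer) of a full-register right shift by a variable multiple of 4 bytes: vpermd copies the dwords across lanes, and an AND with a second windowed table zeroes the elements that the plain index window cannot clear.

    #include <immintrin.h>
    #include <stdint.h>

    /* Sketch only: full 32-byte right shift by k dwords (k = 0..8). */
    static inline __m256i srli_si256_full_dwords(__m256i a, unsigned k)
    {
        /* window of source indices: result element i should come from a[i + k] */
        static const int32_t idx_win[16]  = { 0, 1, 2, 3, 4, 5, 6, 7,  0, 0, 0, 0, 0, 0, 0, 0 };
        /* window of keep/zero masks: the top k elements must be cleared afterwards */
        static const int32_t keep_win[16] = { -1, -1, -1, -1, -1, -1, -1, -1,  0, 0, 0, 0, 0, 0, 0, 0 };

        __m256i idx  = _mm256_loadu_si256((const __m256i *)(idx_win + k));
        __m256i keep = _mm256_loadu_si256((const __m256i *)(keep_win + k));
        __m256i perm = _mm256_permutevar8x32_epi32(a, idx);  /* vpermd: lane-crossing copy */
        return _mm256_and_si256(perm, keep);                 /* zero the shifted-in dwords */
    }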
From different inputs, I gathered these solutions. The key to crossing the inter-lane barrier is the align instruction, _mm256_alignr_epi8. In each case, _mm256_permute2x128_si256 first builds a temporary that pairs the relevant lane of A with a zeroed lane; the align instruction (or a plain per-lane byte shift) then pulls the bytes across the lane boundary from it.

_mm256_slli_si256(A, N)

    0 < N < 16:   _mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 3, 0)), 16 - N)
    N = 16:       _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 3, 0))
    16 < N < 32:  _mm256_slli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 3, 0)), N - 16)

_mm256_srli_si256(A, N)

    0 < N < 16:   _mm256_alignr_epi8(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1)), A, N)
    N = 16:       _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1))
    16 < N < 32:  _mm256_srli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1)), N - 16)
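As a usage sketch (the wrapper name is mine), the 0 < N < 16 right-shift row with N = 4:

    #include <immintrin.h>

    /* Sketch only: full 32-byte logical right shift of a ymm register by 4 bytes. */
    static inline __m256i srli_si256_full_by4(__m256i a)
    {
        /* tmp: low lane = high lane of a, high lane = zeroed (vperm2i128 imm 0x81) */
        __m256i tmp = _mm256_permute2x128_si256(a, a, _MM_SHUFFLE(2, 0, 0, 1));
        /* per-lane concatenation shift: the low lane pulls its missing top bytes from
           the high lane of a (via tmp), while the high lane shifts in zeros */
        return _mm256_alignr_epi8(tmp, a, 4);
    }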