In SSSE3, the PALIGNR instruction performs the following:
PALIGNR concatenates the destination operand (the first operand) and the source operand (the second operand) into an intermediate composite, shifts the composite at byte granularity to the right by a constant immediate, and extracts the right-aligned result into the destination.
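For concreteness, here is a small self-contained illustration of mine (not from the manual) of what the 128-bit intrinsic _mm_alignr_epi8 does:

#include <stdio.h>
#include <tmmintrin.h> // SSSE3: _mm_alignr_epi8

int main(void)
{
    // Bytes 0..15 in lo and 16..31 in hi, chosen so the output is easy to read.
    __m128i lo = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                               8, 9, 10, 11, 12, 13, 14, 15);
    __m128i hi = _mm_setr_epi8(16, 17, 18, 19, 20, 21, 22, 23,
                               24, 25, 26, 27, 28, 29, 30, 31);

    // Concatenate hi:lo (32 bytes), shift right by 3 bytes, keep the low 16 bytes.
    __m128i r = _mm_alignr_epi8(hi, lo, 3);

    unsigned char out[16];
    _mm_storeu_si128((__m128i *)out, r);
    for (int i = 0; i < 16; i++)
        printf("%d ", out[i]); // prints 3 4 5 ... 18
    printf("\n");
    return 0;
}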
I'm currently in the midst of porting my SSE4 code to AVX2 instructions, working on 256-bit registers instead of 128-bit ones. Naively, I believed that the intrinsic _mm256_alignr_epi8 (VPALIGNR) performs the same operation as _mm_alignr_epi8, only on 256-bit registers. Sadly, that is not exactly the case. In fact, _mm256_alignr_epi8 treats the 256-bit register as two 128-bit lanes and performs two "align" operations, one on each pair of corresponding 128-bit lanes, effectively doing the same thing as _mm_alignr_epi8 but on two registers at once. It's most clearly illustrated in the documentation for _mm256_alignr_epi8.
Currently my solution is to keep using _mm_alignr_epi8 by splitting the ymm (256-bit) registers into two xmm (128-bit) registers (low and high), like so:
__m128i xmm_ymm1_lo = _mm256_extractf128_si256(ymm1, 0); // low 128 bits of ymm1
__m128i xmm_ymm1_hi = _mm256_extractf128_si256(ymm1, 1); // high 128 bits of ymm1
__m128i xmm_ymm2_lo = _mm256_extractf128_si256(ymm2, 0); // low 128 bits of ymm2
__m128i xmm_ymm_aligned_lo = _mm_alignr_epi8(xmm_ymm1_hi, xmm_ymm1_lo, 1);
__m128i xmm_ymm_aligned_hi = _mm_alignr_epi8(xmm_ymm2_lo, xmm_ymm1_hi, 1);
__m256i xmm_ymm_aligned = _mm256_set_m128i(xmm_ymm_aligned_hi, xmm_ymm_aligned_lo);
This works, but there has to be a better way, right?
Is there perhaps a more "general" AVX2 instruction that I should be using to get the same result?
What are you using palignr for? If it's only to handle data misalignment, simply use misaligned loads instead; they are generally "fast enough" on modern Intel µ-architectures (and will save you a lot of code size).
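For instance, a one-byte-offset read that might otherwise be done with two aligned loads plus palignr can simply be a single unaligned load (the function name and arguments below are placeholders of mine):

#include <stddef.h>
#include <immintrin.h>

// Read 32 bytes starting one byte past a position; an unaligned load
// does the whole job, no shuffling required.
static inline __m256i load_shifted_by_one(const char *src, size_t i)
{
    return _mm256_loadu_si256((const __m256i *)(src + i + 1));
}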
If you need palignr-like behavior for some other reason, you can simply take advantage of the unaligned load support to do it in a branch-free manner. Unless you're totally load-store bound, this is probably the preferred idiom.
static inline __m256i _mm256_alignr_epi8(const __m256i v0, const __m256i v1, const int n)
{
    // Do whatever your compiler needs to make this buffer 64-byte aligned.
    // You want to avoid the possibility of a page-boundary crossing load.
    char buffer[64];
    // Two aligned stores to fill the buffer.
    _mm256_store_si256((__m256i *)&buffer[0], v0);
    _mm256_store_si256((__m256i *)&buffer[32], v1);
    // Misaligned load to get the data we want.
    return _mm256_loadu_si256((__m256i *)&buffer[n]);
}
If you can provide more information about how exactly you're using palignr, I can probably be more helpful.
Extending palignr to 256 bits takes two instructions: vperm2i128 and vpalignr.
See: https://software.intel.com/en-us/blogs/2015/01/13/programming-using-avx2-permutations
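Here is a minimal sketch of mine of that two-instruction combination for shift counts of 1 to 15 bytes (the macro name is my own, and the count has to be a compile-time constant because both instructions take an immediate):

#include <immintrin.h>

// Shift the 256-bit concatenation hi:lo right by n bytes (0 < n < 16) and keep
// the low 256 bits.  Written as a macro because n must be an immediate; don't
// pass arguments with side effects.
//   1. vperm2i128 builds t with t.lane0 = lo.lane1 and t.lane1 = hi.lane0,
//      i.e. the two "middle" 128-bit lanes of hi:lo.
//   2. vpalignr then aligns lane by lane:
//        result.lane0 = alignr(lo.lane1, lo.lane0, n)
//        result.lane1 = alignr(hi.lane0, lo.lane1, n)
#define MM256_ALIGNR_256(hi, lo, n) \
    _mm256_alignr_epi8(_mm256_permute2x128_si256((lo), (hi), 0x21), (lo), (n))

For counts of 16 to 31, feeding the same permuted value t into _mm256_alignr_epi8((hi), t, (n) - 16) should give the corresponding result.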
The only solution I was able to come up with for this is:
static inline __m256i _mm256_alignr_epi8(const __m256i v0, const __m256i v1, const int n)
{
    // Note: n must effectively be a compile-time constant, since _mm_alignr_epi8
    // takes an immediate; this relies on the function being inlined.
    if (n < 16)
    {
        __m128i v0_lo = _mm256_extractf128_si256(v0, 0);
        __m128i v0_hi = _mm256_extractf128_si256(v0, 1);
        __m128i v1_lo = _mm256_extractf128_si256(v1, 0);
        __m128i vout_lo = _mm_alignr_epi8(v0_hi, v0_lo, n);
        __m128i vout_hi = _mm_alignr_epi8(v1_lo, v0_hi, n);
        return _mm256_set_m128i(vout_hi, vout_lo);
    }
    else
    {
        __m128i v0_hi = _mm256_extractf128_si256(v0, 1);
        __m128i v1_lo = _mm256_extractf128_si256(v1, 0);
        __m128i v1_hi = _mm256_extractf128_si256(v1, 1);
        __m128i vout_lo = _mm_alignr_epi8(v1_lo, v0_hi, n - 16);
        __m128i vout_hi = _mm_alignr_epi8(v1_hi, v1_lo, n - 16);
        return _mm256_set_m128i(vout_hi, vout_lo);
    }
}
which I think is pretty much identical to your solution, except that it also handles shifts of 16 bytes or more.