I want to shuffle the bytes of a __m256i vector. There is an intrinsic, _mm256_shuffle_epi8, which does something similar, but it shuffles within each 128-bit lane and cannot move bytes across lanes.
How can I do a cross-lane shuffle using AVX2 instructions?
There is a way to emulate this operation, but it is not very beautiful:
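A minimal sketch of one such emulation, assuming every index byte is in the range 0..31 (the zeroing behaviour of bit 7 is not preserved here). The helper names shuffle_cross_lane_add and shuffle_bytes_add are made up for illustration, and the TARGET_AVX2 macro is only a GCC/Clang convenience so the file builds without -mavx2. The idea: add a per-lane offset to the indices so that an index pointing at the "wrong" lane ends up with bit 7 set and is zeroed by VPSHUFB, do the same against a lane-swapped copy of the data, and OR the two results.

```c
#include <immintrin.h>
#include <stdint.h>

#if defined(__GNUC__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2 /* e.g. MSVC: compile with /arch:AVX2 instead */
#endif

/* Hypothetical helper: result byte i = value byte idx[i], idx[i] in 0..31. */
TARGET_AVX2
static __m256i shuffle_cross_lane_add(__m256i value, __m256i idx)
{
    /* Low lane +0x70, high lane +0xF0: after the add, bit 7 is clear only
       when the index points into the lane the result byte lives in. */
    const __m256i k_same = _mm256_setr_epi32(
        0x70707070, 0x70707070, 0x70707070, 0x70707070,
        (int)0xF0F0F0F0u, (int)0xF0F0F0F0u, (int)0xF0F0F0F0u, (int)0xF0F0F0F0u);
    /* The opposite offsets: bit 7 is clear only for the other lane. */
    const __m256i k_other = _mm256_setr_epi32(
        (int)0xF0F0F0F0u, (int)0xF0F0F0F0u, (int)0xF0F0F0F0u, (int)0xF0F0F0F0u,
        0x70707070, 0x70707070, 0x70707070, 0x70707070);

    /* Swap the two 128-bit lanes (qwords 2,3,0,1). */
    __m256i swapped = _mm256_permute4x64_epi64(value, 0x4E);

    __m256i from_same  = _mm256_shuffle_epi8(value,   _mm256_add_epi8(idx, k_same));
    __m256i from_other = _mm256_shuffle_epi8(swapped, _mm256_add_epi8(idx, k_other));

    /* Exactly one of the two is non-zeroed for each byte. */
    return _mm256_or_si256(from_same, from_other);
}

/* Array wrapper so callers do not need AVX2 enabled themselves. */
TARGET_AVX2
void shuffle_bytes_add(const uint8_t src[32], const uint8_t idx[32], uint8_t out[32])
{
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    __m256i i = _mm256_loadu_si256((const __m256i *)idx);
    _mm256_storeu_si256((__m256i *)out, shuffle_cross_lane_add(v, i));
}
```

This costs one lane permute, two adds, two in-lane shuffles and an OR. If the index vector is reused, the two adds can be hoisted out of the loop.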
First, a clarification: Intel's specification of VPSHUFB takes the shuffle index from bits 0-3 of each pattern byte (bit 7, when set, zeroes the result byte). Since you want a cross-lane shuffle, your pattern uses bit 4 as well, to address bytes at index 16 and above in the YMM register.
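To pin down the intended semantics, here is a scalar reference model. It is only a sketch: the function name is invented, and it assumes that bit 7 keeps its VPSHUFB meaning (zero the byte) and that bits 5-6 are ignored.

```c
#include <stdint.h>

/* Hypothetical reference model of a 32-byte cross-lane shuffle.
   Bits 0-3 of each pattern byte pick a byte within a lane,
   bit 4 picks the lane, bit 7 zeroes the output byte. */
void shuffle_bytes_reference(const uint8_t src[32], const uint8_t pattern[32],
                             uint8_t out[32])
{
    for (int i = 0; i < 32; i++) {
        uint8_t sel = pattern[i];
        if (sel & 0x80)
            out[i] = 0;               /* assumed: same zeroing rule as VPSHUFB */
        else
            out[i] = src[sel & 0x1F]; /* bits 0-4 form an index 0..31 */
    }
}
```

A model like this is handy for checking any vectorised version against all 32 index values.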
Assumptions: the data you want to shuffle is in YMM0, and the pattern is in YMM1.
The code is below:
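A minimal sketch of one sequence that fits this description, written with intrinsics instead of raw register names. It is an assumption about the approach, not necessarily the original listing; the names shuffle_cross_lane_blend and shuffle_bytes_blend are made up, with data standing in for YMM0 and pattern for YMM1. Each lane of the data is broadcast to both halves, both copies are shuffled with the same pattern (VPSHUFB only looks at bits 0-3 and bit 7), and bit 4 of the pattern then selects between the two results.

```c
#include <immintrin.h>
#include <stdint.h>

#if defined(__GNUC__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2 /* e.g. MSVC: compile with /arch:AVX2 instead */
#endif

/* Hypothetical helper: data plays the role of YMM0, pattern of YMM1. */
TARGET_AVX2
static __m256i shuffle_cross_lane_blend(__m256i data, __m256i pattern)
{
    /* Broadcast each 128-bit lane of the data to both halves. */
    __m256i lo = _mm256_permute4x64_epi64(data, 0x44); /* qwords 0,1,0,1 */
    __m256i hi = _mm256_permute4x64_epi64(data, 0xEE); /* qwords 2,3,2,3 */

    /* In-lane shuffles: candidates for source bytes 0..15 and 16..31. */
    __m256i from_lo = _mm256_shuffle_epi8(lo, pattern);
    __m256i from_hi = _mm256_shuffle_epi8(hi, pattern);

    /* Shift left by 3 so bit 4 of every byte lands in bit 7,
       which is the bit VPBLENDVB tests. */
    __m256i select_hi = _mm256_slli_epi16(pattern, 3);

    /* Take from_hi where bit 4 of the pattern byte was set. */
    return _mm256_blendv_epi8(from_lo, from_hi, select_hi);
}

/* Array wrapper so callers do not need AVX2 enabled themselves. */
TARGET_AVX2
void shuffle_bytes_blend(const uint8_t src[32], const uint8_t pattern[32],
                         uint8_t out[32])
{
    __m256i d = _mm256_loadu_si256((const __m256i *)src);
    __m256i p = _mm256_loadu_si256((const __m256i *)pattern);
    _mm256_storeu_si256((__m256i *)out, shuffle_cross_lane_blend(d, p));
}
```

In this variant a pattern byte with bit 7 set still produces zero, because both in-lane shuffles zero that byte before the blend.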
This also leaves the pattern in YMM1 untouched, as is the case with the VPSHUFB instruction itself.
Hope this helps.