If I have an AVX register holding 4 doubles and I want to store the reverse of it in another register, is it possible to do this with a single intrinsic?
For example: if I had 4 floats in an SSE register, I could use:
_mm_shuffle_ps(A,A,_MM_SHUFFLE(0,1,2,3));
Can I do this with something like _mm256_permute2f128_pd()
? I don't think you can address each individual double using that intrinsic.
You actually need 2 permutes to do this:
_mm256_permute2f128_pd()
only permutes in 128-bit chunks.
_mm256_permute_pd()
does not permute across 128-bit boundaries.
So you need to use both:
inline __m256d reverse(__m256d x){
    x = _mm256_permute2f128_pd(x,x,1);  // swap the 128-bit halves: [x0,x1,x2,x3] -> [x2,x3,x0,x1]
    x = _mm256_permute_pd(x,5);         // imm8 = 0b0101: swap the pair within each 128-bit lane -> [x3,x2,x1,x0]
    return x;
}
Test:
#include <immintrin.h>
#include <iostream>

int main(){
    __m256d x = _mm256_set_pd(13,12,11,10);
    double a[4];   // portable replacement for MSVC-only x.m256d_f64 element access
    _mm256_storeu_pd(a, x);
    std::cout << a[0] << " " << a[1] << " " << a[2] << " " << a[3] << std::endl;
    x = reverse(x);
    _mm256_storeu_pd(a, x);
    std::cout << a[0] << " " << a[1] << " " << a[2] << " " << a[3] << std::endl;
}
Output:
10 11 12 13
13 12 11 10
With AVX2: VPERMPD ymm1, ymm2/m256, imm8
runs with the same throughput and latency as other lane-crossing shuffles (like VPERM2F128
) on Intel CPUs. (On AMD Excavator, if the published instruction-table numbers are right, vperm2f128
is slower than a single vpermpd
.)
FMA is a separate feature bit from AVX2, but in practice there aren't any CPUs with FMA3 but not AVX2. (AMD Bulldozer-family has 4-operand FMA4). So you should still check both the AVX2 and FMA feature bits, but you don't have to worry about your function being usable on fewer CPU models.
So if your code already depends on FMA or AVX2, then use AVX2:
_mm256_permute4x64_pd(vec, _MM_SHUFFLE(0,1,2,3)); // i.e. 0b00011011
If you only depend on AVX (not FMA or AVX2), and it's not worth making yet another version of your function for a small gain in shuffle performance, then use Mysticial's two-instruction solution for compatibility with SnB/IvB and AMD Bulldozer-family CPUs before Excavator.