Reverse an AVX register containing doubles using a

Posted 2019-01-28 10:38

Question:

If I have an AVX register with 4 doubles in it and I want to store the reverse of this in another register, is it possible to do this with a single intrinsic?

For example: If I had 4 floats in a SSE register, I could use:

_mm_shuffle_ps(A,A,_MM_SHUFFLE(0,1,2,3));

Can I do this with, say, _mm256_permute2f128_pd()? I don't think that intrinsic can address each individual double.

Answer 1:

You actually need 2 permutes to do this:

  • _mm256_permute2f128_pd() only permutes in 128-bit chunks.
  • _mm256_permute_pd() does not permute across 128-bit boundaries.

So you need to use both:

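#include <immintrin.h>

// Reverse the four doubles in x (needs AVX only).
inline __m256d reverse(__m256d x){
    x = _mm256_permute2f128_pd(x,x,1);  // swap the two 128-bit halves
    x = _mm256_permute_pd(x,5);         // swap the two doubles within each half
    return x;
}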

Test:

#include <iostream>

int main(){
    __m256d x = _mm256_set_pd(13,12,11,10);  // element 0 = 10, element 3 = 13
    double v[4];

    _mm256_storeu_pd(v, x);  // portable alternative to MSVC's x.m256d_f64
    std::cout << v[0] << "  " << v[1] << "  " << v[2] << "  " << v[3] << std::endl;
    x = reverse(x);
    _mm256_storeu_pd(v, x);
    std::cout << v[0] << "  " << v[1] << "  " << v[2] << "  " << v[3] << std::endl;
}

Output:

10  11  12  13
13  12  11  10
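
To see why the immediates 1 and 5 give a full reversal, here is a minimal scalar sketch of how the two permutes decode their immediate. The names Vec4, model_permute2f128, model_permute_pd and model_reverse are hypothetical helpers made up for illustration, not intrinsics, and the permute2f128 model deliberately ignores the zeroing bits of the immediate.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<double, 4>;  // index 0 is the lowest element

// Scalar model of _mm256_permute2f128_pd(a, b, imm), zeroing bits not modeled.
// Each 2-bit selector picks a 128-bit half: 0 = a.lo, 1 = a.hi, 2 = b.lo, 3 = b.hi.
Vec4 model_permute2f128(const Vec4& a, const Vec4& b, int imm) {
    assert((imm & 0x88) == 0);  // bits 3 and 7 would zero a half; unsupported here
    auto half = [&](int sel) -> const double* {
        const Vec4& src = (sel & 2) ? b : a;
        return src.data() + ((sel & 1) ? 2 : 0);
    };
    const double* lo = half(imm & 3);         // bits 1:0 choose the low result half
    const double* hi = half((imm >> 4) & 3);  // bits 5:4 choose the high result half
    return Vec4{{lo[0], lo[1], hi[0], hi[1]}};
}

// Scalar model of _mm256_permute_pd(a, imm): bit i of imm picks the low (0)
// or high (1) double of the 128-bit lane that result element i lives in.
Vec4 model_permute_pd(const Vec4& a, int imm) {
    return Vec4{{a[(imm >> 0) & 1],       a[(imm >> 1) & 1],
                 a[2 + ((imm >> 2) & 1)], a[2 + ((imm >> 3) & 1)]}};
}

// The two steps of reverse(): swap halves, then swap within each half.
Vec4 model_reverse(const Vec4& x) {
    Vec4 swapped = model_permute2f128(x, x, 1);  // {x2, x3, x0, x1}
    return model_permute_pd(swapped, 5);         // {x3, x2, x1, x0}
}
```

Neither step can be dropped: the first only moves whole 128-bit halves, and the second never moves a double out of its half.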


Answer 2:

With AVX2, a single VPERMPD ymm1, ymm2/m256, imm8 can do the whole reversal. It runs with the same throughput and latency as other lane-crossing shuffles (like VPERM2F128) on Intel CPUs. (On AMD Excavator, if these numbers are right, vperm2f128 is slower than a single vpermpd.)

FMA is a separate feature bit from AVX2, and in practice very few CPUs have FMA3 without AVX2. On Intel, both arrived together with Haswell. The exceptions are AMD Piledriver and Steamroller, which have FMA3 (alongside the Bulldozer-family 4-operand FMA4) but no AVX2. So you should still check both the AVX2 and FMA feature bits, but requiring AVX2 on top of FMA makes your function unusable on only a few more CPU models.
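
As a rough sketch of checking both feature bits at run time, assuming GCC or Clang on x86 (they provide __builtin_cpu_supports; MSVC would need its own CPUID query instead), something like this could work. The function name has_avx2_and_fma is made up for illustration.

```cpp
// Returns true only when the CPU reports both AVX2 and FMA3.
// Assumption: GCC or Clang targeting x86; anything else takes the fallback.
bool has_avx2_and_fma() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // harmless here; required only in early constructors
    return __builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma");
#else
    return false;  // conservative fallback: treat the features as unavailable
#endif
}
```

A dispatcher would call this once at startup and pick the one-instruction or two-instruction version accordingly.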


So if your code already depends on FMA or AVX2, then use AVX2:

_mm256_permute4x64_pd(vec, _MM_SHUFFLE(0,1,2,3));  // i.e. 0b00011011
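
To check that this immediate really reverses the vector, here is a small scalar sketch of how the vpermpd immediate is decoded: result element i takes the source element named by bits 2i+1:2i of the immediate. Vec4, shuffle_imm and model_permute4x64 are hypothetical names for illustration; shuffle_imm mirrors what the _MM_SHUFFLE macro computes.

```cpp
#include <array>

using Vec4 = std::array<double, 4>;  // index 0 is the lowest element

// Hypothetical stand-in for the _MM_SHUFFLE(z, y, x, w) macro.
constexpr int shuffle_imm(int z, int y, int x, int w) {
    return (z << 6) | (y << 4) | (x << 2) | w;
}

// Scalar model of _mm256_permute4x64_pd(v, imm):
// result element i = v[(imm >> (2 * i)) & 3], free to cross the 128-bit boundary.
Vec4 model_permute4x64(const Vec4& v, int imm) {
    Vec4 dst{};
    for (int i = 0; i < 4; ++i) {
        dst[i] = v[(imm >> (2 * i)) & 3];
    }
    return dst;
}

// The selectors 3, 2, 1, 0 (read from element 0 upward) pack to 0b00011011.
static_assert(shuffle_imm(0, 1, 2, 3) == 0x1B, "reversal immediate is 0x1B");
```

Because every selector is a full 2-bit index into all four elements, one instruction is enough here, unlike the AVX-only permutes.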

If you depend only on AVX, not on FMA or AVX2, and a small gain in shuffle performance does not justify yet another version of your function, then use Mysticial's two-instruction solution (the first answer above). It stays compatible with SnB/IvB and with AMD Bulldozer-family CPUs before Excavator.