I'm trying to understand possible bypass delays when switching between execution-unit domains.
For example, the following two lines of code give exactly the same result.
_mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
_mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
Which line of code is better to use?
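To back up the claim that both lines give the same result, here is a minimal sketch of how the equivalence could be checked. The helper names (`add_shifted_int`, `add_shifted_float`, `variants_agree`) are made up for illustration, and the check only compares output bits; it says nothing about latency or throughput.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */
#include <string.h>     /* memcmp */

/* Variant 1 (hypothetical name): byte shift done with an integer-domain instruction. */
static __m128 add_shifted_int(__m128 x) {
    return _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
}

/* Variant 2 (hypothetical name): same data movement done with a float shuffle. */
static __m128 add_shifted_float(__m128 x) {
    return _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
}

/* Returns 1 if both variants produce bit-identical output for {a, b, c, d}. */
static int variants_agree(float a, float b, float c, float d) {
    __m128 x = _mm_setr_ps(a, b, c, d);  /* element 0 = a, element 3 = d */
    float r1[4], r2[4];
    _mm_storeu_ps(r1, add_shifted_int(x));
    _mm_storeu_ps(r2, add_shifted_float(x));
    return memcmp(r1, r2, sizeof r1) == 0;
}
```

For an input of {a, b, c, d}, both variants should yield {a, b, c + a, d + b}, since the shifted operand is {0, 0, a, b} in each case.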
The assembly output for the first line gives:
vpslldq xmm1, xmm0, 8
vaddps xmm0, xmm1, xmm0
The assembly output for the second line gives:
vshufps xmm1, xmm0, XMMWORD PTR [rcx], 64 ; 00000040H
vaddps xmm2, xmm1, XMMWORD PTR [rcx]
Now, if I look at Agner Fog's microarchitecture manual, he gives an example on page 112 of using an integer shuffle (pshufd) on float values versus using a float shuffle (shufps) on float values. In that example, switching domains adds 4 extra clock cycles, so the solution using shufps is better.
The first line of code I listed, using _mm_slli_si128, has to switch domains between integer and float vectors. The second, using _mm_shuffle_ps, stays in the same domain. Doesn't this imply that the second line of code is the better solution?