Is _mm_broadcast_ss faster than _mm_set1_ps?

Posted 2019-01-26 11:59

Question:

Is this code

float a = ...;
__m128 b = _mm_broadcast_ss(&a);

always faster than this code

float a = ...;
__m128 b = _mm_set1_ps(a);

?

What if a is defined as static const float a = ... rather than float a = ...?

Answer 1:

_mm_broadcast_ss is likely to be faster than _mm_set1_ps. The former translates into a single instruction (VBROADCASTSS), while the latter is emulated using multiple instructions (probably a MOVSS followed by a shuffle). However, _mm_broadcast_ss requires the AVX instruction set, while _mm_set1_ps requires only SSE.
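To make the comparison concrete, here is a minimal sketch of the two splats side by side. The function names are hypothetical, and the target attribute is an assumption that you are building with GCC or Clang; it only serves to let the AVX intrinsic compile without passing -mavx for the whole file.

```c
#include <immintrin.h>  /* SSE and AVX intrinsics */

/* Needs only SSE: the compiler chooses the instruction sequence. */
static __m128 splat_set1(float a) {
    return _mm_set1_ps(a);
}

/* Needs AVX: the intrinsic takes a pointer, so the source is a memory operand.
   The target attribute is GCC/Clang-specific (an assumption of this sketch). */
__attribute__((target("avx")))
static __m128 splat_broadcast(const float *p) {
    return _mm_broadcast_ss(p);
}
```

Both functions return the same four-lane vector; they differ only in the instruction set they require and in where the source value has to live.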



Answer 2:

If you target the AVX instruction set, gcc will use VBROADCASTSS to implement the _mm_set1_ps intrinsic. Clang, however, will use two instructions (VMOVSS + VPSHUFD).



Answer 3:

_mm_broadcast_ss has weaknesses imposed by the architecture which are largely hidden by the _mm SSE API. The most important difference is as follows:

  • _mm_broadcast_ss is limited to loading values from memory only.

What this means is that if you use _mm_broadcast_ss explicitly in a situation where the source is not in memory, the result will likely be less efficient than using _mm_set1_ps. This typically happens when loading immediate values (constants), or when using the result of a recent calculation. In those situations the compiler keeps the value in a register. To use it for a broadcast, the compiler must first spill the value back to memory. Alternatively, a pshufd could be used to splat directly from the register.

_mm_set1_ps is implementation-defined rather than being mapped to a specific underlying CPU instruction. That means the compiler might use one of several SSE instructions to perform the splat. A smart compiler with AVX support enabled should use vbroadcastss internally when appropriate, but whether it does depends on how mature AVX support is in the compiler's optimizer.

If you're very confident you're loading from memory -- such as iterating over an array of data -- then direct use of broadcast is fine. But if there's any doubt at all, I would recommend sticking with _mm_set1_ps.
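As a rough illustration of that rule of thumb, the sketch below uses the broadcast when walking an array (the source is already in memory) and _mm_set1_ps for a freshly computed value (the source is probably in a register). The function names and the output layout are hypothetical, and the target attribute again assumes GCC or Clang.

```c
#include <immintrin.h>
#include <stddef.h>

/* Source is in memory: for each src[i], write four copies to out[4*i .. 4*i+3].
   out must have room for 4 * n floats. */
__attribute__((target("avx")))
void splat_each(const float *src, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        _mm_storeu_ps(out + 4 * i, _mm_broadcast_ss(&src[i]));
    }
}

/* Source is a just-computed value: let the compiler pick the splat. */
__m128 splat_sum(float x, float y) {
    float s = x + y;  /* likely held in a register at this point */
    return _mm_set1_ps(s);
}
```

Taking &s and passing it to _mm_broadcast_ss in splat_sum would also work, but it asks for a memory operand that the value may not have had otherwise.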

And in the specific case of a static const float, you absolutely want to avoid using _mm_broadcast_ss().
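A small sketch of the static const case, with a hypothetical constant k_scale and helper scale4. With _mm_set1_ps the compiler sees a known value and is free to fold the splat into a ready-made vector constant; that folding is what optimizing compilers typically do, not something the intrinsic guarantees.

```c
#include <immintrin.h>

static const float k_scale = 0.5f;  /* hypothetical constant */

/* Multiply all four lanes by the constant. */
__m128 scale4(__m128 v) {
    return _mm_mul_ps(v, _mm_set1_ps(k_scale));
}
```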