I was reading this on MSDN, and it says
You should not access the __m128i fields directly. You can, however, see these types in the debugger. A variable of type __m128i maps to the XMM[0-7] registers.
However, it doesn't explain why. Why is it? For example, is the following "bad":
void func(unsigned short x, unsigned short y)
{
__m128i a;
a.m128i_i64[0] = x;
__m128i b;
b.m128i_i64[0] = y;
// Now do something with a and b ...
}
Instead of doing the assignments like in the example above, should one use some sort of load function?
The m128i_i64 field and its relatives are Microsoft compiler-specific extensions; they don't exist in most other compilers. Nevertheless, they are useful for testing purposes.
The real reason to avoid them is performance: the hardware cannot efficiently access individual elements of a SIMD vector.
- Before SSE4.1, there are no instructions that let you directly access individual elements. SSE4.1 adds extract/insert instructions, but they require a compile-time constant index.
- Going through memory can incur a very large penalty when store-to-load forwarding fails.
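To illustrate the constant-index restriction, here is a minimal sketch using _mm_extract_epi16 (PEXTRW, available since SSE2 and thus compilable everywhere without extra flags). The lane number must be an immediate; passing a runtime variable is a compile error. The function name is mine, for illustration only:

```c
#include <emmintrin.h>  /* SSE2 */

/* Extract 16-bit lane 3 of a vector. The index to _mm_extract_epi16
   must be a compile-time constant -- the hardware instruction encodes
   it as an immediate, so there is no runtime-indexed variant here. */
static int extract_lane3(__m128i v)
{
    return _mm_extract_epi16(v, 3);
}
```

A runtime-variable index forces the compiler to spill the vector to memory and do a scalar load, which is exactly the store-forwarding hazard described above.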
AVX and AVX2 don't extend the SSE4.1 instructions to allow accessing elements in a 256-bit vector. And as far as I can tell, AVX512 won't have it for 512-bit vectors either.
Likewise, the set intrinsics (such as _mm256_set_pd()) suffer from the same issue: they are implemented either as a series of data-shuffling operations, or by going through memory and taking the store-forwarding stalls.
Which raises the question: Is there an efficient way to populate a SIMD vector from scalar components? (Or to separate a SIMD vector into scalar components?)
Short Answer: Not really. When you use SIMD, you're expected to do most of the work in vectorized form, so the initialization overhead shouldn't matter.