I've noticed that accessing __m128 fields by index is possible in gcc, without using the union trick.
__m128 t;
float r(t[0] + t[1] + t[2] + t[3]);
I can also load a __m128 just like an array:
__m128 t{1.f, 2.f, 3.f, 4.f};
This is all in line with gcc's vector extensions. These, however, may not be available elsewhere. Are the loading and accessing features supported by the Intel compiler and MSVC?
To load a __m128, you can write _mm_setr_ps(1.f, 2.f, 3.f, 4.f), which is supported by GCC, ICC, MSVC and clang.
So far as I know, clang and recent versions of GCC support accessing __m128 fields by index. I don't know how to do this in ICC or MSVC. I guess _mm_extract_ps works for all 4 compilers (it needs SSE4.1), but its return type is insane: it returns an int holding the raw bits of the float rather than a float, which makes it painful to use.
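One way around _mm_extract_ps, offered here as an alternative sketch rather than something the answer above describes: shuffle the wanted lane into position 0 and read it with _mm_cvtss_f32. The helper name extract_lane is hypothetical. It needs only SSE, and the lane index has to be a compile-time constant because _mm_shuffle_ps takes an immediate.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_shuffle_ps, _mm_cvtss_f32

// Hypothetical helper (not part of any intrinsics header):
// returns lane I of v as a float, using SSE only.
template <int I>
inline float extract_lane(__m128 v) {
    static_assert(I >= 0 && I < 4, "lane index must be 0..3");
    // Copy lane I into every position, then read the lowest lane as a float.
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(I, I, I, I)));
}
```

With t = _mm_setr_ps(1.f, 2.f, 3.f, 4.f), extract_lane<0>(t) reads the first argument and extract_lane<3>(t) the last.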
If you want your code to work on other compilers, then don't use those GCC extensions. Use the set/load/store intrinsics. _mm_setr_ps is fine for setting constant values but should not be used in a loop. To access elements, I normally store the values to an array first and then read the array.
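The store-then-read approach might look like the following minimal sketch. The helper names get_lane and horizontal_sum are made up for illustration; only _mm_storeu_ps comes from the intrinsics header.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_storeu_ps
#include <cassert>

// Hypothetical helper: read lane i (0..3) by spilling the vector to a float array.
inline float get_lane(__m128 v, int i) {
    assert(i >= 0 && i < 4);
    float lanes[4];
    _mm_storeu_ps(lanes, v);  // unaligned store, so lanes needs no special alignment
    return lanes[i];
}

// Hypothetical helper: the portable equivalent of t[0] + t[1] + t[2] + t[3].
inline float horizontal_sum(__m128 v) {
    float lanes[4];
    _mm_storeu_ps(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Unlike the indexing extension, this compiles with any compiler that provides the SSE header, at the cost of a round trip through memory that the optimizer may or may not remove.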
If you have an array a, you should load from it and store to it with:
__m128 t = _mm_loadu_ps(a);
_mm_storeu_ps(a, t);
If the array is 16-byte aligned, you can use an aligned load/store, which is slightly faster on newer systems but much faster on older systems:
__m128 t = _mm_load_ps(a);
_mm_store_ps(a, t);
To get 16-byte aligned memory on the stack, use:
__declspec(align(16)) const float a[] = ... // MSVC
__attribute__((aligned(16))) const float a[] = ... // GCC, ICC
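If a C++11 compiler can be assumed (an assumption, since the answer only covers the compiler-specific spellings), the standard alignas specifier expresses the same thing without per-compiler syntax. The sketch below uses it with the aligned load/store; is_aligned16 and sum_doubled are hypothetical names used only for this example.

```cpp
#include <xmmintrin.h>  // SSE: _mm_load_ps, _mm_store_ps, _mm_add_ps
#include <cstdint>      // std::uintptr_t

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical example: doubles four stack floats with aligned load/store
// and returns the sum of the results.
inline float sum_doubled() {
    alignas(16) const float in[4] = {1.f, 2.f, 3.f, 4.f};  // C++11 alignment specifier
    alignas(16) float out[4];
    __m128 t = _mm_load_ps(in);  // aligned load: requires a 16-byte aligned address
    t = _mm_add_ps(t, t);
    _mm_store_ps(out, t);        // aligned store: same requirement
    return out[0] + out[1] + out[2] + out[3];
}
```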
For 16-byte aligned dynamic arrays use:
float *a = (float*)_mm_malloc(sizeof(float)*n, 16); //MSVC, GCC, ICC, MinGW
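A minimal sketch of using such a buffer in a loop follows. Memory obtained from _mm_malloc has to be released with _mm_free, not free. The function scaled_sum and its parameters are hypothetical, and the header comment states an assumption about where _mm_malloc is declared.

```cpp
#include <xmmintrin.h>  // SSE intrinsics; assumed to also declare _mm_malloc/_mm_free on the compilers listed above
#include <cstddef>      // std::size_t

// Hypothetical example: fills n floats with 1.0f, multiplies each by factor,
// and returns their sum.
inline float scaled_sum(std::size_t n, float factor) {
    float* a = static_cast<float*>(_mm_malloc(sizeof(float) * n, 16));
    if (a == nullptr) return 0.f;  // allocation failed (or n == 0 on some implementations)
    for (std::size_t i = 0; i < n; ++i) a[i] = 1.f;

    const __m128 f = _mm_set1_ps(factor);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // a is 16-byte aligned and i is a multiple of 4, so a + i is aligned too.
        _mm_store_ps(a + i, _mm_mul_ps(_mm_load_ps(a + i), f));
    }
    for (; i < n; ++i) a[i] *= factor;  // scalar tail when n is not a multiple of 4

    float sum = 0.f;
    for (std::size_t k = 0; k < n; ++k) sum += a[k];
    _mm_free(a);  // must pair with _mm_malloc
    return sum;
}
```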