Is it safe/possible/advisable to cast floats directly to __m128 if they are 16-byte aligned?
I noticed using _mm_load_ps and _mm_store_ps to "wrap" a raw array adds a significant overhead.
What are potential pitfalls I should be aware of?
EDIT:
There is actually no overhead in using the load and store instructions; I got some numbers mixed up, and that is why I appeared to get better performance. Even though I was able to do some HORRENDOUS mangling with raw memory addresses in a __m128 instance, when I ran the test it took TWICE AS LONG to complete without the _mm_load_ps instruction, probably falling back to some fail-safe code path.
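For readers unfamiliar with the "wrap" pattern mentioned in the question, here is a minimal sketch of what it might look like. The helper name add_arrays and its parameters are made up for illustration and do not come from the question; the sketch assumes all three pointers are 16-byte aligned and that n is a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_store_ps, _mm_add_ps

// Hypothetical helper (not from the question): adds two float arrays,
// four lanes at a time, using explicit aligned loads and stores.
// Assumption: a, b and out are 16-byte aligned and n is a multiple of 4.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);
    assert(n % 4 == 0);

    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);             // aligned load of 4 floats
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb));  // aligned store of the sums
    }
}
```

Calling it with arrays declared as alignas(16) float a[8] satisfies the alignment assumption; for heap memory an aligned allocator would be needed instead.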
There are several ways to put float values into SSE registers; the following intrinsics can be used:
The compiler will often create the same instructions no matter whether you state _mm_set_ss(val) or _mm_load_ss(&val) - try it and disassemble your code.
It can, in some cases, be advantageous to write _mm_set_ss(*valptr) instead of _mm_load_ss(valptr) ... depends on (the structure of) your code.
Going by http://msdn.microsoft.com/en-us/library/ayeb3ayc.aspx, it's possible but not safe or recommended.
And here's the reason why:
http://social.msdn.microsoft.com/Forums/en-US/vclanguage/thread/766c8ddc-2e83-46f0-b5a1-31acbb6ac2c5/
What makes you think that _mm_load_ps and _mm_store_ps "add a significant overhead"? This is the normal way to load/store float data to/from SSE registers, assuming the source/destination is memory (and any other method eventually boils down to this anyway).
The obvious issue I can see is that you're then aliasing (referring to a memory location by more than one pointer type), which can confuse the optimiser. The typical issue with aliasing is that, since the optimiser doesn't observe that you're modifying a memory location through the original pointer, it considers it to be unchanged.
Since you're obviously not using the optimiser to its full extent (or you'd be willing to rely on it to emit the correct SSE instructions) you'll probably be OK.
The problem with using the intrinsics yourself is that they're designed to operate on SSE registers, and can't use the instruction variants that load from a memory location and process it in a single instruction.
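To make the aliasing point concrete, here is a rough sketch contrasting the two access styles. The function names and the is_aligned16 helper are hypothetical, and whether the cast version is treated as well-defined depends on the compiler, so take it as an illustration of the concern rather than a recommendation. Both functions assume a 16-byte-aligned pointer to at least four floats.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Doubles four floats in place through explicit load/store intrinsics.
// The memory is only ever named through the float* here.
void double_via_load(float* data) {
    assert(is_aligned16(data));
    __m128 v = _mm_load_ps(data);
    _mm_store_ps(data, _mm_add_ps(v, v));
}

// Same operation through a casted pointer: `data` and `v` are now two
// differently typed pointers to the same memory, which is the aliasing
// situation described above. Assumption: the compiler in use tolerates
// accessing float storage through __m128*.
void double_via_cast(float* data) {
    assert(is_aligned16(data));
    __m128* v = reinterpret_cast<__m128*>(data);
    *v = _mm_add_ps(*v, *v);
}
```

Comparing the disassembly of the two functions is a quick way to see whether your compiler generates different code for them.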
A few years have passed since the question was asked. To answer the question, my experience shows:
YES
reinterpret_cast-casting a float* into a __m128* and vice versa is good as long as that float* is 16-byte-aligned - example (in MSVC 2012):