Is it possible to cast floats directly to __m128 i

Posted 2019-01-14 23:02

Is it safe/possible/advisable to cast floats directly to __m128 if they are 16 byte aligned?

I noticed using _mm_load_ps and _mm_store_ps to "wrap" a raw array adds a significant overhead.

What are potential pitfalls I should be aware of?

EDIT:

There is actually no overhead in using the load and store instructions; I got some numbers mixed up, and that is why I thought I was seeing better performance. Even though I was able to do some HORRENDOUS mangling with raw memory addresses in a __m128 instance, when I ran the test it took TWICE AS LONG to complete without the _mm_load_ps instruction, probably falling back to some fail-safe code path.

5 answers
爱情/是我丢掉的垃圾
Reply #2 · 2019-01-14 23:15

There are several ways to put float values into SSE registers; the following intrinsics can be used:

__m128 sseval;
float a, b, c, d;

sseval = _mm_set_ps(a, b, c, d);  // a goes to the highest element, d to the lowest (element 0)
sseval = _mm_setr_ps(a, b, c, d); // reversed order: a goes to element 0, d to the highest
sseval = _mm_load_ps(&a);         // ill-specified here - "a" is not a float[] ...
                                  // same as _mm_setr_ps(a[0], a[1], a[2], a[3])
                                  // if you have an actual, 16-byte-aligned array

sseval = _mm_set1_ps(a);          // make vector from [ a, a, a, a ]
sseval = _mm_load1_ps(&a);        // load from &a, replicate - same as previous

sseval = _mm_set_ss(a);           // make vector from [ a, 0, 0, 0 ]
sseval = _mm_load_ss(&a);         // load from &a, zero others - same as prev

The compiler will often create the same instructions no matter whether you state _mm_set_ss(val) or _mm_load_ss(&val) - try it and disassemble your code.

It can, in some cases, be advantageous to write _mm_set_ss(*valptr) instead of _mm_load_ss(valptr) ... depends on (the structure of) your code.
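
The element order of the set/setr/load intrinsics listed above is easy to get backwards, so here is a minimal sketch that checks it by storing each vector back to memory. The helper names lanes and check_lane_order are made up for this illustration and are not part of any library; it assumes an x86 target with SSE and a C++11 compiler.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <array>
#include <cassert>

// Hypothetical helper: copy a vector's four lanes out in memory order
// (index 0 is the lowest element).
std::array<float, 4> lanes(__m128 v) {
    std::array<float, 4> out;
    _mm_storeu_ps(out.data(), v);  // unaligned store, so out needs no special alignment
    return out;
}

// Hypothetical check: an aligned load, _mm_setr_ps in array order and
// _mm_set_ps in reversed order all produce the same vector.
void check_lane_order() {
    alignas(16) float arr[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const auto loaded = lanes(_mm_load_ps(arr));
    assert(loaded == lanes(_mm_setr_ps(arr[0], arr[1], arr[2], arr[3])));
    assert(loaded == lanes(_mm_set_ps(arr[3], arr[2], arr[1], arr[0])));
}
```

Calling check_lane_order() in a debug build should pass silently; swapping the argument order in either call makes the corresponding assertion fire.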

等我变得足够好
Reply #3 · 2019-01-14 23:17

Going by http://msdn.microsoft.com/en-us/library/ayeb3ayc.aspx, it's possible but not safe or recommended.

You should not access the __m128 fields directly.


And here's the reason why:

http://social.msdn.microsoft.com/Forums/en-US/vclanguage/thread/766c8ddc-2e83-46f0-b5a1-31acbb6ac2c5/

  1. Casting float* to __m128 will not work. The C++ compiler converts an assignment to a __m128 into an SSE instruction that loads 4 floats into an SSE register. Even if this cast compiles, it doesn't produce working code, because the SSE load instruction is not generated.

A __m128 variable is not actually a variable or an array. It is a placeholder for an SSE register, which the C++ compiler replaces with SSE assembly instructions. To understand this better, read the Intel Assembly Programming Reference.

家丑人穷心不美
Reply #4 · 2019-01-14 23:33

What makes you think that _mm_load_ps and _mm_store_ps "add a significant overhead"? This is the normal way to load/store float data to/from SSE registers, assuming the source/destination is memory (and any other method eventually boils down to this anyway).
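
As a rough illustration of that normal pattern, the sketch below wraps a raw float array with an explicit load, one arithmetic operation, and a store. The function name scale is hypothetical, and the code assumes the pointer is 16-byte aligned and the element count is a multiple of 4; real code would also need a scalar tail loop.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: multiply n floats by factor, in place.
// Assumptions: data is 16-byte aligned and n is a multiple of 4.
void scale(float* data, std::size_t n, float factor) {
    assert(reinterpret_cast<std::uintptr_t>(data) % 16 == 0);
    assert(n % 4 == 0);
    const __m128 f = _mm_set1_ps(factor);      // broadcast factor to all four lanes
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(data + i);      // aligned load of four floats
        v = _mm_mul_ps(v, f);                  // four multiplications at once
        _mm_store_ps(data + i, v);             // aligned store back to the array
    }
}
```

With optimisation enabled, compilers typically turn each load/store intrinsic into a single move instruction or fold it into the arithmetic instruction, which is worth confirming by disassembling the build in question.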

冷血范
Reply #5 · 2019-01-14 23:34

The obvious issue I can see is that you're then aliasing (referring to a memory location through more than one pointer type), which can confuse the optimiser. The typical problem with aliasing is that, since the optimiser doesn't observe that you're modifying a memory location through the original pointer, it considers it to be unchanged.

Since you're obviously not using the optimiser to its full extent (or you'd be willing to rely on it to emit the correct SSE instructions) you'll probably be OK.

The problem with using the intrinsics yourself is that they're designed to operate on SSE registers, and can't use the instruction variants that load from a memory location and process it in a single instruction.

迷人小祖宗
Reply #6 · 2019-01-14 23:36

A few years have passed since the question was asked. To answer the question, my experience shows:

YES

reinterpret_cast-casting a float* into a __m128* and vice versa is good as long as that float* is 16-byte-aligned - example (in MSVC 2012):

__declspec( align( 16 ) ) float f[4];
return _mm_mul_ps( _mm_set_ps1( 1.f ), *reinterpret_cast<__m128*>( f ) );
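
Because the cast is only valid for 16-byte-aligned pointers, it may be worth guarding it with a check. The following is a minimal sketch using standard alignas instead of the MSVC-specific __declspec; the names is_aligned16 and doubled are hypothetical, and the factor 2.0f is chosen purely for illustration.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if p sits on a 16-byte boundary.
bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: view four aligned floats as a __m128 through a
// pointer cast (no explicit load intrinsic) and double each element.
__m128 doubled(const float* f) {
    assert(is_aligned16(f));  // a misaligned pointer here can fault at run time
    return _mm_mul_ps(_mm_set1_ps(2.0f), *reinterpret_cast<const __m128*>(f));
}
```

A caller would declare its buffer as alignas(16) float f[4] and pass f directly. The assertion only fires in debug builds, so it documents the requirement rather than enforcing it in release code.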