Should I use SIMD or vector extensions or somethin

I'm currently develop an open source 3D application framework in c++ (with c++11). My own math library is designed like the XNA math library, also with SIMD in mind. But currently it is not really fast, and it has problems with memory alignes, but more about that in a different question.

Some days ago I asked myself why I should write my own SSE code. The compiler is also able to generate high optimized code when optimization is on. I can also use the "vector extension" of GCC. But this all is not really portable.

I know that I have more control when I use my own SSE code, but often this control is unnessary.

One big problem of SSE is the use of dynamic memory which is, with the help of memory pools and data oriented design, as much as possible limited.

Now to my question:

Should I use naked SSE? Perhaps encapsulated.

__m128 v1 = _mm_set_ps(0.5f, 2, 4, 0.25f);
__m128 v2 = _mm_set_ps(2, 0.5f, 0.25f, 4);

__m128 res = _mm_mul_ps(v1, v2);

Or should the compiler do the dirty work?

float v1 = {0.5f, 2, 4, 0.25f};
float v2 = {2, 0.5f, 0.25f, 4};

float res[4];
res[0] = v1[0]*v2[0];
res[1] = v1[1]*v2[1];
res[2] = v1[2]*v2[2];
res[3] = v1[3]*v2[3];

Or should I use SIMD with additional code? Like a dynamic container class with SIMD operations, which needs additional load and store instructions.
```
Pear3D::Vector4f* v1 = new Pear3D::Vector4f(0.5f, 2, 4, 0.25f);
Pear3D::Vector4f* v2 = new Pear3D::Vector4f(2, 0.5f, 0.25f, 4);

Pear3D::Vector4f res = Pear3D::Vector::multiplyElements(*v1, *v2);
```
The above example use a imaginary class with uses float[4] internal and uses store and load in each methods like multiplyElements(...). The methods uses SSE internal.

I don't want to use another library, because I want to learn more about SIMD and large scale software design. But library examples are welcome.

PS: This is not a real problem more a design question.

回答1:

Well, if you want to use SIMD extensions, a good approach is to use SSE intrinsics (of course stay by all means away from inline assembly, but fortunately you didn't list it as alternative, anyway). But for cleanliness you should encapsulate them in a nice vector class with overloaded operators:

struct aligned_storage
{
    //overload new and delete for 16-byte alignment
};

class vec4 : public aligned_storage
{
public:
    vec4(float x, float y, float z, float w)
    {
         data_[0] = x; ... data_[3] = w; //don't use _mm_set_ps, it will do the same, followed by a _mm_load_ps, which is unneccessary
    }
    vec4(float *data)
    {
         data_[0] = data[0]; ... data_[3] = data[3]; //don't use _mm_loadu_ps, unaligned just doesn't pay
    }
    vec4(const vec4 &rhs)
        : xmm_(rhs.xmm_)
    {
    }
    ...
    vec4& operator*=(const vec4 v)
    {
         xmm_ = _mm_mul_ps(xmm_, v.xmm_);
         return *this;
    }
    ...

private:
    union
    {
        __m128 xmm_;
        float data_[4];
    };
};

Now the nice thing is, due to the anonymous union (UB, I know, but show me a platform with SSE where this doesn't work) you can use the standard float array whenever neccessary (like operator[] or initialization (don't use _mm_set_ps)) and only use SSE when appropriate. With a modern inlining compiler the encapsulation comes at probably no cost (I was rather surprised how well VC10 optimized the SSE instructions for a bunch of computations with this vector class, no fear of unneccessary moves into temporary memory variables, as VC8 seemed to like even without encapsulation).

The only disadvantage is, that you need to take care of proper alignment, as unaligned vectors don't buy you anything and may even be slower than non-SSE. But fortunately the alignment requirement of the __m128 will propagate into the vec4 (and any surrounding class) and you just need to take care of dynamic allocation, which C++ has good means for. You just need to make a base class whose operator new and operator delete functions (in all flavours of course) are overloaded properly and from which your vector class will derive. To use your type with standard containers you of course also need to specialize std::allocator (and maybe std::get_temporary_buffer and std::return_temporary_buffer for the sake of completeness), as it will use the global operator new otherwise.

But the real disadvantage is, that you need to also care for the dynamic allocation of any class that has your SSE vector as member, which may be tedious, but can again be automated a bit by also deriving those classes from aligned_storage and putting the whole std::allocator specialization mess into a handy macro.

JamesWynn has a point that those operations often come together in some special heavy computation blocks (like texture filtering or vertex transformation), but on the other hand using those SSE vector encapsulations doesn't introduce any overhead over a standard float[4]-implementation of a vector class. You need to get those values from memory into registers anyway (be it the x87 stack or a scalar SSE register) in order to do any computations, so why not take em all at once (which should IMHO not be any slower than moving a single value if properly aligned) and compute in parallel. Thus you can freely switch out an SSE-inplementation for a non-SSE one without inducing any overhead (correct me if my reasoning is wrong).

But if the ensuring of alignment for all classes having vec4 as member is too tedious for you (which is IMHO the only disadvantage of this approach), you can also define a specialized SSE-vector type which you use for computations and use a standard non-SSE vector for storage.

EDIT: Ok, to look at the overhead argument, that goes around here (and looks quite reasonable at first), let's take a bunch of computations, which look very clean, due to overloaded operators:

#include "vec.h"
#include <iostream>

int main(int argc, char *argv[])
{
    math::vec<float,4> u, v, w = u + v;
    u = v + dot(v, w) * w;
    v = abs(u-w);
    u = 3.0f * w + v;
    w = -w * (u+v);
    v = min(u, w) + length(u) * w;
    std::cout << v << std::endl;
    return 0;
}

and see what VC10 thinks about it:

...
; 6   :     math::vec<float,4> u, v, w = u + v;

movaps  xmm4, XMMWORD PTR _v$[esp+32]

; 7   :     u = v + dot(v, w) * w;
; 8   :     v = abs(u-w);

movaps  xmm3, XMMWORD PTR __xmm@0
movaps  xmm1, xmm4
addps   xmm1, XMMWORD PTR _u$[esp+32]
movaps  xmm0, xmm4
mulps   xmm0, xmm1
haddps  xmm0, xmm0
haddps  xmm0, xmm0
shufps  xmm0, xmm0, 0
mulps   xmm0, xmm1
addps   xmm0, xmm4
subps   xmm0, xmm1
movaps  xmm2, xmm3

; 9   :     u = 3.0f * w + v;
; 10   :    w = -w * (u+v);

xorps   xmm3, xmm1
andnps  xmm2, xmm0
movaps  xmm0, XMMWORD PTR __xmm@1
mulps   xmm0, xmm1
addps   xmm0, xmm2

; 11   :    v = min(u, w) + length(u) * w;

movaps  xmm1, xmm0
mulps   xmm1, xmm0
haddps  xmm1, xmm1
haddps  xmm1, xmm1
sqrtss  xmm1, xmm1
addps   xmm2, xmm0
mulps   xmm3, xmm2
shufps  xmm1, xmm1, 0

; 12   :    std::cout << v << std::endl;

mov edi, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
mulps   xmm1, xmm3
minps   xmm0, xmm3
addps   xmm1, xmm0
movaps  XMMWORD PTR _v$[esp+32], xmm1
...

Even without thoroughly analyzing every single instruction and its use, I'm pretty confident to say that there aren't any unneccessary loads or stores, except the ones at the beginning (Ok, I left them uninitialized), which are neccessary anyway to get them from memory into computing registers, and at the end, which is neccessary as in the following expression v is gonna be put out. It didn't even store anything back into u and w, since they are only temporary variables which I don't use any further. Everything is perfectly inlined and optimized out. It even managed to seamlessly shuffle the result of the dot product for the following multiplication, without it leaving the XMM register, although the dot function returns a float using an actual _mm_store_ss after the haddpss.

So even I, being usually a bit oversuspicios of the compiler's abilities, have to say, that handcrafting your own intrinsics into special functions doesn't really pay compared to the clean and expressive code you gain by encapsulation. Though you may be able to create killer examples where handrafting the intrinics may indeed spare you some few instructions, but then again you first have to outsmart the optimizer.

EDIT: Ok, Ben Voigt pointed out another problem of the union besides the (most probably not problematic) memory layout incompatibility, which is that it is violating strict aliasing rules and the compiler may optimize instructions accessing different union members in a way that makes the code invalid. I haven't thought about that yet. I don't know if it makes any problems in practice, it certainly needs investigation.

If it really is a problem, we unfortunately need to drop the data_[4] member and use the __m128 alone. For initialization we now have to resort to _mm_set_ps and _mm_loadu_ps again. The operator[] gets a bit more complicated and might need some combination of _mm_shuffle_ps and _mm_store_ss. But for the non-const version you have to use some kind of proxy object delegating an assignment to the corresponding SSE instructions. It has to be investigated in which way the compiler can optimize this additional overhead in the specific situations then.

Or you only use the SSE-vector for computations and just make an interface for conversion to and from non-SSE vectors at a whole, which is then used at the peripherals of computations (as you often don't need to access individual components inside lengthy computations). This seems to be the way glm handles this issue. But I'm not sure how Eigen handles it.

But however you tackle it, there is still no need to handcraft SSE instrisics without using the benefits of operator overloading.

回答2:

I suggest that you learn about expression templates (custom operator implementations that use proxy objects). In this way, you can avoid doing performance-killing load/store around each individual operation, and do them only once for the entire computation.

回答3:

I would suggest using the naked simd code in a tightly controlled function. Since you won't be using it for your primary vector multiplication because of the overhead, this function should probably take the list of Vector3 objects that need to be manipulated, as per DOD. Where there's one, there is many.