I'm currently developing an open source 3D application framework in C++ (with C++11). My own math library is designed like the XNA math library, also with SIMD in mind. But currently it is not really fast, and it has problems with memory alignment, but more about that in a different question.
Some days ago I asked myself why I should write my own SSE code. The compiler is also able to generate highly optimized code when optimization is turned on. I can also use the "vector extensions" of GCC. But none of this is really portable.
I know that I have more control when I use my own SSE code, but often this control is unnecessary.
One big problem of SSE is the use of dynamic memory, which I limit as much as possible with the help of memory pools and data-oriented design.
Now to my question:
Should I use naked SSE? Perhaps encapsulated.
```cpp
__m128 v1 = _mm_set_ps(0.5f, 2, 4, 0.25f);
__m128 v2 = _mm_set_ps(2, 0.5f, 0.25f, 4);
__m128 res = _mm_mul_ps(v1, v2);
```
Or should the compiler do the dirty work?
```cpp
float v1[4] = {0.5f, 2, 4, 0.25f};
float v2[4] = {2, 0.5f, 0.25f, 4};
float res[4];
res[0] = v1[0]*v2[0];
res[1] = v1[1]*v2[1];
res[2] = v1[2]*v2[2];
res[3] = v1[3]*v2[3];
```
Or should I use SIMD with additional code? Like a dynamic container class with SIMD operations, which needs additional `load` and `store` instructions:

```cpp
Pear3D::Vector4f* v1 = new Pear3D::Vector4f(0.5f, 2, 4, 0.25f);
Pear3D::Vector4f* v2 = new Pear3D::Vector4f(2, 0.5f, 0.25f, 4);
Pear3D::Vector4f res = Pear3D::Vector::multiplyElements(*v1, *v2);
```

The above example uses an imaginary class which uses `float[4]` internally and uses `store` and `load` in each method like `multiplyElements(...)`. The methods use SSE internally.
I don't want to use another library, because I want to learn more about SIMD and large-scale software design. But library examples are welcome.
PS: This is not a real problem, more a design question.
I would suggest using the naked SIMD code in a tightly controlled function. Since you won't be using it for your primary vector multiplication because of the overhead, this function should probably take the list of Vector3 objects that need to be manipulated, as per DOD. Where there's one, there are many.
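Something along these lines, as a minimal sketch (the function name and the packed xyzw layout are illustrative assumptions, not from the question):

```cpp
#include <xmmintrin.h>
#include <cstddef>

// One tightly controlled SSE function that processes a whole batch at once,
// so loads/stores are paid once per vector, not once per arithmetic operation.
// Assumes xyzw points to `count` packed, 16-byte-aligned 4-float vectors.
void scaleAll(float* xyzw, std::size_t count, float factor)
{
    const __m128 f = _mm_set1_ps(factor);
    for (std::size_t i = 0; i < count; ++i)
    {
        __m128 v = _mm_load_ps(xyzw + 4 * i);
        _mm_store_ps(xyzw + 4 * i, _mm_mul_ps(v, f));
    }
}
```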
Well, if you want to use SIMD extensions, a good approach is to use SSE intrinsics (and by all means stay away from inline assembly, but fortunately you didn't list it as an alternative anyway). But for cleanliness you should encapsulate them in a nice vector class with overloaded operators:
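Something like this minimal sketch (my reconstruction, assuming SSE3 for the horizontal adds used in `dot`, which come up again below):

```cpp
#include <pmmintrin.h>  // SSE3, for _mm_hadd_ps

struct vec4
{
    // anonymous union: SSE register for computation, float array for access
    union
    {
        __m128 mmvalue;
        float data_[4];
    };

    vec4() {}  // deliberately uninitialized, like a plain float[4]
    vec4(float x, float y, float z, float w)
    {
        data_[0] = x; data_[1] = y; data_[2] = z; data_[3] = w;
    }
    explicit vec4(__m128 mm) : mmvalue(mm) {}

    float& operator[](int i)       { return data_[i]; }
    float  operator[](int i) const { return data_[i]; }

    vec4 operator+(const vec4& o) const { return vec4(_mm_add_ps(mmvalue, o.mmvalue)); }
    vec4 operator-(const vec4& o) const { return vec4(_mm_sub_ps(mmvalue, o.mmvalue)); }
    vec4 operator*(const vec4& o) const { return vec4(_mm_mul_ps(mmvalue, o.mmvalue)); }
    vec4 operator*(float s)       const { return vec4(_mm_mul_ps(mmvalue, _mm_set1_ps(s))); }
};

// dot product: multiply, horizontally add twice, move the scalar out
inline float dot(const vec4& a, const vec4& b)
{
    __m128 m = _mm_mul_ps(a.mmvalue, b.mmvalue);
    m = _mm_hadd_ps(m, m);
    m = _mm_hadd_ps(m, m);
    float r;
    _mm_store_ss(&r, m);
    return r;
}
```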
Now the nice thing is that, due to the anonymous union (UB, I know, but show me a platform with SSE where this doesn't work), you can use the standard float array whenever necessary (like for `operator[]` or initialization, where you shouldn't use `_mm_set_ps`) and only use SSE when appropriate. With a modern inlining compiler the encapsulation comes at probably no cost (I was rather surprised how well VC10 optimized the SSE instructions for a bunch of computations with this vector class; no fear of unnecessary moves into temporary memory variables, which VC8 seemed to like even without encapsulation).

The only disadvantage is that you need to take care of proper alignment, as unaligned vectors don't buy you anything and may even be slower than non-SSE. But fortunately the alignment requirement of the `__m128` will propagate into the `vec4` (and any surrounding class), and you just need to take care of dynamic allocation, which C++ has good means for. You just need to make a base class whose `operator new` and `operator delete` functions (in all flavours, of course) are overloaded properly, and from which your vector class will derive. To use your type with standard containers you of course also need to specialize `std::allocator` (and maybe `std::get_temporary_buffer` and `std::return_temporary_buffer` for the sake of completeness), as it will use the global `operator new` otherwise.
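A minimal sketch of such a base class (the `_mm_malloc`-based implementation is an assumption on my part, and the name is chosen to match the text below, not `std::aligned_storage`):

```cpp
#include <xmmintrin.h>  // _mm_malloc / _mm_free
#include <new>          // std::bad_alloc
#include <cstddef>

// anything deriving from this gets 16-byte-aligned dynamic allocation
struct aligned_storage
{
    static void* operator new(std::size_t size)
    {
        void* p = _mm_malloc(size, 16);
        if (!p) throw std::bad_alloc();
        return p;
    }
    static void operator delete(void* p) { _mm_free(p); }

    // the array (and nothrow) flavours are overloaded analogously
    static void* operator new[](std::size_t size)
    {
        void* p = _mm_malloc(size, 16);
        if (!p) throw std::bad_alloc();
        return p;
    }
    static void operator delete[](void* p) { _mm_free(p); }
};

// struct vec4 : aligned_storage { ... as above ... };
```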
But the real disadvantage is that you also need to care for the dynamic allocation of any class that has your SSE vector as a member, which may be tedious, but can again be automated a bit by also deriving those classes from `aligned_storage` and putting the whole `std::allocator` specialization mess into a handy macro.
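To give the idea of the allocator part, here is the gist as a small stand-alone C++11 allocator instead of an actual `std::allocator` specialization (a hedged sketch, again assuming `_mm_malloc`):

```cpp
#include <xmmintrin.h>
#include <new>
#include <cstddef>

// minimal allocator handing out 16-byte-aligned memory for standard containers
template<class T>
struct aligned_allocator
{
    typedef T value_type;

    aligned_allocator() {}
    template<class U> aligned_allocator(const aligned_allocator<U>&) {}

    T* allocate(std::size_t n)
    {
        void* p = _mm_malloc(n * sizeof(T), 16);
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { _mm_free(p); }
};

template<class T, class U>
bool operator==(const aligned_allocator<T>&, const aligned_allocator<U>&) { return true; }
template<class T, class U>
bool operator!=(const aligned_allocator<T>&, const aligned_allocator<U>&) { return false; }

// usage: std::vector<vec4, aligned_allocator<vec4> > positions;
```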
JamesWynn has a point that those operations often come together in some special heavy computation blocks (like texture filtering or vertex transformation), but on the other hand using those SSE vector encapsulations doesn't introduce any overhead over a standard `float[4]` implementation of a vector class. You need to get those values from memory into registers anyway (be it the x87 stack or a scalar SSE register) in order to do any computations, so why not take them all at once (which should IMHO not be any slower than moving a single value, if properly aligned) and compute in parallel. Thus you can freely switch an SSE implementation for a non-SSE one without inducing any overhead (correct me if my reasoning is wrong).
But if ensuring alignment for all classes having `vec4` as a member is too tedious for you (which is IMHO the only disadvantage of this approach), you can also define a specialized SSE vector type which you use for computations, and use a standard non-SSE vector for storage.

EDIT: Ok, to look at the overhead argument that goes around here (and looks quite reasonable at first), let's take a bunch of computations, which look very clean thanks to the overloaded operators:
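The exact expressions aren't reproduced here; something along these lines, using the `vec4` sketched above:

```cpp
// u, v, w deliberately left uninitialized; only v is used afterwards
vec4 u, v, w;
u = v + w;           // u is just a temporary
w = u * dot(u, v);   // the dot result feeds straight into a multiplication
v = u - w;           // v is output afterwards, so this result must be stored
```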
and see what VC10 thinks about it. Even without thoroughly analyzing every single instruction of the generated assembly and its use, I'm pretty confident in saying that there aren't any unnecessary loads or stores, except the ones at the beginning (Ok, I left the vectors uninitialized), which are necessary anyway to get the values from memory into computing registers, and the one at the end, which is necessary because `v` is output in a following expression. It didn't even store anything back into `u` and `w`, since they are only temporary variables which I don't use any further. Everything is perfectly inlined and optimized out. It even managed to seamlessly shuffle the result of the dot product for the following multiplication, without it ever leaving the XMM register, although the `dot` function returns a `float` using an actual `_mm_store_ss` after the `haddps`s.

So even I, being usually a bit over-suspicious of the compiler's abilities, have to say that handcrafting your own intrinsics into special functions doesn't really pay off compared to the clean and expressive code you gain by encapsulation. You may be able to create killer examples where handcrafting the intrinsics does indeed spare you a few instructions, but then again you first have to outsmart the optimizer.
EDIT: Ok, Ben Voigt pointed out another problem with the union besides the (most probably unproblematic) memory layout incompatibility: it violates strict aliasing rules, so the compiler may optimize instructions accessing different union members in a way that makes the code invalid. I hadn't thought about that yet. I don't know if it causes any problems in practice; it certainly needs investigation.
If it really is a problem, we unfortunately need to drop the `data_[4]` member and use the `__m128` alone. For initialization we now have to resort to `_mm_set_ps` and `_mm_loadu_ps` again. The `operator[]` gets a bit more complicated and might need some combination of `_mm_shuffle_ps` and `_mm_store_ss`. For the non-const version you have to use some kind of proxy object delegating an assignment to the corresponding SSE instructions (a sketch follows below). It has to be investigated how well the compiler can optimize away this additional overhead in specific situations.

Or you only use the SSE vector for computations and make an interface for converting to and from non-SSE vectors as a whole, which is then used at the peripheries of computations (as you often don't need to access individual components inside lengthy computations). This seems to be the way glm handles this issue. But I'm not sure how Eigen handles it.
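A rough sketch of that proxy idea (simplified: it spills the whole register instead of shuffling, and all names are my own):

```cpp
#include <xmmintrin.h>

struct vec4
{
    __m128 mmvalue;  // no float array member, so no aliasing concerns

    vec4() {}
    vec4(float x, float y, float z, float w)
        : mmvalue(_mm_set_ps(w, z, y, x)) {}  // note the reversed argument order
    explicit vec4(__m128 mm) : mmvalue(mm) {}

    // const access: spill the register once and pick a component
    // (a shuffle + _mm_store_ss would avoid the full store, as mentioned above)
    float operator[](int i) const
    {
        alignas(16) float tmp[4];
        _mm_store_ps(tmp, mmvalue);
        return tmp[i];
    }

    // proxy that turns "v[i] = x" into a load/modify/store on the __m128
    struct component_ref
    {
        vec4& v;
        int i;
        operator float() const { return static_cast<const vec4&>(v)[i]; }
        component_ref& operator=(float f)
        {
            alignas(16) float tmp[4];
            _mm_store_ps(tmp, v.mmvalue);
            tmp[i] = f;
            v.mmvalue = _mm_load_ps(tmp);
            return *this;
        }
    };

    component_ref operator[](int i) { return component_ref{*this, i}; }
};
```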
But however you tackle it, there is still no need to handcraft SSE intrinsics without using the benefits of operator overloading.
I suggest that you learn about expression templates (custom operator implementations that use proxy objects). In this way, you can avoid doing performance-killing load/store around each individual operation, and do them only once for the entire computation.
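A minimal sketch of the idea (hypothetical types, heavily simplified compared to a real expression-template library):

```cpp
#include <xmmintrin.h>

// operators build lightweight expression nodes instead of computing immediately;
// the whole tree is evaluated in SSE registers on assignment, with one final store

template<class L, class R>
struct AddExpr
{
    const L& l; const R& r;
    __m128 eval() const { return _mm_add_ps(l.eval(), r.eval()); }
};

template<class L, class R>
struct MulExpr
{
    const L& l; const R& r;
    __m128 eval() const { return _mm_mul_ps(l.eval(), r.eval()); }
};

struct Vec4
{
    __m128 mm;
    __m128 eval() const { return mm; }

    template<class E>
    Vec4& operator=(const E& e)
    {
        mm = e.eval();  // entire expression tree evaluated register-to-register
        return *this;
    }
};

// in real code these would be constrained (e.g. via enable_if) to expression types;
// expression objects hold references, so they must not outlive the full expression
template<class L, class R>
AddExpr<L, R> operator+(const L& l, const R& r) { return AddExpr<L, R>{l, r}; }
template<class L, class R>
MulExpr<L, R> operator*(const L& l, const R& r) { return MulExpr<L, R>{l, r}; }

// usage: for Vec4 a, b, c, res;  "res = a * b + c;" builds
// AddExpr<MulExpr<Vec4, Vec4>, Vec4> and issues mulps/addps plus one store.
```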