I need some ideas on how to write a cross-platform C++ implementation of a few parallelizable problems so that I can take advantage of SIMD (SSE, SPU, etc.) when it is available. I also want to be able to switch between SIMD and non-SIMD code at run time.
How would you suggest I approach this problem? (Of course, I don't want to implement each problem multiple times, once for every possible option.)
I can see that this might not be an easy task in C++, but I believe I'm missing something. So far my idea looks like this: a class cStream will be an array of a single field. Using multiple cStreams I can achieve SoA (Structure of Arrays). Then, using a few functors, I can fake the lambda function that I need executed over the whole cStream.
// just for example I'm not expecting this code to compile
cStream a; // something like float[1024]
cStream b;
cStream c;
void Foo()
{
    for_each(
        AssignSIMD(c, MulSIMD(AddSIMD(a, b), a)));
}
Here for_each will be responsible for incrementing the current pointer of each stream, as well as for inlining the functors' bodies in both the SIMD and the non-SIMD version.
Something like this:
// just for example I'm not expecting this code to compile
for_each(functor<T> f)
{
#ifdef USE_SIMD
    if (simdEnabled)
        real_for_each(f<true>()); // true means use SIMD
    else
#endif
        real_for_each(f<false>());
}
Notice that whether SIMD is enabled is checked only once, and that the loop wraps the main functor.
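For illustration, here is a minimal compilable sketch of that idea, assuming an x86 target with SSE. The names MulAddKernel and for_each_stream are hypothetical, std::vector<float> stands in for cStream, and the kernel is written by hand rather than composed from AddSIMD/MulSIMD functors. The runtime flag is tested once per call, outside the loop:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: SSE availability is detected from common compiler macros.
#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define USE_SIMD
#endif

// Hypothetical kernel computing c[i] = (a[i] + b[i]) * a[i].
// The bool parameter selects the SIMD or the scalar body at compile time.
template <bool UseSimd>
struct MulAddKernel;

template <>
struct MulAddKernel<false> {
    static void run(const float* a, const float* b, float* c, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            c[i] = (a[i] + b[i]) * a[i];
    }
};

#ifdef USE_SIMD
template <>
struct MulAddKernel<true> {
    static void run(const float* a, const float* b, float* c, std::size_t n) {
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {  // four floats per iteration
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(c + i, _mm_mul_ps(_mm_add_ps(va, vb), va));
        }
        // Leftover elements (fewer than four) go through the scalar body.
        MulAddKernel<false>::run(a + i, b + i, c + i, n - i);
    }
};
#endif

// Runtime switch, as in the question.
bool simdEnabled = true;

// The flag is checked once here; the loop itself lives inside the kernel.
template <template <bool> class Kernel>
void for_each_stream(const std::vector<float>& a,
                     const std::vector<float>& b,
                     std::vector<float>& c) {
    assert(a.size() == b.size() && a.size() == c.size());
#ifdef USE_SIMD
    if (simdEnabled) {
        Kernel<true>::run(a.data(), b.data(), c.data(), a.size());
        return;
    }
#endif
    Kernel<false>::run(a.data(), b.data(), c.data(), a.size());
}
```

A call would look like for_each_stream<MulAddKernel>(a, b, c). Each kernel still needs two bodies in this sketch; avoiding that duplication is what the functor composition in the question is meant to achieve.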
Notice that the given example decides what to execute at compile time (since you're using the preprocessor). In this case you can use more sophisticated techniques to decide what you actually want to execute, for example tag dispatching: http://cplusplus.co.il/2010/01/03/tag-dispatching/ Following the example shown there, the fast implementation could be the SIMD one and the slow implementation the non-SIMD one.
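A rough sketch of what tag dispatching could look like here, assuming an x86 target with SSE. All names (simd_tag, scalar_tag, simd_traits, add_streams, EXAMPLE_HAVE_SSE) are made up for this example and are not taken from the linked article. A traits class maps each element type to a tag, and overload resolution on the tag picks the implementation at compile time:

```cpp
#include <cassert>
#include <cstddef>

// Assumption: SSE availability is detected from common compiler macros.
#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define EXAMPLE_HAVE_SSE
#endif

// Empty tag types; they exist only to steer overload resolution.
struct scalar_tag {};
struct simd_tag {};

// Hypothetical traits class: every type is "slow" unless specialized.
template <typename T>
struct simd_traits { typedef scalar_tag tag; };

#ifdef EXAMPLE_HAVE_SSE
template <>
struct simd_traits<float> { typedef simd_tag tag; };
#endif

// Slow path: plain loop, works for any arithmetic type.
template <typename T>
void add_impl(const T* a, const T* b, T* out, std::size_t n, scalar_tag) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

#ifdef EXAMPLE_HAVE_SSE
// Fast path: four floats at a time, scalar loop for the remainder.
inline void add_impl(const float* a, const float* b, float* out,
                     std::size_t n, simd_tag) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i,
                      _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
#endif

// Public entry point: the tag object selects the overload at compile time.
template <typename T>
void add_streams(const T* a, const T* b, T* out, std::size_t n) {
    assert(a && b && out);
    add_impl(a, b, out, n, typename simd_traits<T>::tag());
}
```

The selection happens entirely at compile time, so a runtime on/off switch would still have to be layered on top of it.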
You might want to take a glance at my attempt at SIMD/non-SIMD:
vrep, a templated base class with specializations for SIMD (note how it distinguishes between floats-only SSE and SSE2, which introduced integer vectors).
The more useful v4f, v4i, etc. classes (subclassed via an intermediate v4).
Of course, it's far more geared towards 4-element vectors for rgba/xyz-type calculations than towards SoA, so it will completely run out of steam when 8-way AVX comes along, but the general principles might be useful.
The most impressive approach to SIMD scaling I've seen is the RTFact ray-tracing framework: slides, paper. Well worth a look. The researchers are closely associated with Intel (Saarbrücken now hosts the Intel Visual Computing Institute), so you can be sure forward scaling onto AVX and Larrabee was on their minds.
Intel's Ct "data parallelism" template library looks quite promising too.
If anyone is interested, this is the dirty code I came up with to test a new idea that I had while reading about the library that Paul posted.
Thanks Paul!
Have you thought about using existing solutions like liboil? It implements lots of common SIMD operations and can decide at runtime whether to use SIMD/non-SIMD code (using function pointers assigned by an initialization function).
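To show the general pattern (this is not liboil's actual API, just a sketch of the same idea), here is a function pointer that an initialization function points at either a scalar or an SSE routine. The names scale, scale_scalar, scale_sse and init_kernels are invented for the example, and the CPU check assumes GCC or Clang on x86, where __builtin_cpu_supports is available:

```cpp
#include <cassert>
#include <cstddef>

// Assumption: GCC or Clang targeting x86 with SSE enabled at compile time.
#if defined(__GNUC__) && defined(__SSE__)
#include <xmmintrin.h>
#define EXAMPLE_CAN_BUILD_SSE
#endif

typedef void (*scale_fn)(float* data, std::size_t n, float factor);

// Portable fallback.
static void scale_scalar(float* data, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= factor;
}

#ifdef EXAMPLE_CAN_BUILD_SSE
// SSE version: four floats per iteration, scalar loop for the remainder.
static void scale_sse(float* data, std::size_t n, float factor) {
    const __m128 f = _mm_set1_ps(factor);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(data + i, _mm_mul_ps(_mm_loadu_ps(data + i), f));
    for (; i < n; ++i)
        data[i] *= factor;
}
#endif

// All callers go through this pointer; it defaults to the safe version.
static scale_fn scale_impl = scale_scalar;

// Call once at startup, or again later to toggle SIMD at run time.
void init_kernels(bool allow_simd) {
    scale_impl = scale_scalar;
#ifdef EXAMPLE_CAN_BUILD_SSE
    __builtin_cpu_init();
    if (allow_simd && __builtin_cpu_supports("sse"))
        scale_impl = scale_sse;
#else
    (void)allow_simd;  // no SIMD build available, nothing to select
#endif
}

inline void scale(float* data, std::size_t n, float factor) {
    assert(data || n == 0);
    scale_impl(data, n, factor);
}
```

The cost is one indirect call per operation, so this works best when each call processes a whole array rather than a single element.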
You might want to look at the source for the MacSTL library for some ideas in this area: www.pixelglow.com/macstl/