Does anyone know an open-source C++ x86 SIMD intrinsics library?
Intel supplies exactly what I need in its Integrated Performance Primitives (IPP) library, but I can't use that because of the copyright restrictions attached to it.
EDIT
I already know the intrinsics provided by the compilers. What I need is a convenient interface to use them.
There are several libraries that have emerged in recent years to abstract explicit SIMD programming. The most important ones:
The most important thing to look for is a usable set of types that correctly abstract the best available SIMD registers and instructions for a given target, together with full portability to systems without SIMD support.
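To make the idea of such types concrete, here is a minimal sketch of a wrapper that uses SSE when the compiler targets it and falls back to plain scalar code otherwise. The names `float4` and `axpy` and the macro `WRAP_SKETCH_HAS_SSE` are made up for this illustration and are not taken from any of the libraries mentioned in this thread.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: these macros are a reasonable (not exhaustive) way to detect SSE.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>
  #define WRAP_SKETCH_HAS_SSE 1
#else
  #define WRAP_SKETCH_HAS_SSE 0
#endif

// Hypothetical wrapper type holding four packed floats.
struct float4 {
#if WRAP_SKETCH_HAS_SSE
    __m128 v;
    explicit float4(__m128 r) : v(r) {}
    explicit float4(float s) : v(_mm_set1_ps(s)) {}
    static float4 load(const float* p) { return float4(_mm_loadu_ps(p)); }
    void store(float* p) const { _mm_storeu_ps(p, v); }
    friend float4 operator+(float4 a, float4 b) { return float4(_mm_add_ps(a.v, b.v)); }
    friend float4 operator*(float4 a, float4 b) { return float4(_mm_mul_ps(a.v, b.v)); }
#else
    // Scalar fallback with the same interface for targets without SSE.
    float v[4];
    explicit float4(float s) { for (int i = 0; i < 4; ++i) v[i] = s; }
    static float4 load(const float* p) {
        float4 r(0.0f);
        for (int i = 0; i < 4; ++i) r.v[i] = p[i];
        return r;
    }
    void store(float* p) const { for (int i = 0; i < 4; ++i) p[i] = v[i]; }
    friend float4 operator+(float4 a, float4 b) {
        for (int i = 0; i < 4; ++i) a.v[i] += b.v[i];
        return a;
    }
    friend float4 operator*(float4 a, float4 b) {
        for (int i = 0; i < 4; ++i) a.v[i] *= b.v[i];
        return a;
    }
#endif
};

// out[i] = a * x[i] + y[i], written once against the wrapper type.
void axpy(float a, const float* x, const float* y, float* out, std::size_t n) {
    assert(x && y && out);
    const float4 va(a);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        (va * float4::load(x + i) + float4::load(y + i)).store(out + i);
    for (; i < n; ++i)  // leftover elements that do not fill a full vector
        out[i] = a * x[i] + y[i];
}
```

The calling code never mentions an intrinsic, so the same `axpy` source builds on a target without SSE. A real library would cover far more types and operations than this.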
Vc is another C++ library that implements vector classes and allows writing vectorized code that is independent from the actual instruction set that is used.
Take a look at libsimdpp, a header-only C++ SIMD wrapper library.
The library supports several instruction sets via a single interface: SSE2, SSE3, SSSE3, SSE4.1, AVX, AVX2, AVX512F, XOP, FMA3/4, NEON, NEONv2, Altivec. Clang, GCC, MSVC and ICC are all supported.
Any differences between instruction sets are resolved by implementing the missing instructions as a combination of supported ones. As a bonus, it's possible to compile the same code for several instruction sets, link the resulting object files to a single executable and use a convenient dynamic dispatch mechanism to run the implementation most tailored to the current processor.
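For readers unfamiliar with the dynamic dispatch idea, the following is a rough hand-rolled sketch of it. It is not libsimdpp's mechanism or API. It assumes GCC or Clang on x86-64 and relies on their `__builtin_cpu_supports` builtin and `target` function attribute. The function names (`add_scalar`, `add_sse`, `add_avx`, `select_add`, `add_arrays`) are invented for the example.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: runtime dispatch is only attempted with GCC/Clang on x86-64.
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
  #include <immintrin.h>
  #define DISPATCH_SKETCH_X86 1
#else
  #define DISPATCH_SKETCH_X86 0
#endif

// Portable baseline: works on any target.
static void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

#if DISPATCH_SKETCH_X86
// SSE version: 4 floats per iteration, scalar code for the tail.
static void add_sse(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    add_scalar(a + i, b + i, out + i, n - i);
}

// AVX version: 8 floats per iteration. The target attribute lets this one
// function use AVX even if the rest of the file is built without -mavx.
static __attribute__((target("avx")))
void add_avx(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(out + i,
                         _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
    add_scalar(a + i, b + i, out + i, n - i);
}
#endif

using add_fn = void (*)(const float*, const float*, float*, std::size_t);

// Query the CPU and return the widest implementation it can run.
static add_fn select_add() {
#if DISPATCH_SKETCH_X86
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx")) return add_avx;
    if (__builtin_cpu_supports("sse")) return add_sse;
#endif
    return add_scalar;
}

// Public entry point: the selection runs once, on the first call.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(a && b && out);
    static const add_fn impl = select_add();
    impl(a, b, out, n);
}
```

Here every variant is written by hand. The appeal of the approach described above is that one source is compiled several times for different instruction sets, so the per-target functions do not have to be duplicated like this.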
I wrote a GLSL-style library that compiles down to near-perfect assembly code.
A very common operation - cross product:
would be converted to this assembly code using glsl-sse2:
Please note the library isn't perfect yet and most likely has undiscovered bugs, as it is still new.
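For reference, the cross product mentioned above can be written directly with SSE intrinsics using a well-known shuffle trick. This is only an illustrative sketch, not the glsl-sse2 source or its output. The function name `cross3` and the macro `CROSS_SKETCH_HAS_SSE` are made up, and a scalar path is included for targets without SSE.

```cpp
#include <cassert>

// Assumption: these macros are a reasonable (not exhaustive) way to detect SSE.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>
  #define CROSS_SKETCH_HAS_SSE 1
#else
  #define CROSS_SKETCH_HAS_SSE 0
#endif

// out = a x b for 3-component vectors.
void cross3(const float a[3], const float b[3], float out[3]) {
    assert(a && b && out);
#if CROSS_SKETCH_HAS_SSE
    // Lane order is (x, y, z, w); w is padding and set to zero.
    const __m128 va = _mm_set_ps(0.0f, a[2], a[1], a[0]);
    const __m128 vb = _mm_set_ps(0.0f, b[2], b[1], b[0]);

    // Rotate lanes to (y, z, x, w).
    const __m128 a_yzx = _mm_shuffle_ps(va, va, _MM_SHUFFLE(3, 0, 2, 1));
    const __m128 b_yzx = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(3, 0, 2, 1));

    // c = a * b.yzx - a.yzx * b, which holds the result as (z, x, y, 0).
    const __m128 c = _mm_sub_ps(_mm_mul_ps(va, b_yzx), _mm_mul_ps(a_yzx, vb));

    // One more rotation puts the components back in (x, y, z) order.
    const __m128 r = _mm_shuffle_ps(c, c, _MM_SHUFFLE(3, 0, 2, 1));

    float tmp[4];
    _mm_storeu_ps(tmp, r);
    out[0] = tmp[0];
    out[1] = tmp[1];
    out[2] = tmp[2];
#else
    // Scalar fallback; locals first so that out may alias a or b.
    const float x = a[1] * b[2] - a[2] * b[1];
    const float y = a[2] * b[0] - a[0] * b[2];
    const float z = a[0] * b[1] - a[1] * b[0];
    out[0] = x;
    out[1] = y;
    out[2] = z;
#endif
}
```

The SSE path needs three shuffles, two multiplies and one subtract. A wrapper library would ideally hide these behind something like a `cross(a, b)` call while still generating this instruction sequence.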
You might want to look at macstl. Although it was originally developed for the Mac (and PowerPC), it now works on Linux and x86 too.
Also, if you're working with images then look at OpenCV - this has SSE-optimised routines for many common image processing tasks and has C and C++ APIs.
Have a look at AMD's SSEPlus project; it might be what you're after.