I will ask my question by giving an example. I have a function called do_something(). It has three versions: do_something(), do_something_sse3(), and do_something_sse4(). When my program runs, it detects the CPU features (whether the CPU supports SSE3 or SSE4) and calls one of the three versions accordingly.
The problem is: when I build my program with GCC, I have to set -msse4 for do_something_sse4() to compile (e.g. for the header file <smmintrin.h> to be included).
However, if I set -msse4, then GCC is allowed to use SSE4 instructions everywhere, and some intrinsics in do_something_sse3() are also translated to SSE4 instructions. So if my program runs on a CPU that has only SSE3 (but no SSE4) support, it fails with "illegal instruction" when it calls do_something_sse3().
Maybe I am following a bad practice. Could you give some suggestions? Thanks.
I think Mystical's tip is fine, but if you really want to do it in one file, you can use the appropriate pragmas, for instance:
#pragma GCC target("sse4.1")
GCC 4.4 or later is needed, AFAIR.
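To make the single-file idea concrete, here is a minimal sketch that uses the per-function attribute form of the same mechanism, `__attribute__((target(...)))`, instead of the file-scope pragma. The function names follow the question, but the bodies (summing four floats) are invented purely for illustration, and the sketch assumes GCC 4.9 or later on x86 so that <immintrin.h> can be included without any -m flag. The dispatcher relies on GCC's `__builtin_cpu_supports`.

```cpp
#include <immintrin.h>  // x86 intrinsics; assumed usable without -m flags (GCC >= 4.9)
#include <cassert>

// Baseline version: plain C++, safe on any CPU.
static float do_something_generic(const float* v) {
    return v[0] + v[1] + v[2] + v[3];
}

// SSE3 is enabled for this one function only.
__attribute__((target("sse3")))
static float do_something_sse3(const float* v) {
    __m128 x = _mm_loadu_ps(v);
    x = _mm_hadd_ps(x, x);  // (v0+v1, v2+v3, v0+v1, v2+v3)
    x = _mm_hadd_ps(x, x);  // every lane now holds v0+v1+v2+v3
    return _mm_cvtss_f32(x);
}

// SSE4.1 is enabled for this one function only.
__attribute__((target("sse4.1")))
static float do_something_sse4(const float* v) {
    const __m128 x = _mm_loadu_ps(v);
    // Dot product with (1,1,1,1): mask 0xF1 multiplies all four lanes
    // and writes the sum into lane 0.
    return _mm_cvtss_f32(_mm_dp_ps(x, _mm_set1_ps(1.0f), 0xF1));
}

// Runtime dispatcher, compiled with the default (baseline) options.
float do_something(const float* v) {
    assert(v != nullptr);
    __builtin_cpu_init();  // harmless if the CPU model is already initialised
    if (__builtin_cpu_supports("sse4.1")) return do_something_sse4(v);
    if (__builtin_cpu_supports("sse3"))   return do_something_sse3(v);
    return do_something_generic(v);
}
```

Build it with no -msse* flag at all (for example `g++ -O2 -c file.cpp`). Only the two attributed functions may then contain SSE3 or SSE4.1 instructions, which is exactly the separation the question asks for.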
I think you want to build what's called a "CPU dispatcher". I got one working (as far as I know) for GCC but have not got it to work with Visual Studio.
cpu dispatcher for visual studio for AVX and SSE
I would check out Agner Fog's vectorclass and the file dispatch_example.cpp
http://www.agner.org/optimize/#vectorclass
g++ -O3 -msse2 -c dispatch_example.cpp -od2.o
g++ -O3 -msse4.1 -c dispatch_example.cpp -od5.o
g++ -O3 -mavx -c dispatch_example.cpp -od8.o
g++ -O3 -msse2 instrset_detect.cpp d2.o d5.o d8.o
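The commands above compile the same source once per instruction set and link all objects into one binary. What remains is choosing one entry point at run time. A common way is a function pointer that initially points at a dispatcher and is overwritten on the first call. The sketch below shows only that pattern and is not taken from dispatch_example.cpp: `work_sse2`, `work_sse41` and `work_avx` are hypothetical stand-ins (plain C++ here so the sketch fits in one file), and detection uses GCC's `__builtin_cpu_supports` rather than instrset_detect.cpp.

```cpp
// Hypothetical stand-ins for the three separately compiled versions.
// In a real build each would live in its own object file
// (d2.o, d5.o, d8.o), compiled with a different -m flag.
static int work_sse2(int x)  { return x * 2; }
static int work_sse41(int x) { return x * 2; }
static int work_avx(int x)   { return x * 2; }

using work_fn = int (*)(int);

static int work_dispatch(int x);  // forward declaration

// Every caller goes through this pointer. It starts at the dispatcher.
static work_fn work_ptr = work_dispatch;

// Pick the best version the running CPU supports.
static work_fn select_work() {
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))    return work_avx;
    if (__builtin_cpu_supports("sse4.1")) return work_sse41;
    return work_sse2;  // lowest level this binary is built for
}

// First call lands here: detect once, rebind the pointer, forward the call.
static int work_dispatch(int x) {
    work_ptr = select_work();
    return work_ptr(x);
}

// Public entry point: one indirect call after the first use.
inline int work(int x) { return work_ptr(x); }
```

After the first call the detection cost is gone. Note that the pointer update here is not synchronised, so a multithreaded program would want to trigger the first call before starting its threads.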
Here is an example of compiling a separate object file for each optimization setting:
http://notabs.org/lfsr/software/index.htm
But even this method fails when gcc link time optimization (-flto) is used. So how can a single executable be built with full optimization for different processors? The only solution I can find is to use include directives to make the C files behave as a single compilation unit so that -flto is not needed. Here is an example using that method:
http://notabs.org/blcutil/index.htm
If you are using GCC 4.9 or above on an i686 or x86_64 machine, then you are supposed to be able to use intrinsics regardless of your -march=XXX and -mXXX options. You could write your do_something() accordingly:
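void do_something()
{
    // ptr is the caller's input buffer (its declaration is not shown here)
    byte temp[18];
    // Test the most capable instruction set first: every CPU with SSSE3
    // also has SSE2, so checking SSE2 first would hide the SSSE3 branch.
    if (HasSSSE3())
    {
        const __m128i MASK = _mm_set_epi8(12,13,14,15, 8,9,10,11, 4,5,6,7, 0,1,2,3);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(temp),
            _mm_shuffle_epi8(_mm_loadu_si128((const __m128i*)(ptr)), MASK));
    }
    else if (HasSSE2())
    {
        const __m128i i = _mm_loadu_si128((const __m128i*)(ptr));
        ...
    }
    else
    {
        // Do the byte swap/endian reversal manually
        ...
    }
}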
You have to supply HasSSE2(), HasSSSE3() and friends. Also see "Intrinsics for CPUID like informations?".
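For what it's worth, here is one possible sketch of those helpers, built on the `__get_cpuid` wrapper from GCC's <cpuid.h>. The `Leaf1` struct and `CpuFeatures()` are hypothetical names introduced only for this example. The bit positions are the ones documented for CPUID leaf 1: SSE2 is EDX bit 26, SSSE3 is ECX bit 9.

```cpp
#include <cpuid.h>  // GCC/Clang wrapper around the x86 CPUID instruction

// Register values returned by CPUID leaf 1 (hypothetical helper type).
struct Leaf1 {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
};

// Query CPUID once and cache the result (hypothetical helper).
static const Leaf1& CpuFeatures() {
    static const Leaf1 cached = [] {
        Leaf1 r;
        // __get_cpuid returns 0 if the leaf is unsupported;
        // in that case report no features at all.
        if (!__get_cpuid(1, &r.eax, &r.ebx, &r.ecx, &r.edx)) r = Leaf1{};
        return r;
    }();
    return cached;
}

// CPUID.1:EDX bit 26 -> SSE2
bool HasSSE2()  { return (CpuFeatures().edx & (1u << 26)) != 0; }

// CPUID.1:ECX bit 9 -> SSSE3
bool HasSSSE3() { return (CpuFeatures().ecx & (1u << 9)) != 0; }
```

On GCC 4.8 or later, `__builtin_cpu_supports("sse2")` and `__builtin_cpu_supports("ssse3")` are an alternative that avoids handling the raw bits yourself.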
Also see GCC Issue 57202 - Please make the intrinsics headers like immintrin.h be usable without compiler flags. But I don't believe the feature works. I regularly encounter compile failures because GCC does not make intrinsics available.