I will ask my question by giving an example. I have a function called do_something(). It has three versions: do_something(), do_something_sse3(), and do_something_sse4(). When my program runs, it detects the CPU features (whether SSE3 or SSE4 is supported) and calls one of the three versions accordingly.
The problem is: when I build my program with GCC, I have to set -msse4 for do_something_sse4() to compile (e.g. for the header file <smmintrin.h> to be included).
However, if I set -msse4, then GCC is allowed to use SSE4 instructions, and some intrinsics in do_something_sse3() are also translated to SSE4 instructions. So if my program runs on a CPU that has only SSE3 (but no SSE4) support, it fails with "illegal instruction" when it calls do_something_sse3().
Maybe I have some bad practice. Could you give some suggestions? Thanks.
I think Mystical's tip is fine, but if you really want to do it in one file, you can use the appropriate pragmas, for instance #pragma GCC target. GCC 4.4 or later is needed, AFAIR.
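A rough sketch of the pragma approach, assuming GCC 4.9 or later (so that <smmintrin.h> can be included without -msse4.1 and __builtin_cpu_supports() is available). The region between push_options and pop_options is compiled as if -msse4.1 were set, while the rest of the file keeps the default flags. The function bodies (maximum of four integers) and the name do_something_generic() are made up for illustration; they are not from the question.

```c
#include <stdint.h>
#include <smmintrin.h>   /* SSE4.1 intrinsics; assumes GCC 4.9+ when built without -msse4.1 */

/* Portable fallback, compiled with the file's default flags (hypothetical body). */
static int32_t do_something_generic(const int32_t v[4])
{
    int32_t m = v[0];
    for (int i = 1; i < 4; i++)
        if (v[i] > m)
            m = v[i];
    return m;
}

/* Only the code inside this region may use SSE4.1 instructions. */
#pragma GCC push_options
#pragma GCC target("sse4.1")
static int32_t do_something_sse4(const int32_t v[4])
{
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    /* Fold the upper two lanes onto the lower two, then the remaining pair. */
    x = _mm_max_epi32(x, _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2)));
    x = _mm_max_epi32(x, _mm_shuffle_epi32(x, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(x);
}
#pragma GCC pop_options

/* Dispatcher: picks a version at run time, so it must not need SSE4.1 itself. */
int32_t do_something(const int32_t v[4])
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.1"))
        return do_something_sse4(v);
    return do_something_generic(v);
}
```

The point of the layout is that the file is built without -msse4, so GCC has no licence to emit SSE4 instructions in do_something_generic() or in the dispatcher.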
I think you want to build what's called a "CPU dispatcher". I got one working (as far as I know) for GCC but have not got it to work with Visual Studio.
cpu dispatcher for visual studio for AVX and SSE
I would check out Agner Fog's vectorclass and the file dispatch_example.cpp http://www.agner.org/optimize/#vectorclass
If you are using GCC 4.9 or above on an i686 or x86_64 machine, then you are supposed to be able to use intrinsics regardless of your -march=XXX and -mXXX options. You could write your do_something() accordingly, choosing a code path at run time. You have to supply HasSSE2(), HasSSSE3() and friends. Also see Intrinsics for CPUID like informations?.
Also see GCC Issue 57202 - Please make the intrinsics headers like immintrin.h be usable without compiler flags. But I don't believe the feature works. I regularly encounter compile failures because GCC does not make intrinsics available.
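For what it's worth, here is a minimal sketch of how such a do_something() might look, assuming GCC 4.9 or later. HasSSE2() and HasSSSE3() are implemented here with __get_cpuid() from GCC's <cpuid.h>, which is only one possible way to supply them. The per-version function names and bodies (absolute value of four integers) are hypothetical. Note that each function using intrinsics carries a target attribute; as far as I know, without it (or a matching -m flag) GCC rejects the intrinsic call with a "target specific option mismatch" error, which may be the kind of compile failure mentioned above.

```c
#include <stdint.h>
#include <cpuid.h>       /* __get_cpuid, bit_SSE2, bit_SSSE3 */
#include <tmmintrin.h>   /* SSSE3 intrinsics (pulls in the SSE2 ones too) */

/* CPUID leaf 1 reports SSE2 in EDX and SSSE3 in ECX. */
static int HasSSE2(void)
{
    unsigned int a, b, c, d;
    return __get_cpuid(1, &a, &b, &c, &d) && (d & bit_SSE2) != 0;
}

static int HasSSSE3(void)
{
    unsigned int a, b, c, d;
    return __get_cpuid(1, &a, &b, &c, &d) && (c & bit_SSSE3) != 0;
}

/* SSSE3 has a dedicated packed absolute-value instruction. */
static __attribute__((target("ssse3"))) void do_something_ssse3(int32_t v[4])
{
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    _mm_storeu_si128((__m128i *)v, _mm_abs_epi32(x));
}

/* SSE2 only: abs(x) = (x ^ sign) - sign, where sign is x >> 31 (arithmetic). */
static __attribute__((target("sse2"))) void do_something_sse2(int32_t v[4])
{
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    __m128i sign = _mm_srai_epi32(x, 31);
    x = _mm_sub_epi32(_mm_xor_si128(x, sign), sign);
    _mm_storeu_si128((__m128i *)v, x);
}

/* Plain C fallback for CPUs with neither extension. */
static void do_something_generic(int32_t v[4])
{
    for (int i = 0; i < 4; i++)
        if (v[i] < 0)
            v[i] = -v[i];
}

void do_something(int32_t v[4])
{
    if (HasSSSE3())
        do_something_ssse3(v);
    else if (HasSSE2())
        do_something_sse2(v);
    else
        do_something_generic(v);
}
```

In real code you would probably cache the CPUID results (or resolve a function pointer once) rather than query the CPU on every call.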
Here is an example of compiling a separate object file for each optimization setting: http://notabs.org/lfsr/software/index.htm
But even this method fails when gcc link time optimization (-flto) is used. So how can a single executable be built with full optimization for different processors? The only solution I can find is to use include directives to make the C files behave as a single compilation unit so that -flto is not needed. Here is an example using that method: http://notabs.org/blcutil/index.htm