I am trying to implement and build some code, parts of which contain SIMD calls. I am compiling this code on a server running basically the same OS as my machine, yet I can't get it to compile there.
This is the error:
make
g++ main.cpp -march=native -o main -fopenmp
In file included from /usr/lib/gcc/x86_64-linux-gnu/7/include/immintrin.h:53:0,
from tensor.hpp:9,
from main.cpp:4:
/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512vlintrin.h: In function ‘_ZN6TensorIdE8add_avx2ERKS0_._omp_fn.5’:
/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512vlintrin.h:447:1: error: inlining failed in call to always_inline ‘__m256d _mm256_mask_add_pd(__m256d, __mmask8, __m256d, __m256d)’: target specific option mismatch
_mm256_mask_add_pd (__m256d __W, __mmask8 __U, __m256d __A,
^~~~~~~~~~~~~~~~~~
In file included from main.cpp:4:0:
tensor.hpp:228:33: note: called from here
res = _mm256_mask_add_pd(tmp, 0xFF, _mm256_mask_loadu_pd(tmp, 0xFF, &elements[i]), _mm256_mask_loadu_pd(tmp, 0xFF, &a.elements[i]));
~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/7/include/immintrin.h:53:0,
from tensor.hpp:9,
from main.cpp:4:
/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512vlintrin.h:610:1: error: inlining failed in call to always_inline ‘__m256d _mm256_mask_loadu_pd(__m256d, __mmask8, const void*)’: target specific option mismatch
_mm256_mask_loadu_pd (__m256d __W, __mmask8 __U, void const *__P)
^~~~~~~~~~~~~~~~~~~~
In file included from main.cpp:4:0:
tensor.hpp:228:33: note: called from here
res = _mm256_mask_add_pd(tmp, 0xFF, _mm256_mask_loadu_pd(tmp, 0xFF, &elements[i]), _mm256_mask_loadu_pd(tmp, 0xFF, &a.elements[i]));
~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/7/include/immintrin.h:53:0,
from tensor.hpp:9,
from main.cpp:4:
/usr/lib/gcc/x86_64-linux-gnu/7/include/avx512vlintrin.h:610:1: error: inlining failed in call to always_inline ‘__m256d _mm256_mask_loadu_pd(__m256d, __mmask8, const void*)’: target specific option mismatch
_mm256_mask_loadu_pd (__m256d __W, __mmask8 __U, void const *__P)
^~~~~~~~~~~~~~~~~~~~
In file included from main.cpp:4:0:
tensor.hpp:228:33: note: called from here
res = _mm256_mask_add_pd(tmp, 0xFF, _mm256_mask_loadu_pd(tmp, 0xFF, &elements[i]), _mm256_mask_loadu_pd(tmp, 0xFF, &a.elements[i]));
~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Makefile:7: recipe for target 'main' failed
make: *** [main] Error 1
Googling the problem didn't really help, as all the answers pointed out things I already do or have tried.
Can somebody provide some background as to why it doesn't work?
EDIT:
int main(){
#ifdef __AVX512F__
    auto tt = createTensor();
    auto tt2 = createTensor();
    auto res = tt.addAVX512(tt2);
#endif
}
//This is in tensor.hpp
#ifdef __AVX512F__
Tensor<T> Tensor::addAVX512(_param_){
    res = _mm256_mask_add_pd(tmp, 0xFF, _mm256_mask_loadu_pd(tmp, 0xFF, &elements[i]), _mm256_mask_loadu_pd(tmp, 0xFF, &a.elements[i]));
}
#endif
This is the gist of what happens ... I have encased all SIMD calls in #ifdefs, etc.
GCC will only let you use intrinsics for instruction sets that are enabled for the compiler to use. e.g. a related question about an AVX1 intrinsic: inlining failed in call to always_inline '__m256d _mm256_broadcast_sd(const double*)'
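For example (a hypothetical stand-alone snippet, not the question's code), the AVX1 intrinsic from that related question only compiles when AVX is enabled for the whole file, e.g. via -mavx or a suitable -march=:

#include <immintrin.h>

// _mm256_broadcast_sd is an AVX intrinsic: without -mavx (or an -march= that
// includes AVX), GCC reports the same "target specific option mismatch" error.
__m256d splat(const double *p) {
    return _mm256_broadcast_sd(p);
}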
These are _mask_ versions of 256-bit intrinsics, so they require AVX512VL. (My comments under the question about -mavx were wrong; I didn't notice the _mask in the name or args, just the _mm256.)

You're probably compiling on KNL (Knights Landing / Xeon Phi) on your server, which has AVX512F but not AVX512VL, so -march=native will set -mavx512f. (Unlike Skylake-AVX512, which does have AVX512VL, allowing use of cool new AVX512 stuff like masked instructions with narrower vectors.)
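One way to confirm what -march=native enables on the server is to dump GCC's predefined macros and look for the AVX-512 feature macros (a standard GCC trick, not something specific to this code):

g++ -march=native -dM -E - < /dev/null | grep AVX512

On a Skylake-AVX512 machine that list includes __AVX512VL__; on KNL you'll see __AVX512F__ (and __AVX512ER__/__AVX512PF__) but not __AVX512VL__.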
And you've found a bug in your tensor.hpp, where you use AVX512VL intrinsics after only checking for __AVX512F__ instead of __AVX512VL__. (AVX512-anything implies 512F, so it doesn't need to check both.)
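A minimal sketch of a guard that matches the intrinsics actually used, keeping the question's masked 256-bit calls (the function name and the tmp/pointer parameters here are just placeholders):

#include <immintrin.h>

// __AVX512VL__ implies __AVX512F__, so one check is enough for these intrinsics.
#ifdef __AVX512VL__
__m256d masked_add(__m256d tmp, const double *x, const double *y) {
    return _mm256_mask_add_pd(tmp, 0xFF,
                              _mm256_mask_loadu_pd(tmp, 0xFF, x),
                              _mm256_mask_loadu_pd(tmp, 0xFF, y));
}
#endif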
This is just pointless anyway: you don't need the masked versions of these intrinsics if you're going to use constant all-ones masks. Use _mm256_add_pd like a normal person and only check for __AVX__, or use _mm512_add_pd. (A sketch of the unmasked version is at the end of this answer.)

I thought at first this was from TensorFlow, but (from your comments) that doesn't make sense, and it can't be that badly written. Merge-masking into 3 copies of the same tmp with an all-true mask just makes no sense; it looks like a silly way to introduce a false dependency if the compiler can't optimize the mask=all-ones loads into unmasked loads.

It's also terrible C++ style: you have a variable called __m256d tmp as a global or class member?? It's not even a local dummy variable, so it may exist somewhere the compiler can't fully optimize it away.
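For reference, here is a rough sketch of the unmasked AVX version, written as a free function over plain double arrays because the real Tensor interface isn't shown in the question (the array and size names are placeholders):

#include <immintrin.h>
#include <cstddef>

#ifdef __AVX__
// Plain AVX addition of two double arrays: no masks, no shared tmp variable.
void add_pd(const double *x, const double *y, double *out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {                       // 4 doubles per __m256d
        __m256d vx = _mm256_loadu_pd(&x[i]);
        __m256d vy = _mm256_loadu_pd(&y[i]);
        _mm256_storeu_pd(&out[i], _mm256_add_pd(vx, vy));
    }
    for (; i < n; ++i)                                 // scalar tail
        out[i] = x[i] + y[i];
}
#endif

The same structure works with _mm512_* intrinsics (8 doubles per vector) under #ifdef __AVX512F__.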