This code works in debug mode, but panics because of the assert in release mode.
use std::arch::x86_64::*;

fn main() {
    unsafe {
        let a = vec![2.0f32, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0];
        let b = -1.0f32;

        let ar = _mm256_loadu_ps(a.as_ptr());
        println!("ar: {:?}", ar);

        let br = _mm256_set1_ps(b);
        println!("br: {:?}", br);

        let mut abr = _mm256_setzero_ps();
        println!("abr: {:?}", abr);

        abr = _mm256_fmadd_ps(ar, br, abr);
        println!("abr: {:?}", abr);

        let mut ab = [0.0; 8];
        _mm256_storeu_ps(ab.as_mut_ptr(), abr);
        println!("ab: {:?}", ab);

        assert_eq!(ab[0], -2.0f32);
    }
}
I can indeed confirm that this code causes the assert to trip in release mode.
This appears to be a compiler bug, see here and here. In particular, you are calling routines like _mm256_set1_ps and _mm256_fmadd_ps, which require the CPU features avx and fma respectively, but neither your code nor your compilation command indicates to the compiler that such features should be used.
One way of fixing this is to tell the compiler to compile the entire program with both the avx and fma features enabled. Another approach that achieves the same result is to tell the compiler to use all available CPU features on your CPU.
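As a rough illustration of what these whole-program flags change, the sketch below reports whether avx and fma were enabled for the crate at compile time. The build commands in the comments use rustc's `-C target-feature` and `-C target-cpu` codegen options passed through `RUSTFLAGS`; the helper name `compile_time_features` is made up for this example.

```rust
// Possible ways to build this (standard rustc codegen options):
//   RUSTFLAGS="-C target-feature=+avx,+fma" cargo run --release
//   RUSTFLAGS="-C target-cpu=native" cargo run --release
// Without such flags, both values below are normally `false` on a
// default x86_64 target.

/// Hypothetical helper: returns (avx, fma) as seen by the compiler for the
/// whole crate. `cfg!` is evaluated at compile time, not at runtime.
fn compile_time_features() -> (bool, bool) {
    (
        cfg!(target_feature = "avx"),
        cfg!(target_feature = "fma"),
    )
}

fn main() {
    let (avx, fma) = compile_time_features();
    println!("avx enabled at compile time: {}", avx);
    println!("fma enabled at compile time: {}", fma);
}
```

If both lines print `true`, the intrinsics from the question are compiled in a context that actually has the features they require.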
However, both of these compilation commands produce binaries that can only run on CPUs that support the avx and fma features. If that's not a problem for you, then this is a fine solution. If you would instead like to build portable binaries, then you can perform CPU feature detection at runtime, and compile certain functions with specific CPU features enabled. It is then your responsibility to guarantee that said functions are only invoked when the corresponding CPU feature is enabled and available. This process is documented as part of the dynamic CPU feature detection section of the std::arch docs.
Here's an example that uses runtime CPU feature detection:
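A minimal sketch of this approach, reusing the arithmetic from the question, could look like the following. The `try_doit` wrapper, the exact signature of `doit`, and the `cfg(target_arch)` guards are assumptions made for illustration; `is_x86_feature_detected!` and the `target_feature` attribute are the standard library and language facilities described above.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Computes a[i] * b + 0.0 for eight lanes using AVX and FMA.
/// Only plain arrays and scalars appear in the signature.
/// Safety: the caller must ensure the CPU supports both avx and fma.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
#[target_feature(enable = "fma")]
unsafe fn doit(a: &[f32; 8], b: f32) -> [f32; 8] {
    let ar = _mm256_loadu_ps(a.as_ptr());
    let br = _mm256_set1_ps(b);
    let abr = _mm256_fmadd_ps(ar, br, _mm256_setzero_ps());
    let mut ab = [0.0f32; 8];
    _mm256_storeu_ps(ab.as_mut_ptr(), abr);
    ab
}

/// Hypothetical safe wrapper: returns None when the CPU (or the target
/// architecture) does not provide avx and fma.
fn try_doit(a: &[f32; 8], b: f32) -> Option<[f32; 8]> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") {
            // SAFETY: both required features were just detected at runtime.
            return Some(unsafe { doit(a, b) });
        }
    }
    #[cfg(not(target_arch = "x86_64"))]
    let _ = (a, b); // silence unused-variable warnings on other targets
    None
}

fn main() {
    let a = [2.0f32, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0];
    match try_doit(&a, -1.0) {
        Some(ab) => {
            println!("ab: {:?}", ab);
            assert_eq!(ab[0], -2.0f32);
        }
        None => {
            eprintln!("unsupported CPU");
            std::process::exit(1);
        }
    }
}
```

The feature check happens once, in safe code, and the `unsafe` call is only reached when both features are present.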
To run it, you no longer need to set any compilation flags.
If you run the resulting binary on a CPU that doesn't support either avx or fma, then the program should exit with an error message: unsupported CPU.
In general, I think the docs for std::arch could be improved. In particular, the key boundary at which you need to split your code depends on whether your vector types appear in your function signature. That is, the doit routine does not require anything beyond the standard x86 (or x86_64) function ABI to call, and is thus safe to call from functions that don't otherwise support avx or fma. However, internally, the function has been told to compile its code using additional instruction set extensions based on the given CPU features. This is achieved via the target_feature attribute. If you, for example, supplied an incorrect target feature, then the program exhibits the same behavior as your initial program.
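To make the boundary concrete, here is a small sketch under the assumption that every function with a vector type in its signature carries the same target_feature attributes as its caller, while the function at the boundary takes only plain arrays and scalars. The names `simd`, `mul_add`, `scale` and `scale_checked` are hypothetical and chosen only for this illustration.

```rust
#[cfg(target_arch = "x86_64")]
mod simd {
    use std::arch::x86_64::*;

    /// Inner helper: `__m256` appears in the signature, so this function is
    /// compiled with the same features as its only caller and is not exposed
    /// to code built without avx and fma.
    #[target_feature(enable = "avx")]
    #[target_feature(enable = "fma")]
    unsafe fn mul_add(a: __m256, b: __m256, c: __m256) -> __m256 {
        _mm256_fmadd_ps(a, b, c)
    }

    /// Boundary function: no vector types in the signature, so it can be
    /// called from ordinary code once the features have been detected.
    #[target_feature(enable = "avx")]
    #[target_feature(enable = "fma")]
    pub unsafe fn scale(a: &[f32; 8], b: f32) -> [f32; 8] {
        let ar = _mm256_loadu_ps(a.as_ptr());
        let br = _mm256_set1_ps(b);
        let abr = mul_add(ar, br, _mm256_setzero_ps());
        let mut ab = [0.0f32; 8];
        _mm256_storeu_ps(ab.as_mut_ptr(), abr);
        ab
    }
}

/// Safe entry point: checks the CPU before crossing the boundary.
fn scale_checked(a: &[f32; 8], b: f32) -> Option<[f32; 8]> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") {
            // SAFETY: avx and fma were detected just above.
            return Some(unsafe { simd::scale(a, b) });
        }
    }
    #[cfg(not(target_arch = "x86_64"))]
    let _ = (a, b); // silence unused-variable warnings on other targets
    None
}

fn main() {
    let a = [2.0f32, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0];
    match scale_checked(&a, -1.0) {
        Some(ab) => println!("ab: {:?}", ab),
        None => eprintln!("unsupported CPU"),
    }
}
```

Keeping `mul_add` private to the module is one way to make sure a `__m256` value never passes through a function that was compiled without the matching features.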