This code works in debug mode, but panics because of the assert in release mode.
use std::arch::x86_64::*;
fn main() {
unsafe {
let a = vec![2.0f32, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0];
let b = -1.0f32;
let ar = _mm256_loadu_ps(a.as_ptr());
println!("ar: {:?}", ar);
let br = _mm256_set1_ps(b);
println!("br: {:?}", br);
let mut abr = _mm256_setzero_ps();
println!("abr: {:?}", abr);
abr = _mm256_fmadd_ps(ar, br, abr);
println!("abr: {:?}", abr);
let mut ab = [0.0; 8];
_mm256_storeu_ps(ab.as_mut_ptr(), abr);
println!("ab: {:?}", ab);
assert_eq!(ab[0], -2.0f32);
}
}
(Playground)
I can indeed confirm that this code causes the assert to trip in release mode:
$ cargo run --release
Finished release [optimized] target(s) in 0.00s
Running `target/release/so53831502`
ar: __m256(2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
br: __m256(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0)
abr: __m256(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
abr: __m256(-1.0, -1.0, -1.0, -1.0, 0.0, 0.0, 0.0, 0.0)
ab: [-1.0, -1.0, -1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `-1.0`,
right: `-2.0`', src/main.rs:24:9
This appears to be a compiler bug, see here and here. In particular, you are calling routines like _mm256_set1_ps
and _mm256_fmadd_ps
, which require the CPU features avx
and fma
respectively, but neither your code nor your compilation command indicate to the compiler that such features should be used.
One way of fixing this is to tell the compiler to compile the entire program with both the avx
and fma
features enabled, like so:
$ RUSTFLAGS="-C target-feature=+avx,+fma" cargo run --release
Compiling so53831502 v0.1.0 (/tmp/so53831502)
Finished release [optimized] target(s) in 0.36s
Running `target/release/so53831502`
ar: __m256(2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
br: __m256(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0)
abr: __m256(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
abr: __m256(-2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
ab: [-2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Another approach that achieves the same result is to tell the compiler to use all available CPU features on your CPU:
$ RUSTFLAGS="-C target-cpu=native" cargo run --release
Compiling so53831502 v0.1.0 (/tmp/so53831502)
Finished release [optimized] target(s) in 0.34s
Running `target/release/so53831502`
ar: __m256(2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
br: __m256(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0)
abr: __m256(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
abr: __m256(-2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
ab: [-2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
However, both of these compilation commands produce binaries that can only run on CPUs that support the avx
and fma
features. If that's not a problem for you, then this is a fine solution. If you would instead like to build portable binaries, then you can perform CPU feature detection at runtime, and compile certain functions with specific CPU features enabled. It is then your responsibility to guarantee that said functions are only invoked when the corresponding CPU feature is enabled and available. This process is documented as part of the dynamic CPU feature detection section of the std::arch
docs.
Here's an example that uses runtime CPU feature detection:
use std::arch::x86_64::*;
use std::process;
fn main() {
if is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") {
// SAFETY: This is safe because we're guaranteed to support the
// necessary CPU features.
unsafe { doit(); }
} else {
eprintln!("unsupported CPU");
process::exit(1);
}
}
#[target_feature(enable = "avx,fma")]
unsafe fn doit() {
let a = vec![2.0f32, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0];
let b = -1.0f32;
let ar = _mm256_loadu_ps(a.as_ptr());
println!("ar: {:?}", ar);
let br = _mm256_set1_ps(b);
println!("br: {:?}", br);
let mut abr = _mm256_setzero_ps();
println!("abr: {:?}", abr);
abr = _mm256_fmadd_ps(ar, br, abr);
println!("abr: {:?}", abr);
let mut ab = [0.0; 8];
_mm256_storeu_ps(ab.as_mut_ptr(), abr);
println!("ab: {:?}", ab);
assert_eq!(ab[0], -2.0f32);
}
To run it, you no longer need to set any compilation flags:
$ cargo run --release
Compiling so53831502 v0.1.0 (/tmp/so53831502)
Finished release [optimized] target(s) in 0.29s
Running `target/release/so53831502`
ar: __m256(2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
br: __m256(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0)
abr: __m256(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
abr: __m256(-2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
ab: [-2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
If you run the resulting binary on a CPU that doesn't support either avx
or fma
, then the program should exit with an error message: unsupported CPU
.
In general, I think the docs for std::arch
could be improved. In particular, the key boundary at which you need to split your code is dependent upon whether your vector types appear in your function signature. That is, the doit
routine does not require anything beyond the standard x86 (or x86_64) function ABI to call, and is thus safe to call from functions that don't otherwise support avx
or fma
. However, internally, the function has been told to compile its code using additional instruction set extensions based on the given CPU features. This is achieved via the target_feature
attribute. If you, for example, supplied an incorrect target feature:
#[target_feature(enable = "ssse3")]
unsafe fn doit() {
// ...
}
then the program exhibits the same behavior as your initial program.