I've seen a few articles describing how Vector<T> is SIMD-enabled and implemented using JIT intrinsics, so the compiler will correctly emit AVX/SSE/... instructions when using it, allowing much faster code than classic, scalar loops (example here). I decided to try rewriting one of my methods to see if I could get some speedup, but so far I've failed: the vectorized code is running 3 times slower than the original, and I'm not exactly sure why. Here are two versions of a method that checks whether two Span<float> instances have, at every position, a pair of items that lie on the same side of a threshold value.
// Classic implementation
public static unsafe bool MatchElementwiseThreshold(this Span<float> x1, Span<float> x2, float threshold)
{
    fixed (float* px1 = &x1.DangerousGetPinnableReference(), px2 = &x2.DangerousGetPinnableReference())
    {
        for (int i = 0; i < x1.Length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
    }
    return true;
}
// Vectorized
public static unsafe bool MatchElementwiseThresholdSIMD(this Span<float> x1, Span<float> x2, float threshold)
{
    // Setup the test vector
    int l = Vector<float>.Count;
    float* arr = stackalloc float[l];
    for (int i = 0; i < l; i++)
        arr[i] = threshold;
    Vector<float> cmp = Unsafe.Read<Vector<float>>(arr);
    fixed (float* px1 = &x1.DangerousGetPinnableReference(), px2 = &x2.DangerousGetPinnableReference())
    {
        // Iterate in chunks
        int
            div = x1.Length / l,
            mod = x1.Length % l,
            i = 0,
            offset = 0;
        for (; i < div; i += 1, offset += l)
        {
            Vector<float>
                v1 = Unsafe.Read<Vector<float>>(px1 + offset),
                v1cmp = Vector.GreaterThan<float>(v1, cmp),
                v2 = Unsafe.Read<Vector<float>>(px2 + offset),
                v2cmp = Vector.GreaterThan<float>(v2, cmp);
            float*
                pcmp1 = (float*)Unsafe.AsPointer(ref v1cmp),
                pcmp2 = (float*)Unsafe.AsPointer(ref v2cmp);
            for (int j = 0; j < l; j++)
                if ((pcmp1[j] == 0) != (pcmp2[j] == 0))
                    return false;
        }

        // Test the remaining items, if any
        if (mod == 0) return true;
        for (i = x1.Length - mod; i < x1.Length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
    }
    return true;
}
As I said, I've tested both versions using BenchmarkDotNet, and the one using Vector<T> runs around 3 times slower than the other. I tried running the tests with spans of different lengths (from around 100 to over 2000), but the vectorized method remains much slower than the other one.
Am I missing something obvious here?
Thanks!
EDIT: the reason I'm using unsafe code and trying to optimize this method as much as possible without parallelizing it is that it is already being called from within a Parallel.For iteration.
Besides, being able to parallelize the code over multiple threads is generally not a good reason to leave the individual parallel tasks unoptimized.
EDIT: After reading a blog post by Marc Gravell, I see that this can be achieved simply...
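Something along these lines; this is a minimal sketch of that approach rather than the post's exact code, and it assumes MemoryMarshal.Cast to reinterpret the spans as spans of Vector<float> (the post itself used a pre-release span-cast API to the same effect):

using System;
using System.Numerics;
using System.Runtime.InteropServices;

public static class SpanThresholdExtensions
{
    public static bool MatchElementwiseThresholdSIMD(this Span<float> x1, Span<float> x2, float threshold)
    {
        if (x1.Length != x2.Length)
            throw new ArgumentException("The spans must have the same length");

        int offset = 0;
        if (Vector.IsHardwareAccelerated)
        {
            // Reinterpret the float spans as spans of SIMD vectors: no copying, no pinning
            Span<Vector<float>> v1 = MemoryMarshal.Cast<float, Vector<float>>(x1);
            Span<Vector<float>> v2 = MemoryMarshal.Cast<float, Vector<float>>(x2);
            Vector<float> vThreshold = new Vector<float>(threshold);
            for (int i = 0; i < v1.Length; i++)
            {
                // The two comparison masks differ exactly when some pair of lanes
                // falls on opposite sides of the threshold
                if (Vector.Xor(Vector.GreaterThan(v1[i], vThreshold),
                               Vector.GreaterThan(v2[i], vThreshold)) != Vector<int>.Zero)
                    return false;
            }
            offset = v1.Length * Vector<float>.Count;
        }

        // Scalar tail (or the whole input when not hardware accelerated)
        for (int i = offset; i < x1.Length; i++)
            if (x1[i] > threshold != x2[i] > threshold)
                return false;

        return true;
    }
}

The key differences from the question's version: the threshold vector comes straight from the Vector<float>(float) constructor instead of a stackalloc round-trip, and the per-vector result is checked with one Xor against Vector<int>.Zero instead of reading the mask back lane by lane, which is what was defeating the vectorization.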
Now this is not quite as quick as using arrays directly (if that's what you have), but it is still significantly faster than the non-SIMD version...
(Another edit...)
...and just for fun I thought I would see how well this stuff works when fully generic, and the answer is: very well... so you can write code like the following, and it is just as efficient as the type-specific version (well, except in the non-hardware-accelerated case, in which case it's a bit less than twice as slow - but not completely terrible...)
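A sketch of what such a generic version can look like; the overall shape and the Comparer<T>-based scalar fallback are my assumptions here rather than a quoted implementation, and note that Vector<T> only supports the primitive numeric types, so an unsupported T throws at runtime:

using System;
using System.Collections.Generic;
using System.Numerics;
using System.Runtime.InteropServices;

public static class GenericThresholdExtensions
{
    public static bool MatchElementwiseThreshold<T>(this Span<T> x1, Span<T> x2, T threshold)
        where T : struct
    {
        if (x1.Length != x2.Length)
            throw new ArgumentException("The spans must have the same length");

        int offset = 0;
        if (Vector.IsHardwareAccelerated)
        {
            Span<Vector<T>> v1 = MemoryMarshal.Cast<T, Vector<T>>(x1);
            Span<Vector<T>> v2 = MemoryMarshal.Cast<T, Vector<T>>(x2);
            Vector<T> vThreshold = new Vector<T>(threshold);
            for (int i = 0; i < v1.Length; i++)
            {
                // The generic GreaterThan overload returns an all-ones/all-zeros mask per lane
                if (Vector.Xor(Vector.GreaterThan(v1[i], vThreshold),
                               Vector.GreaterThan(v2[i], vThreshold)) != Vector<T>.Zero)
                    return false;
            }
            offset = v1.Length * Vector<T>.Count;
        }

        // Scalar tail; Comparer<T>.Default stands in for the > operator
        Comparer<T> comparer = Comparer<T>.Default;
        for (int i = offset; i < x1.Length; i++)
            if ((comparer.Compare(x1[i], threshold) > 0) != (comparer.Compare(x2[i], threshold) > 0))
                return false;

        return true;
    }
}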
I had the same problem. The solution was to uncheck the Prefer 32-bit option in the project properties.
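Presumably that works because a 32-bit .NET Framework process runs under the legacy JIT, which does not accelerate Vector<T>. You can confirm at runtime whether SIMD acceleration is actually active:

using System;
using System.Numerics;

class SimdCheck
{
    static void Main()
    {
        // False means Vector<T> falls back to scalar software emulation
        Console.WriteLine($"Hardware accelerated: {Vector.IsHardwareAccelerated}");
        // SIMD width in floats, e.g. 8 on an AVX2-capable machine
        Console.WriteLine($"Vector<float>.Count: {Vector<float>.Count}");
    }
}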
A vector is just a vector; on its own it doesn't claim or guarantee that SIMD extensions are used. Use one of the SIMD-enabled vector types, such as System.Numerics.Vector2:
https://docs.microsoft.com/en-us/dotnet/standard/numerics#simd-enabled-vector-types