I'm interested in how efficient low-level algorithms can be in .net. I would like to enable us to choose to write more of our code in C# rather than C++ in the future, but one stumbling block is the bounds checking in .net that occurs with looping and random access to arrays.
A motivating example is a function that calculates the sum of products of corresponding elements in two arrays (this is the dot product of two vectors).
static void SumProduct(double[] X, double[] Y)
{
    double sum = 0;
    int length = X.Length;
    if (length != Y.Length)
        throw new ArgumentException("X and Y must be same size");
    for (int i = 0; i < length; i++) // Check X.Length instead? See below
        sum += X[i] * Y[i];
}
From what I can tell (I don't know enough IL or x86 to check), the compiler won't optimize out the bounds checking of X and Y. Am I wrong, and/or is there a way to write my code to allow the compiler to help me out?
Further details
There are many efficiency arguments for and against using particular languages, not least that it is better to concentrate on "big O" algorithmic cost rather than the constant of proportionality, and higher-level languages help you do this. On the subject of bounds checking in .net, the best article I found is Array Bounds Check Elimination in the CLR on MSDN (also referenced in a Stack Overflow answer on the importance of enabling optimization).
This dates from 2009, so I wonder whether things have changed significantly since then. Also, the article reveals some real subtleties that would have caught me out so for this reason alone I would welcome some expert advice.
For example, it appears that in my code above I would have been better off writing i < X.Length rather than i < length. Also, I had naively assumed that for an algorithm with a single array, writing a foreach loop would better declare my intent to the compiler and give it the best chance of optimizing out the bounds checking.
According to the MSDN article, SumForBAD, below, which I thought was sure to be optimized, would not be; SumFor would be straightforwardly optimized; and SumForEach would also be optimized, but not trivially (and might not be optimized at all if the array were passed into a function as IEnumerable<int> — see the sketch after the code below).
static double SumForBAD(double[] X)
{
    double sum = 0;
    int length = X.Length; // better to use i < X.Length in the loop
    for (int i = 0; i < length; i++)
        sum += X[i];
    return sum;
}
static double SumFor(double[] X)
{
    double sum = 0;
    for (int i = 0; i < X.Length; i++)
        sum += X[i];
    return sum;
}
static double SumForEach(double[] X)
{
    double sum = 0;
    foreach (double element in X)
        sum += element;
    return sum;
}
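To be concrete, the IEnumerable version I have in mind would look something like this (my own sketch, not code from the article):

// needs: using System.Collections.Generic;
static double SumEnumerable(IEnumerable<double> X)
{
    // The compiler sees only IEnumerable<double> here: there is no array
    // for the jitter to elide bounds checks on, and each MoveNext()/Current
    // goes through the interface unless the enumerator is devirtualized.
    double sum = 0;
    foreach (double element in X)
        sum += element;
    return sum;
}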
I did some investigation based on doug65536's answer. In C++, I compared the times of a SumProduct that does one bounds check

for (int i = 0; i < n; ++i) sum += v1[i] * v2[i];

against another version that does two bounds checks

for (int i = 0; i < n1 && i < n2; ++i) sum += v1[i] * v2[i];

I found that the second version was slower, but only by about 3.5% (Visual Studio 2010, optimized build, default options). However, it occurred to me that in C# there might be three bounds checks: one explicit (i < length in the function static void SumProduct(double[] X, double[] Y) at the start of this question) and two implicit (X[i] and Y[i]). So I tested a third C++ function, with three bounds checks:

for (int i = 0; i < n1 && i < n2 && i < n3; ++i) sum += v1[i] * v2[i];
This came in 35% slower than the first, which is worth caring about. I did some more investigation in this question, Why does adding extra check in loop make big difference on some machines, and small difference on others?. Interestingly, it seems as though the cost of bounds checking varies significantly on different machines.
The bounds check won't matter because:
- The bounds check consists of a cmp/jae instruction pair, which is fused into a single micro-op on modern CPU architectures (the term is "macro-op fusion"). Compare and branch is very highly optimized.
- The bounds check is a forward branch, which will be statically predicted to be not-taken, also reducing the cost. The branch will never be taken. (If it ever is taken, an exception will be thrown anyway, so the mispredict cost becomes utterly irrelevant.)
- As soon as there is any memory delay, speculative execution will queue up many iterations of the loop, so the cost of decoding the extra instruction pair almost disappears.
- Memory access will likely be your bottleneck, so the effect of micro-optimizations like removing bounds checks will disappear.
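To make that concrete, the check the jitter emits for each element access is semantically equivalent to the following C# (an illustration of what X[i] expands to when the check is not elided, not literal jitter output):

static double LoadElement(double[] X, int i)
{
    // The jitter uses one unsigned compare: a negative i wraps around to a
    // huge unsigned value, so this single test covers both i < 0 and
    // i >= X.Length. This compare-and-branch is the cmp/jae pair above.
    if ((uint)i >= (uint)X.Length)
        throw new IndexOutOfRangeException();
    return X[i]; // the actual load; i is now known to be in range
}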
First of all, I would like to thank everyone who has spoken up in this post, from the original OP to the guys who provided extremely detailed and insightful explanations. I really, really enjoyed reading the existing answers. Since there is already plenty of theory on how and why these loops work the way they do, I would like to offer some empirical (and, by some definition, authoritative) measurements:
Conclusions:
- A for loop performs the same whether the length is cached in a local variable or read from the .Length property.
- unsafe fixed is not faster than a normal for loop.

Benchmarking code:
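(The original listing is not reproduced here; the sketch below shows the shape of such a harness, with the array size, repetition count, and method names being assumptions.)

using System;
using System.Diagnostics;

static class BoundsCheckBenchmark
{
    const int N = 1000000; // array size (illustrative)
    const int Reps = 100;  // repetitions per measurement (illustrative)

    static double SumCachedLength(double[] X)
    {
        double sum = 0;
        int length = X.Length;
        for (int i = 0; i < length; i++)
            sum += X[i];
        return sum;
    }

    static double SumDotLength(double[] X)
    {
        double sum = 0;
        for (int i = 0; i < X.Length; i++)
            sum += X[i];
        return sum;
    }

    static void Measure(string name, Func<double[], double> f, double[] data)
    {
        f(data); // warm up so the method is jitted before timing
        double sink = 0;
        var sw = Stopwatch.StartNew();
        for (int r = 0; r < Reps; r++)
            sink += f(data);
        sw.Stop();
        Console.WriteLine("{0}: {1} ms (checksum {2})", name, sw.ElapsedMilliseconds, sink);
    }

    static void Main()
    {
        var data = new double[N];
        var rng = new Random(42);
        for (int i = 0; i < N; i++)
            data[i] = rng.NextDouble();

        Measure("cached length", SumCachedLength, data);
        Measure(".Length      ", SumDotLength, data);
    }
}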
Results (using the Mono runtime):
One way to be sure that bounds checking is not performed is to use pointers, which you can do in C# in unsafe mode (this requires you to set a flag in the project properties):
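(The code block from this answer is missing in this copy; a sketch of what such a pointer-based version looks like, with the method name being mine. fixed pins the arrays so the GC cannot move them while raw pointers are live.)

static unsafe double SumProductPointer(double[] X, double[] Y)
{
    double sum = 0;
    int length = X.Length;
    if (length != Y.Length)
        throw new ArgumentException("X and Y must be same size");
    // Pin both arrays; xp and yp are raw pointers, so xp[i] and yp[i]
    // compile to plain loads with no bounds checks.
    fixed (double* xp = X, yp = Y)
    {
        for (int i = 0; i < length; i++)
            sum += xp[i] * yp[i];
    }
    return sum;
}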
I tried measuring your original method, your method with the X.Length change, and my code using pointers, compiled both as x86 and x64 under .NET 4.5. Specifically, I tried computing the method for vectors of length 10,000 and ran the method 10,000 times.
The results are pretty much in line with Michael Liu's answer: there is no measurable difference between the three methods, which means that bounds checking either isn't done or that its effect on performance is insignificant. There was a measurable difference between x86 and x64, though: x64 was about 34% slower.
Full code I used:
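(That full listing is also missing from this copy; its shape would be something like the sketch below. The helper names are mine, and the three methods under test are assumed to be the ones shown earlier in the question and in this answer.)

using System;
using System.Diagnostics;

static class Program
{
    const int Length = 10000;     // vector length, as stated above
    const int Iterations = 10000; // runs per method, as stated above

    static void Measure(string name, Func<double> f)
    {
        f(); // warm up so jitting is not included in the timing
        double sink = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
            sink += f();
        sw.Stop();
        Console.WriteLine("{0}: {1} ms (checksum {2})", name, sw.ElapsedMilliseconds, sink);
    }

    static void Main()
    {
        var X = new double[Length];
        var Y = new double[Length];
        var rng = new Random(1);
        for (int i = 0; i < Length; i++)
        {
            X[i] = rng.NextDouble();
            Y[i] = rng.NextDouble();
        }

        // SumProduct is the question's method with `return sum;` added (as in
        // Michael Liu's answer); SumProductLength is the X.Length variant;
        // SumProductPointer is the unsafe version sketched above.
        Measure("original", () => SumProduct(X, Y));
        Measure("X.Length", () => SumProductLength(X, Y));
        Measure("pointers", () => SumProductPointer(X, Y));
    }
}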
64-bit
The 64-bit jitter does a good job of eliminating bounds checks (at least in straightforward scenarios). I added return sum; at the end of your method and then compiled the program using Visual Studio 2010 in Release mode. In the disassembly below (which I annotated with a C# translation), notice that:
- There is no bounds check on X, even though your code compares i against length instead of X.Length. This is an improvement over the behavior described in the article.
- There is a single check, before the loop, that verifies Y.Length >= X.Length; inside the loop, Y is then accessed without a per-iteration bounds check.

Disassembly
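(The disassembly listing itself is not reproduced in this copy; in C# terms, my reading of the two observations above is that the 64-bit jitted code behaves roughly like this reconstruction, not like the literal emitted instructions:)

static double SumProductAs64BitJitted(double[] X, double[] Y)
{
    double sum = 0;
    int length = X.Length;
    if (length != Y.Length)
        throw new ArgumentException("X and Y must be same size");
    // One hoisted check replaces the per-iteration checks. Given the
    // equality test above it can never fire, but the jitter still
    // verifies it once before entering the unchecked loop.
    if (Y.Length < X.Length)
        throw new IndexOutOfRangeException();
    for (int i = 0; i < length; i++)
        sum += X[i] * Y[i]; // no bounds checks inside the loop
    return sum;
}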
32-bit
The 32-bit jitter, unfortunately, is not quite as smart. In the disassembly below, notice that:
- There is no bounds check on X, even though your code compares i against length instead of X.Length. Again, this is an improvement over the behavior described in the article.
- The bounds check on Y is not eliminated; it runs on every iteration of the loop.

Disassembly
Summing Up
The jitter has improved since 2009, and the 64-bit jitter can generate more efficient code than the 32-bit jitter.
If necessary, though, you can always bypass array bounds checks completely by using unsafe code and pointers (as svick points out). This technique is used by some performance-critical code in the Base Class Library.