I created a simple class to benchmark some methods of mine. But is it accurate? I am kind of new to benchmarking, timing, et cetera, so I thought I could ask for some feedback here. Also, if it is good, maybe somebody else can make use of it as well :)
public static class Benchmark
{
    public static IEnumerable<long> This(Action subject)
    {
        var watch = new Stopwatch();
        while (true)
        {
            watch.Reset();
            watch.Start();
            subject();
            watch.Stop();
            yield return watch.ElapsedTicks;
        }
    }
}
You can use it like this:
var avg = Benchmark.This(() => SomeMethod()).Take(500).Average();
Any feedback? Does it look to be pretty stable and accurate, or have I missed something?
It is about as accurate as you can get for a simple benchmark. But there are some factors not under your control:
- load on the system from other processes
- state of the heap before/during the benchmark
You could do something about that last point: a benchmark is one of the rare situations where calling GC.Collect can be defended. And you might call subject once beforehand to eliminate any JIT issues, but that requires calls to subject to be independent.
public static IEnumerable<TimeSpan> This(Action subject)
{
    subject();                      // warm up
    GC.Collect();                   // compact the heap
    GC.WaitForPendingFinalizers();  // and wait for the finalizer queue to empty
    var watch = new Stopwatch();
    while (true)
    {
        watch.Reset();
        watch.Start();
        subject();
        watch.Stop();
        yield return watch.Elapsed; // TimeSpan
    }
}
For bonus, your class should check the System.Diagnostics.Stopwatch.IsHighResolution field. If it is off, you only have a very coarse (20 ms) resolution.
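A minimal sketch of that check (the printed messages are just an illustration):

if (!Stopwatch.IsHighResolution)
{
    Console.WriteLine("Warning: no high-resolution timer available; results will be coarse.");
}

// Stopwatch.Frequency is ticks per second, so this is the smallest measurable interval.
Console.WriteLine("Stopwatch resolution: {0:0.###} ns per tick", 1e9 / Stopwatch.Frequency);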
But on an ordinary PC, with many services running in the background, it is never going to be very accurate.
Couple problems here.
First, remember that the first time you run the code, the transitive closure of its method calls will be jitted. That means that the first run is likely to have higher cost than every subsequent run. Depending on whether you are benchmarking "cold" timings or "hot" timings, this could make a difference. I have seen methods where the cost of jitting the method was higher than every other call to it put together!
Second, remember that the garbage collector runs on another thread. If you are making garbage in one run, then the cost of cleaning up that garbage might not be realized until subsequent runs. You are therefore failing to account for the total cost of one run, by foisting it off onto later runs.
Both of these are indicative of the weakness of all benchmarking: benchmarking is by nature unrealistic, and therefore of limited value. In real-world code, the GC is going to be running, the jitter is going to be running, and so on. It is frequently the case that benchmarked performance is nothing at all like real-world performance because the benchmark does not take into account the variability of real-world costs inherent in a large system. Rather than analyzing perf characteristics in isolation, I prefer to look at the perf characteristics of realistic scenarios actually faced by real customers.
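For the second point, one mitigation is to charge each run for its own garbage by forcing a collection inside the timed region. A minimal sketch (a hypothetical ThisWithGc variant, not code from the question; note that an explicit GC.Collect is more aggressive than a normal collection, so treat the numbers as an upper bound):

public static IEnumerable<TimeSpan> ThisWithGc(Action subject)
{
    subject();                          // warm up / JIT
    var watch = new Stopwatch();
    while (true)
    {
        watch.Reset();
        watch.Start();
        subject();
        GC.Collect();                   // collect the garbage this run created
        GC.WaitForPendingFinalizers();  // include finalizer work in the measurement
        GC.Collect();                   // collect objects freed by those finalizers
        watch.Stop();
        yield return watch.Elapsed;
    }
}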
You should definitely return ElapsedMilliseconds instead of ElapsedTicks. The value returned by ElapsedTicks is dependent upon the Stopwatch frequency, which can be different on different systems. It will not necessarily correspond to the Ticks property of a TimeSpan or DateTime object.
See http://msdn.microsoft.com/en-us/library/system.diagnostics.stopwatch.elapsedticks.aspx.
If you do want the extra resolution of Ticks, you should return watch.Elapsed.Ticks (i.e. TimeSpan.Ticks) instead of watch.ElapsedTicks (this might be one of the subtlest potential errors in .NET). From MSDN:
Stopwatch ticks are different from DateTime.Ticks. Each tick in the DateTime.Ticks value represents one 100-nanosecond interval. Each tick in the ElapsedTicks value represents the time interval equal to 1 second divided by the Frequency.
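To make the difference concrete, here is a minimal sketch (reusing SomeMethod from the question as a placeholder) of the two tick units and the conversion between them:

var watch = Stopwatch.StartNew();
SomeMethod();                                // the code being measured
watch.Stop();

long stopwatchTicks = watch.ElapsedTicks;    // unit: 1 / Stopwatch.Frequency seconds
long timeSpanTicks  = watch.Elapsed.Ticks;   // unit: 100 ns, same as DateTime.Ticks

// Converting by hand gives (approximately) the same value as watch.Elapsed.Ticks:
long converted = (long)(stopwatchTicks * ((double)TimeSpan.TicksPerSecond / Stopwatch.Frequency));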
Other than that, I guess your code is fine, although I think you'd be including some of the method-calling overhead in your measurements, which might be significant if the methods themselves take very little time to execute. Also, you probably would want to exclude the first call to the method from your calculated average, but I'm not sure how you'd do that in your class.
One last point, which would probably not be relevant to most uses of this class: Stopwatch runs a bit fast compared to the system time. On my computer, it gets about 5 seconds (that's seconds, not milliseconds) ahead after 24 hours, and on other machines this drift can be even larger. So it's a little misleading to say it's highly accurate, when it's actually just highly granular. For timing short-duration methods, this obviously wouldn't be a significant problem.
And one more last point, which certainly is relevant: I've often noticed while benchmarking that I'll get a bunch of running times that are all clustered within a narrow range of values (e.g. 80, 80, 79, 82 etc.), but occasionally something else will happen in Windows (like opening another program or my anti-virus kicks on or something) and I'll get a value wildly out of whack with the others (e.g. 80, 80, 79, 271, 80 etc.). I think a simple solution to this outlier problem is to use the median of your measurements instead of the mean. I don't know if Linq supports this automatically or not.
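LINQ does not have a built-in Median operator, but a small extension method covers it. A minimal sketch (the class and method names are hypothetical):

public static class Sequence
{
    public static double Median(this IEnumerable<double> source)
    {
        var sorted = source.OrderBy(x => x).ToList();
        if (sorted.Count == 0)
            throw new InvalidOperationException("Sequence contains no elements");
        int middle = sorted.Count / 2;
        return sorted.Count % 2 != 0
            ? sorted[middle]                               // odd count: take the middle element
            : (sorted[middle - 1] + sorted[middle]) / 2.0; // even count: average the two middle elements
    }
}

Used with the question's class, that would look something like:

var median = Benchmark.This(() => SomeMethod()).Take(500).Select(t => (double)t).Median();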
As I am not a C# programmer, I cannot say with any degree of accuracy whether that class is an appropriate implementation for measuring how long a function execution takes. However, there are things to keep in mind for repeatability and accuracy.
I am not up on the various ins and outs of the .NET Framework, but depending on how it compiles to native code, it is possible that JIT compilation would affect benchmark results. Also, whether or not a function is in the cache can make a difference, too. So you'll want to loop over your function first to ensure that there is no hit from compilation and that everything is loaded and ready, as sketched below. Once that's done, you might be able to get started.
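A minimal sketch of that warm-up idea, assuming subject is the Action being measured and the run count is an arbitrary choice:

const int warmUpRuns = 5;        // arbitrary; enough to trigger compilation and prime caches
for (int i = 0; i < warmUpRuns; i++)
{
    subject();
}
// ...then start timing, as in the question's class.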
Others will probably have better information and knowledge of .NET than I do.