The speed of .NET in numerical computing

2019-03-09 03:46发布

问题:

In my experience, .NET is 2 to 3 times slower than native code. (I implemented L-BFGS for multivariate optimization).

I have traced the ads on stackoverflow to http://www.centerspace.net/products/

the speed is really amazing, the speed is close to native code. How can they do that? They said that:

Q. Is NMath "pure" .NET?

A. The answer depends somewhat on your definition of "pure .NET". NMath is written in C#, plus a small Managed C++ layer. For better performance of basic linear algebra operations, however, NMath does rely on the native Intel Math Kernel Library (included with NMath). But there are no COM components, no DLLs--just .NET assemblies. Also, all memory allocated in the Managed C++ layer and used by native code is allocated from the managed heap.

Can someone explain more to me?

回答1:

The point about C++/CLI is correct. To complete the picture, just two additional interesting points:

  • .NET memory management (garbage collector) obviously is not the problem here, as NMath still depends on it

  • The performance advantage is actually provided by Intel MKL, which offers implementations extremely optimized for many CPUs. From my point of view, this is the crucial point. Using straight-forward, naiv C/C++ code wont necessarily give you superior performance over C#/.NET, it's sometimes even worse. However C++/CLI allows you to exploit all the "dirty" optimization options.



回答2:

How can they do that?

Like most of the numerical libraries for .NET, NMath is little more than a wrapper over an Intel MKL embedded in the .NET assembly, probably by linking with C++/CLI to create a mixed assembly. You've probably just benchmarked those bits that are not actually written in .NET.

The F#.NET Journal articles Numerical Libraries: special functions, interpolation and random numbers (16th March 2008) and Numerical Libraries: linear algebra and spectral methods (16th April 2008) tested quite a bit of functionality and NMath was actually the slowest of all the commercial libraries. Their PRNG was slower than all others and 50% slower than the free Math.NET library, some basic functionality was missing (e.g. the ability to calculate Gamma(-0.5)) and other basic functionality (the Gamma-related functions they did provide) was broken. Both Extreme Optimization and Bluebit beat NMath at the eigensolver benchmark. NMath didn't even provide a Fourier Transform at the time.

Even more surprisingly, the performance discrepancies were sometimes huge. The most expensive commercial numerical library we tested (IMSL) was over 500× slower than the free FFTW library at the FFT benchmark and none of the libraries made any use of multiple cores at the time.

In fact, it was precisely the poor quality of these libraries that encouraged us to commercialize our own F# for Numerics library (which is 100% pure F# code).



回答3:

I am one of the lead developers of ILNumerics. So I am biased, obviously ;) But we are more disclosed regarding our internals, so I will give some insights into our speed 'secrets'.

It all depends on how system resources are utilized! If you are about pure speed and need to handle large arrays, you will make sure to (ordered by importance, most important first)

  1. Manage your memory appropriately! 'Naive' memory management will lead to bad performance, since it stresses the GC badly, causes memory fragmentation and degrades memory locality (hence cache performance). In a garbage collected environment like .NET, this boils down to preventing from frequent memory allocations. In ILNumerics, we implemented a high performance memory pool in order to archieve this goal (and deterministic disposal of temporary arrays to get a nice, comfortable syntax without clumsy function semantics).

  2. Utilize parallelism! This targets both: thread level parallelism and data level parallelism. Multiple cores are utilized by threading computation intensive parts of the calculation. On X86/X64 CPUs SIMD/multimedia extensions like SSE.XX and AVX allow a small but effective vectorization. They are not directly addressable by current .NET languages. And this is the only reason, why MKL may is still faster than 'pure' .NET code. (But solutions are already rising.)

  3. To archieve the speed of highly optimized languages like FORTRAN and C++, the same optimizations must get applied to your code as done for them. C# offers the option do do so.

Note, these precautions should be followed in that order! It does not make sense to care about SSE extensions or even bound check removal, if the bottleneck is the memory bandwith and the processor(s) spend most the time waiting for new data. Also, for many simple operations it does not even pays of to invest huge efforts to archieve the very last tiny scale up to peak performance! Consider the common example of the LAPACK function DAXPY. It adds the elements of a vector X to the corresponding element of another vector Y. If this is done for the first time, all the memory for X and Y will have to get fetched from the main memory. There is little to nothing you can do about it. And memory is the bottleneck! So regardless if the addition at the end is done the naive way in C#

for (int i = 0; i < C.Length; i++) {
    C[i] = X[i] + Y[i]; 
}

or done by using vectorization strategies - it will have to wait for memory!

I know, this answer does somehow 'over answers' the question, since most of these strategies are currently not utilized from the mentioned product (yet?). By following thoses points, you would eventually end up with much better performance than every naive implementation in a 'native' language.

If you are interested, you could disclose your implementation of L-BFGS? I'll be happy to convert it to ILNumerics and post comparison results and I am sure, other libraries listed here would like to follow. (?)



回答4:

I've posted up a blog article addressing this question.



回答5:

The key is C++/CLI. It allows you to compile C++ code into a managed .NET assembly.



回答6:

Today it's industry standard to make mixed .Net/native libraries in order to take advantages of both platforms for performance optimization. Not only NMath, many commercial and free libraries with .net interface working like this. For example: Math.NET Numerics, dnAnalytics, Extreme Optimization, FinMath and many others. Integration with MKL is extremely popular for .net numerical libraries, and most of them just use Managed C++ assembly as an intermediate level. But this solution has a number of drawbacks:

  1. Intel MKL is a proprietary software and it's a bit expensive. But some libraries like dnAnalytics provides a free replacement of MKL functionality with pure .net code. Off course, it's much slower, but it's free and fully functional.

  2. It reduces your compatibility you need to have heavy managed C++ kernel dlls for both 32bit and 64bit mode.

  3. Managed to native calls need performing marshaling which slow down performance of fast frequently called operations such as Gamma or NormalCDF.

Last two problems solved in RTMath FinMath library. I don't really know how they did it, but they provide single pure .net dll which compiled for Any CPU platform and supports 32bit and 64bit. Also I didn't seen any performance degradation against MKL when I need to call NormalCDF billions times.



回答7:

Since the (native) Intel MKL is doing the math, you're actually not doing the math in managed code. You're merely using the memory manager from .Net, so the outcomes are easily used by .Net code.



回答8:

I learnt more form @Darin Dimitrov's comment to his answer and @Trevor Misfeldt's comment to @Darin's comment. Hence posting it as an answer, for future readers.

NMath uses P/Invoke or C++/CLI to call Intel Math Kernel Library native functions which is where the most intensive calculations are done and which is why it is so fast.

The time is spent in decomposition methods inside of Intel's MKL. No copying of data is required, either. So, it's not an issue of whether CLI is fast or not. It's about where the execution happens.

Also @Paul's blog is also a good read. Here's the summary.

C# is Fast, Memory Allocation Is Not. Reuse the variables as ref or out parameters, instead of returning new variables from methods. Allocating a new variable consumes memory and slows down execution. @Haymo Kutschbach has explained this well.

If the precision is not necessary, the performance gain in switching from double to single precision is considerable (not to mention the memory saving for the data storage).

For many short computations, to call a C++/cli routine from C#, pinning all pointers to data allocated in the managed space, and then call the Intel library is generally better than using P/Invoke to call the library directly from C#, due to the cost of marshaling the data . As mentioned by @Haymo Kutschbach in comments, for blittable types however, no difference between C++/CLI and C#. Arrays of blittable types and classes that contain only blittable members are pinned instead of copied during marshaling. Refer https://msdn.microsoft.com/en-us/library/75dwhxf7(v=vs.110).aspx for a list of blittable and non-blittable types.