Any Advantage of MPI+CUDA over just pure MPI?

Posted 2019-05-06 14:57

Question:

The usual way to speed up an application is to parallelize it using MPI, or a higher-level library such as PETSc that uses MPI under the hood.

However, nowadays everyone seems to be interested in using CUDA to parallelize their application, or in using a hybrid of MPI and CUDA for more ambitious/larger problems.

Is there any noticeable advantage in using a hybrid MPI+CUDA programming model over the traditional, tried-and-tested MPI model of parallel programming? I am asking this specifically in the application domain of particle methods.

One reason why I am asking this question is that everywhere on the web I see the statement that "particle methods map naturally to the architecture of GPUs", or some variation of it. But they never seem to justify why I would be better off using CUDA than just MPI for the same job.

Answer 1:

This is a bit apples and oranges.

MPI and CUDA address fundamentally different levels of parallelism. Most importantly, MPI lets you distribute your application across several nodes, while CUDA lets you use the GPU within each individual node. If the parallel processes in your MPI program take too long to finish, then yes, you should look into whether they could be sped up by doing their work on the GPU instead of the CPU. Conversely, if your CUDA application still takes too long to finish, you may want to distribute the work across multiple nodes using MPI.

The two technologies are pretty much orthogonal (assuming all the nodes on your cluster are CUDA-capable).
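
To make the division of labor concrete, here is a minimal sketch (my own illustration, not taken from any particular code base) of the hybrid pattern: each MPI rank pins itself to a local GPU, a CUDA kernel does the per-particle work for that rank's sub-domain, and MPI handles the coordination between nodes. The kernel name update_positions and the size N_LOCAL are just placeholders.

    // Hybrid MPI+CUDA sketch: one GPU per MPI rank, CUDA for intra-node
    // number crunching, MPI for inter-node coordination.
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define N_LOCAL (1 << 20)   /* particles owned by this rank (placeholder) */

    __global__ void update_positions(float *x, const float *v, float dt, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] += v[i] * dt;          /* one thread per particle */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, ndev;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaGetDeviceCount(&ndev);
        cudaSetDevice(rank % ndev);     /* pin this rank to a local GPU */

        float *x, *v;
        cudaMalloc(&x, N_LOCAL * sizeof(float));
        cudaMalloc(&v, N_LOCAL * sizeof(float));
        cudaMemset(x, 0, N_LOCAL * sizeof(float));
        cudaMemset(v, 0, N_LOCAL * sizeof(float));

        /* CUDA: per-particle work for this rank's sub-domain */
        update_positions<<<(N_LOCAL + 255) / 256, 256>>>(x, v, 0.01f, N_LOCAL);
        cudaDeviceSynchronize();

        /* MPI: coordination across nodes, e.g. a global reduction */
        float local_max = 0.0f, global_max = 0.0f;
        MPI_Allreduce(&local_max, &global_max, 1, MPI_FLOAT, MPI_MAX, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global max displacement: %f\n", global_max);

        cudaFree(x);
        cudaFree(v);
        MPI_Finalize();
        return 0;
    }

Whether the GPU step actually pays off then comes down to whether the per-rank work is large and regular enough to keep the device busy, which is exactly what the next answer discusses.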



Answer 2:

Just to build on the other poster's already good answer, here is some high-level discussion of which kinds of problems GPUs are good at, and why.

GPUs have followed a dramatically different design path from CPUs because of their distinct origins. Compared to CPU cores, GPU cores contain more ALUs and floating-point hardware and less control logic and cache. This means that GPUs can deliver more throughput for straight computation, but only code with regular control flow and smart memory access patterns sees the full benefit: over a TFLOPS for single-precision floating-point code.

GPUs are designed as high-throughput, high-latency devices at both the control and memory levels. Globally accessible memory has a long, wide bus, so coalesced (contiguous and aligned) memory accesses achieve good throughput despite the long latency. Latencies are hidden by requiring massive thread-level parallelism and by providing essentially zero-overhead context switching in hardware.

GPUs employ a SIMD-like model, SIMT, whereby groups of cores execute in SIMD lockstep (different groups being free to diverge), without forcing the programmer to reckon with this fact, except to achieve best performance: on Fermi, divergence could make a difference of up to 32x. SIMT lends itself to the data-parallel programming model, whereby data independence is exploited to perform similar processing on a large array of data. Efforts are being made to generalize GPUs and their programming model, as well as to make it easier to program them for good performance.
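
To illustrate the data-parallel, coalesced-access style described above, here is a small sketch (my own example, not from the original answer) of a particle drift step written so that consecutive threads in a warp touch consecutive array elements. The structure-of-arrays layout and the kernel name drift_soa are illustrative choices, not anything prescribed.

    // Structure-of-arrays layout: thread i reads element i of each array, so a
    // warp's loads fall in contiguous, aligned memory and coalesce into a few
    // wide transactions. Control flow is identical across the warp (SIMT).
    #include <cuda_runtime.h>
    #include <stdio.h>

    struct ParticlesSoA {
        float *x;    /* positions  */
        float *vx;   /* velocities */
    };

    __global__ void drift_soa(ParticlesSoA p, float dt, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        p.x[i] += p.vx[i] * dt;   /* same instruction, neighbouring data */
    }

    int main(void)
    {
        const int n = 1 << 20;
        ParticlesSoA p;
        cudaMalloc(&p.x,  n * sizeof(float));
        cudaMalloc(&p.vx, n * sizeof(float));
        cudaMemset(p.x,  0, n * sizeof(float));
        cudaMemset(p.vx, 0, n * sizeof(float));

        drift_soa<<<(n + 255) / 256, 256>>>(p, 0.01f, n);
        cudaDeviceSynchronize();
        printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));

        cudaFree(p.x);
        cudaFree(p.vx);
        return 0;
    }

The same loop with an array-of-structures layout (one struct per particle) would scatter each thread's loads and lose much of that bandwidth, which is why layout choices like this matter as much as the raw flop count.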