MPI_SEND takes huge part of virtual memory

Posted 2019-07-02 17:13

While debugging my program on a large number of cores, I ran into a very strange "insufficient virtual memory" error. My investigation led me to the piece of code where the master sends small messages to each of the slaves. So I wrote a small test program in which one master simply sends 10 integers with MPI_SEND and all the slaves receive them with MPI_RECV. Comparing /proc/self/status before and after the MPI_SEND shows that the difference in memory usage is huge! The most interesting thing (which is what crashes my program) is that this memory is not deallocated after MPI_Send and still takes up a huge amount of space.

Any ideas?
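For reference, the test looks roughly like the sketch below (an illustrative reconstruction, not the exact original program; the names and the status-printing helper are mine). Rank 0 sends 10 integers to every other rank with MPI_Send, the other ranks receive them with MPI_Recv, and /proc/self/status is dumped before and after the sends.

/* Minimal sketch of the test described above. */
#include <mpi.h>
#include <stdio.h>

/* Dump /proc/self/status so the memory counters can be compared. */
static void print_status(const char *label, int rank)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return;
    printf("\n System memory usage %s, rank %d\n", label, rank);
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);
    fclose(f);
}

int main(int argc, char **argv)
{
    int rank, size;
    int msg[10] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        print_status("before MPI_Send", rank);
        for (int dest = 1; dest < size; dest++)
            MPI_Send(msg, 10, MPI_INT, dest, 0, MPI_COMM_WORLD);
        print_status("after MPI_Send", rank);
    } else {
        MPI_Recv(msg, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpirun across the cluster nodes, rank 0 then prints the two status snapshots shown below.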

 System memory usage before MPI_Send, rank: 0
Name:   test_send_size                                                                                
State:  R (running)                                                                                  
Pid:    7825                                                                                           
Groups: 2840                                                                                        
VmPeak:   251400 kB                                                                                 
VmSize:   186628 kB                                                                                 
VmLck:        72 kB                                                                                  
VmHWM:      4068 kB                                                                                  
VmRSS:      4068 kB                                                                                  
VmData:    71076 kB                                                                                 
VmStk:        92 kB                                                                                  
VmExe:       604 kB                                                                                  
VmLib:      6588 kB                                                                                  
VmPTE:       148 kB                                                                                  
VmSwap:        0 kB                                                                                 
Threads:    3                                                                                          

 System memory usage after MPI_Send, rank 0
Name:   test_send_size                                                                                
State:  R (running)                                                                                  
Pid:    7825                                                                                           
Groups: 2840                                                                                        
VmPeak:   456880 kB                                                                                 
VmSize:   456872 kB                                                                                 
VmLck:    257884 kB                                                                                  
VmHWM:    274612 kB                                                                                  
VmRSS:    274612 kB                                                                                  
VmData:   341320 kB                                                                                 
VmStk:        92 kB                                                                                  
VmExe:       604 kB                                                                                  
VmLib:      6588 kB                                                                                  
VmPTE:       676 kB                                                                                  
VmSwap:        0 kB                                                                                 
Threads:    3        

Answer 1:

This is expected behaviour from virtually any MPI implementation running over InfiniBand. The IB RDMA mechanisms require that data buffers be registered, i.e. they are first locked into a fixed position in physical memory and then the driver tells the InfiniBand HCA how to map virtual addresses to that physical memory. Registering memory for use by the IB HCA is a very complex and therefore very slow process, which is why most MPI implementations never unregister memory that was once registered, in the hope that the same memory will later be used again as a source or a destination. If the registered memory was heap memory, it is never returned to the operating system, and that is why your data segment only grows in size.
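For intuition about what registration means at the verbs level, here is a minimal libibverbs sketch (this is not Intel MPI internals, and error handling is stripped to the bare minimum). The pinned pages show up directly in VmLck in /proc/self/status until they are deregistered, and a library that caches registrations simply never performs that last step:

/* Sketch: registering a buffer with the HCA pins it in physical memory. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no InfiniBand devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    size_t len = 64UL * 1024 * 1024;   /* 64 MiB of ordinary heap memory */
    void *buf = malloc(len);

    /* ibv_reg_mr() locks the pages and maps them for the HCA; after this
     * call VmLck grows by roughly 64 MiB. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }

    /* ... the buffer could now be used as an RDMA source or target ... */

    ibv_dereg_mr(mr);     /* unpins the pages; a registration cache skips this */
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}

(Link with -libverbs; it only does something useful on a node with an IB HCA.)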

Reuse send and receive buffers as much as possible. Keep in mind that communication over InfiniBand incurs a high memory overhead. Most people don't really think about it and it is usually poorly documented, but InfiniBand uses a lot of special data structures (queues) which are allocated in the process's memory, and these queues grow significantly with the number of processes. In some fully-connected cases the amount of queue memory can be so large that no memory is actually left for the application.
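A sketch of the buffer-reuse advice (names and sizes are illustrative, and it mainly pays off for messages above the eager threshold, since small messages are copied into pre-registered internal buffers anyway): keep one long-lived buffer for the whole run and send from it repeatedly, so the registration done for the first transfer is reused instead of a new region being registered, and never released, on every iteration.

/* Sketch: one long-lived send buffer reused across all iterations. */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 20)   /* 1M doubles, well above any eager threshold */

void run_steps(int nsteps, int dest)
{
    double *buf = malloc(N * sizeof *buf);   /* allocated once */

    for (int step = 0; step < nsteps; step++) {
        /* ... refill buf in place; do NOT allocate a fresh buffer here ... */
        MPI_Send(buf, N, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }

    free(buf);   /* released once, at the end of the run */
}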

There are certain parameters that control the IB queues used by Intel MPI. The most important one in your case is I_MPI_DAPL_BUFFER_NUM, which controls the amount of preallocated and preregistered memory. Its default value is 16, so you might want to decrease it. Be aware of the possible performance implications, though. You can also try using dynamically sized preallocated buffers by setting I_MPI_DAPL_BUFFER_ENLARGEMENT to 1. With this option enabled, Intel MPI initially registers small buffers and grows them later if needed. Note also that IMPI opens connections lazily, which is why you only see the huge increase in used memory after the call to MPI_Send.

If you are not using the DAPL transport, e.g. you use the ofa transport instead, there is not much you can do. You can enable XRC queues by setting I_MPI_OFA_USE_XRC to 1, which should decrease the memory used somewhat. Also, enabling dynamic queue-pair creation by setting I_MPI_OFA_DYNAMIC_QPS to 1 might decrease memory usage if the communication graph of your program is not fully connected (a fully-connected program is one in which every rank talks to every other rank).



Answer 2:

Hristo's answer is mostly right, but since you are using small messages there's a bit of a difference. The messages end up on the eager path: they first get copied to an already-registered buffer, then that buffer is used for the transfer, and the receiver copies the message out of an eager buffer on their end. Reusing buffers in your code will only help with large messages.

This is done precisely to avoid the slowness of registering the user-supplied buffer. For large messages the copy takes longer than the registration would, so the rendezvous protocol is used instead.

These eager buffers are somewhat wasteful. For example, they are 16kB by default on Intel MPI with OF verbs. Unless message aggregation is used, each 10-int-sized message is eating four 4kB pages. But aggregation won't help when talking to multiple receivers anyway.
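To put a rough number on that (a back-of-the-envelope estimate combining the defaults quoted in the two answers, not a measured figure): with 16 eager buffers of 16 kB each per connection, every connection pins on the order of 16 × 16 kB = 256 kB, so a rank that lazily opens connections to 1000 peers would have roughly 256 MB tied up in eager buffers alone.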

So what to do? Reduce the size of the eager buffers. This is controlled by setting the eager/rendezvous threshold (I_MPI_RDMA_EAGER_THRESHOLD environment variable). Try 2048 or even smaller. Note that this can result in a latency increase. Or change the I_MPI_DAPL_BUFFER_NUM variable to control the number of these buffers, or try the dynamic resizing feature that Hristo suggested. This assumes your IMPI is using DAPL (the default). If you are using OF verbs directly, the DAPL variables won't work.


Edit: So the final solution for getting this to run was setting I_MPI_DAPL_UD=enable. I can speculate on the origin of the magic, but I don't have access to Intel's code to actually confirm this.

IB can have different transport modes, two of which are RC (Reliable Connected) and UD (Unreliable Datagram). RC requires an explicit connection between hosts (like TCP), and some memory is spent per connection. More importantly, each connection has those eager buffers tied to it, and this really adds up. This is what you get with Intel's default settings.

There is an optimization possible: sharing the eager buffers between connections (this is called SRQ - Shared Receive Queue). There's a further Mellanox-only extension called XRC (eXtended RC) that takes the queue sharing further: between the processes that are on the same node. By default Intel's MPI accesses the IB device through DAPL, and not directly through OF verbs. My guess is this precludes these optimizations (I don't have experience with DAPL). It is possible to enable XRC support by setting I_MPI_FABRICS=shm:ofa and I_MPI_OFA_USE_XRC=1 (making Intel MPI use the OFA interface instead of DAPL).

When you switch to the UD transport you get a further optimization on top of buffer sharing: there is no longer a need to track connections. The buffer sharing is natural in this model: since there are no connections, all the internal buffers are in a shared pool, just like with SRQ. So there are further memory savings, but at a cost: datagram delivery can potentially fail, and it is up to the software, not the IB hardware to handle retransmissions. This is all transparent to the application code using MPI, of course.



Source: MPI_SEND takes huge part of virtual memory