I'm using non-blocking communication in MPI to send various messages between processes. However, I appear to be getting a deadlock. I have used PADB (see here) to look at the message queues and have got the following output:
1:msg12: Operation 1 (pending_receive) status 0 (pending)
1:msg12: Rank local 4 global 4
1:msg12: Size desired 4
1:msg12: tag_wild 0
1:msg12: Tag desired 16
1:msg12: system_buffer 0
1:msg12: Buffer 0xcaad32c
1:msg12: 'Receive: 0xcac3c80'
1:msg12: 'Data: 4 * MPI_FLOAT'
--
1:msg32: Operation 0 (pending_send) status 2 (complete)
1:msg32: Rank local 4 global 4
1:msg32: Actual local 4 global 4
1:msg32: Size desired 4 actual 4
1:msg32: tag_wild 0
1:msg32: Tag desired 16 actual 16
1:msg32: system_buffer 0
1:msg32: Buffer 0xcaad32c
1:msg32: 'Send: 0xcab7c00'
1:msg32: 'Data transfer completed'
--
2:msg5: Operation 1 (pending_receive) status 0 (pending)
2:msg5: Rank local 1 global 1
2:msg5: Size desired 4
2:msg5: tag_wild 0
2:msg5: Tag desired 16
2:msg5: system_buffer 0
2:msg5: Buffer 0xabbc348
2:msg5: 'Receive: 0xabd1780'
2:msg5: 'Data: 4 * MPI_FLOAT'
--
2:msg25: Operation 0 (pending_send) status 2 (complete)
2:msg25: Rank local 1 global 1
2:msg25: Actual local 1 global 1
2:msg25: Size desired 4 actual 4
2:msg25: tag_wild 0
2:msg25: Tag desired 16 actual 16
2:msg25: system_buffer 0
2:msg25: Buffer 0xabbc348
2:msg25: 'Send: 0xabc5700'
2:msg25: 'Data transfer completed'
This seems to show that the sends have completed, but all of the receives are pending (the above is just a small part of the log, for a tag value of 16). However, how can this happen? Surely sends can't complete without the associated receive completing, as in MPI all sends and receives have to match. At least that's what I thought...
Can anyone provide any insights?
I can provide the code I'm using to do this, but surely Isend and Irecv should work regardless of what order they are all called in, assuming that MPI_Waitall is called right at the end.
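To be clear about what I mean, this is roughly the pattern I'm assuming is valid - a stripped-down sketch with made-up names (a simple ring exchange), not my actual code:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        float send_val = (float)rank, recv_val = 9999.99f;
        MPI_Request requests[2];

        /* Each rank sends to the next rank and receives from the previous one.
         * The order of the Isend/Irecv calls shouldn't matter... */
        MPI_Isend(&send_val, 1, MPI_FLOAT, (rank + 1) % size, 16,
                  MPI_COMM_WORLD, &requests[0]);
        MPI_Irecv(&recv_val, 1, MPI_FLOAT, (rank + size - 1) % size, 16,
                  MPI_COMM_WORLD, &requests[1]);

        /* ...as long as everything is waited on right at the end. */
        MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);

        printf("rank %d received %f\n", rank, recv_val);
        MPI_Finalize();
        return 0;
    }

My real code posts a lot more requests than this, but the structure is the same: post everything, then a single MPI_Waitall at the end.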
Update: Code is available at this gist
Update: I've made various modifications to the code, but it still isn't working quite properly. The new code is at the same gist, and the output I'm getting is at this gist. I have a number of questions/issues with this code:
Why is the output from the final loop (printing all of the arrays) interspersed with the rest of the output when I have an MPI_Barrier() before it to make sure all of the work has been done before printing it out?
Is it possible/sensible to send from rank 0 to rank 0 - will that work OK? (assuming a correct matching receive is posted, of course).
I'm getting lots of very strange long numbers in the output, which I assume is some kind of memory-overwriting problem, or a variable-size problem. The interesting thing is that this must be coming from the MPI communication, because I initialise new_array to a value of 9999.99 and the communication obviously causes it to be changed to these strange values. Any ideas why?
Overall it seems that some of the transposition is occurring (bits of the matrix seem to be transposed...), but definitely not all of it - it's the strange numbers coming up that are worrying me the most!
When using MPI_Isend and MPI_Irecv you have to be sure not to modify the buffers before you wait for the requests to complete, and you are definitely violating this. What if you had the receives all go into a second matrix instead of doing it in place?
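Here's a rough, self-contained sketch of what I mean. Each rank owns one row of a square matrix (with the matrix dimension equal to the number of ranks, which almost certainly isn't your real decomposition) and transposes it by exchanging single floats, with the receives landing in a separate new_row buffer:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        float *old_row = malloc(nprocs * sizeof(float));  /* row `rank` of the matrix: only read by the sends */
        float *new_row = malloc(nprocs * sizeof(float));  /* row `rank` of the transpose: only written by the receives */
        MPI_Request *reqs = malloc(2 * nprocs * sizeof(MPI_Request));
        int nreq = 0;

        for (int j = 0; j < nprocs; j++) {
            old_row[j] = (float)(rank * nprocs + j);  /* element (rank, j) */
            new_row[j] = 9999.99f;                    /* sentinel, as in your code */
        }

        for (int j = 0; j < nprocs; j++) {
            /* element (rank, j) goes to rank j; element (j, rank) arrives from rank j */
            MPI_Isend(&old_row[j], 1, MPI_FLOAT, j, rank * nprocs + j,
                      MPI_COMM_WORLD, &reqs[nreq++]);
            MPI_Irecv(&new_row[j], 1, MPI_FLOAT, j, j * nprocs + rank,
                      MPI_COMM_WORLD, &reqs[nreq++]);
        }

        /* neither row is touched again until every request has completed */
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);

        for (int j = 0; j < nprocs; j++)
            printf("rank %d: new_row[%d] = %g\n", rank, j, new_row[j]);

        free(old_row);
        free(new_row);
        free(reqs);
        MPI_Finalize();
        return 0;
    }

The important part is that old_row is only ever read by the sends and new_row is only ever written by the receives, so no buffer is reused while a request is still in flight.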
Also, global_x2 * global_y2 is your tag, but I'm not sure that it will be unique for every send-receive pair, which could be messing things up. What happens if you switch to sending with tag (global_y2 * global_columns) + global_x2 and receiving with tag (global_x2 * global_columns) + global_y2?
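To see why the multiplied tag can collide, here's a quick standalone check (no MPI involved; I'm just guessing small dimensions since I don't know your real global_columns):

    #include <stdio.h>

    int main(void)
    {
        /* pretend dimensions - substitute your real global_columns / rows */
        const int global_columns = 4, global_rows = 3;

        for (int global_y2 = 0; global_y2 < global_rows; global_y2++)
            for (int global_x2 = 0; global_x2 < global_columns; global_x2++)
                printf("(x=%d, y=%d): x*y = %2d  y*cols+x = %2d\n",
                       global_x2, global_y2,
                       global_x2 * global_y2,
                       global_y2 * global_columns + global_x2);

        return 0;
    }

With x*y, for example, (1,2) and (2,1) both give 2 and every pair with x=0 gives 0, whereas y*global_columns + x maps every (x, y) to a distinct tag, so a receive can never match the wrong send.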
Edit: As for your question about output, I'm assuming you are testing this by running all your processes on the same machine and just looking at the standard output. When you do it this way, your output gets buffered oddly by the terminal, even though the printf code all executes before the barrier. There are two ways I get around this: you could either print to a separate file for each process, or you could send your output as messages to process 0 and let it do all the actual printing.
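Here's a sketch of the second option - every rank formats its output into a string and ships it to process 0, which prints everything in rank order (the buffer size and tag here are arbitrary choices for the sketch):

    #include <mpi.h>
    #include <stdio.h>

    #define LINE_LEN 256

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        char line[LINE_LEN];
        snprintf(line, LINE_LEN, "rank %d: my part of the result goes here", rank);

        if (rank == 0) {
            printf("%s\n", line);                     /* rank 0 prints its own line first */
            for (int src = 1; src < nprocs; src++) {  /* then everyone else's, in order */
                MPI_Recv(line, LINE_LEN, MPI_CHAR, src, 99,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("%s\n", line);
            }
        } else {
            MPI_Send(line, LINE_LEN, MPI_CHAR, 0, 99, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }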