Collective diagonal neighborhood communication

Posted on 2019-07-29 00:47

Question:

I have a Cartesian process topology in 3D. However, I describe my problem in 2D to simplify it.

For the collective nearest-neighbor communication (left image) I use MPI_Neighbor_alltoallw(), which allows sending and receiving different datatypes. However, this function does not cover the diagonal neighbors (right image), so I need another approach for them.

Left: the green cells are the nearest neighbors. Right: the red cells are the nearest diagonal neighbors.

What I have in mind for implementing the diagonal neighbor communication is:

int main_rank;           // rank of the gray process
int main_coords[2];      // coordinates of the gray process
MPI_Comm_rank (comm_cart, &main_rank);
MPI_Cart_coords (comm_cart, main_rank, 2, main_coords);

// finding the rank of the top-right neighbor
int top_right_rank;
int top_right_coords[2] = {main_coords[0]+1, main_coords[1]+1};
MPI_Cart_rank (comm_cart, top_right_coords, &top_right_rank);

// SEND DATA: MPI_Isend(...);    
// RECEIVE DATA: MPI_Irecv(...);
// MPI_Waitall(...);

// REPEAT FOR OTHER DIAGONAL NEIGHBORS
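
In outline, the point-to-point part could look like the sketch below. It assumes the topology is periodic in both dimensions (otherwise the +1 coordinates would have to be checked against the grid size before calling MPI_Cart_rank), and send_buf/recv_buf are hypothetical placeholders for the actual corner data:

// Hypothetical one-element corner buffers; in practice these would be
// described by MPI datatypes covering the corner cells of the local array.
double send_buf[1], recv_buf[1];
MPI_Request reqs[2];

// Non-blocking exchange with the top-right neighbor.
MPI_Irecv (recv_buf, 1, MPI_DOUBLE, top_right_rank, 0, comm_cart, &reqs[0]);
MPI_Isend (send_buf, 1, MPI_DOUBLE, top_right_rank, 0, comm_cart, &reqs[1]);
MPI_Waitall (2, reqs, MPI_STATUSES_IGNORE);

The same pattern would then be repeated for the other three diagonal neighbors.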

Questions

  1. Is there any collective diagonal neighborhood communication in the MPI standard?
  2. What is an efficient and less error-prone implementation?
  3. Do you have any suggestions to improve my implementation?

Answer 1:

This is the common question of how to update ghost cells/halos in MPI, and there is an elegant solution to it: there is no need for diagonal communication at all :-).

So how do you do it without those painful diagonals? :-)

Let's take a simple example: a 2x2 periodic torus with 4 processes and a halo of width 1.

x x x  x x x
x 1 x  x 2 x
x x x  x x x

x x x  x x x
x 3 x  x 4 x    
x x x  x x x

First, let's work on the vertical direction: here we send only the interior data, not the ghost cells.

x 3 x  x 4 x
x 1 x  x 2 x
x 3 x  x 4 x

x 1 x  x 2 x
x 3 x  x 4 x
x 1 x  x 2 x

Now let's work on the horizontal direction, but this time we also send the ghost cells:

x 3 x    3   x 4 x
x 1 x -> 1 ->x 2 x
x 3 x    3   x 4 x

So we get:

4 3 4  3 4 3
2 1 2  1 2 1
4 3 4  3 4 3

2 1 2  1 2 1
4 3 4  3 4 3
2 1 2  1 2 1

That is the elegant (and most efficient) way of doing it: the diagonal communication is replaced by two communication steps (both of which are needed during the process anyway).
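
As an illustration, here is a minimal sketch of the two-phase exchange, assuming a periodic 2D Cartesian grid, a halo of width 1, and a local array u[NY+2][NX+2]; the names u, NX and NY are illustrative, not taken from the question:

#include <mpi.h>

#define NX 4
#define NY 4

int main (int argc, char **argv)
{
    MPI_Init (&argc, &argv);

    int dims[2] = {0, 0}, periods[2] = {1, 1}, nprocs, rank;
    MPI_Comm_size (MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create (nprocs, 2, dims);

    MPI_Comm comm_cart;
    MPI_Cart_create (MPI_COMM_WORLD, 2, dims, periods, 0, &comm_cart);
    MPI_Comm_rank (comm_cart, &rank);

    double u[NY + 2][NX + 2];                 // local grid including the halo
    for (int i = 0; i < NY + 2; i++)
        for (int j = 0; j < NX + 2; j++)
            u[i][j] = rank;                   // dummy data

    // One interior row: NX contiguous cells, corners excluded.
    MPI_Datatype row_t;
    MPI_Type_contiguous (NX, MPI_DOUBLE, &row_t);
    MPI_Type_commit (&row_t);

    // One full column: NY+2 strided cells, corners included.
    MPI_Datatype col_t;
    MPI_Type_vector (NY + 2, 1, NX + 2, MPI_DOUBLE, &col_t);
    MPI_Type_commit (&col_t);

    int up, down, left, right;
    MPI_Cart_shift (comm_cart, 0, 1, &up, &down);     // vertical neighbors
    MPI_Cart_shift (comm_cart, 1, 1, &left, &right);  // horizontal neighbors

    // Phase 1: vertical exchange of the interior rows only.
    MPI_Sendrecv (&u[1][1],      1, row_t, up,    0,
                  &u[NY + 1][1], 1, row_t, down,  0, comm_cart, MPI_STATUS_IGNORE);
    MPI_Sendrecv (&u[NY][1],     1, row_t, down,  1,
                  &u[0][1],      1, row_t, up,    1, comm_cart, MPI_STATUS_IGNORE);

    // Phase 2: horizontal exchange of full columns (ghost cells included),
    // which carries the data received in phase 1 into the corner ghost cells.
    MPI_Sendrecv (&u[0][1],      1, col_t, left,  2,
                  &u[0][NX + 1], 1, col_t, right, 2, comm_cart, MPI_STATUS_IGNORE);
    MPI_Sendrecv (&u[0][NX],     1, col_t, right, 3,
                  &u[0][0],      1, col_t, left,  3, comm_cart, MPI_STATUS_IGNORE);

    MPI_Type_free (&row_t);
    MPI_Type_free (&col_t);
    MPI_Finalize ();
    return 0;
}

The key detail is the column datatype in phase 2: it spans NY+2 cells, so the ghost rows filled in phase 1 travel along with the interior column, and the corner ghost cells end up holding the diagonal neighbors' values without any explicit diagonal message.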



Tags: mpi