MPI help on how to parallelize my code

I am very much a newbie in this subject and need help on how to parallelize my code. I have a large 1D array that in reality describes a 3D volume: 21x21x21 single precision values. I have 3 computers that I want to engage in the computation. The operation that is performed on each cell in the grid(volume) is identical for all cells. The program takes in some data and perform some simple arithmetics on them and the return value is assigned to the grid cell.

My non-parallized code is:

float zg, yg, xg;
stack_result = new float[Nz*Ny*Nx];
// StrMtrx[8] is the vertical step size, StrMtrx[6]  is the vertical starting point 
for (int iz=0; iz<Nz; iz++) {
  zg = iz*StRMtrx[8]+StRMtrx[6];  // find the vertical position in meters
  // StrMtrx[5] is the crossline step size, StrMtrx[3]  is the crossline starting point
  for (int iy=0; iy<Ny; iy++) {
    yg = iy*StRMtrx[5]+StRMtrx[3];  // find the crossline position
    // StrMtrx[2] is the inline step size, StrMtrx[0]  is the inline starting point
    for (int ix=0; ix < nx; ix++) { 
      xg = ix*StRMtrx[2]+StRMtrx[0]; // find the inline position
      // do stacking on each grid cell
      // "Geoph" is the geophone ids, "Ngeo" is the number of geophones involved,
      // "pahse_use" is the wave type, "EnvMtrx" is the input data common to all
      // cells, "Mdata" is the length of input data
      stack_result[ix+Nx*iy+Nx*Ny*iz] =
        stack_for_qds(Geoph, Ngeo, phase_use, xg, yg, zg, EnvMtrx, Mdata);  
    }        
  }
}

Now I take in 3 computers and divide the volume in 3 vertical segments, so I would then have 3 sub-volumes each 21x21x7 cells. (note the parsing of the volume is in z,y,x). The variable "stack_result" is the complete volume. My parallellized version (which utterly fails, I only get one of the sub-volumes back) is:

MPI_Status status;
int rank, numProcs, rootProcess;
ierr = MPI_Init(&argc, &argv);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
int rowsInZ = Nz/numProcs;  // 7 cells in Z (vertical)
int chunkSize = Nx*Ny*rowsInZ;
float *stack_result = new float[Nz*Ny*Nx];
float zg, yg, xg;
rootProcess = 0;
if(rank == rootProcess) {
  offset = 0;
  for (int n = 1; n < numProcs; n++) { 
    // send rank
    MPI_Send(&n, 1, MPI_INT, n, 2, MPI_COMM_WORLD);
    // send the offset in array
    MPI_Send(&offset, 1, MPI_INT, n, 2, MPI_COMM_WORLD);
    // send volume, now only filled with zeros,
    MPI_Send(&stack_result[offset], chunkSize, MPI_FLOAT, n, 1, MPI_COMM_WORLD);
    offset = offset+chunkSize;
  }
  // receive results
  for (int n = 1; n < numProcs; n++) { 
    int source = n;
    MPI_Recv(&offset, 1, MPI_INT, source, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(&stack_result[offset], chunkSize, MPI_FLOAT, source, 1, MPI_COMM_WORLD, &status);
  }
}  else {
  int rank;
  int source = 0;
  int ierr = MPI_Recv(&rank, 1, MPI_INT, source, 2, MPI_COMM_WORLD, &status);
  ierr = MPI_Recv(&offset, 1, MPI_INT, source, 2, MPI_COMM_WORLD, &status);
  ierr = MPI_Recv(&stack_result[offset], chunkSize, MPI_FLOAT, source, 1, MPI_COMM_WORLD, &status);       
  int nz = rowsInZ;  // sub-volume vertical length
  int startZ = (rank-1)*rowsInZ;
  for (int iz = startZ; iz < startZ+nz; iz++) {
    zg = iz*StRMtrx[8]+StRMtrx[6];
    for (int iy = 0; iy < Ny; iy++) {
      yg = iy*StRMtrx[5]+StRMtrx[3];
      for (int ix = 0; ix < Nx; ix++) {
        xg = ix*StRMtrx[2]+StRMtrx[0];
        stack_result[offset+ix+Nx*iy+Nx*Ny*iz]=
          stack_for_qds(Geoph, Ngeo, phase_use, xg, yg, zg, EnvMtrx, Mdata);
      }  // x-loop
    }  // y-loop
  }   // z-loop
  MPI_Send(&offset, 1, MPI_INT, source, 2, MPI_COMM_WORLD);
  MPI_Send(&stack_result[offset], chunkSize, MPI_FLOAT, source, 1, MPI_COMM_WORLD);
}  // else
write("stackresult.dat", stack_result);
delete [] stack_result;
MPI_Finalize();

Thanks in advance for your patience.

You are calling write("stackresult.dat", stack_result); in all MPI ranks. As a result, they all write into and thus overwrite the same file and what you see is the content written by the last MPI process to execute that code statement. You should move the writing into the body of the if (rank == rootProcess) conditional so that only the root process will write.

As a side note, sending the value of the rank is redundant - MPI already assigns each process a rank that ranges from 0 to #processes - 1. That also makes sending of the offset redundant since each MPI process could easily compute the offset on its own based on its rank.