MPI C++ matrix addition, function arguments, and f

I've been learning C++ from the internet for the past 2 years and finally the need has arisen for me to delve into MPI. I've been scouring stackoverflow and the rest of the internet (including http://people.sc.fsu.edu/~jburkardt/cpp_src/mpi/mpi.html and https://computing.llnl.gov/tutorials/mpi/#LLNL). I think I've got some of the logic down, but I'm having a hard time wrapping my head around the following:

#include (stuff)
using namespace std;

vector<double> function(vector<double> &foo, const vector<double> &bar, int dim, int rows);

int main(int argc, char** argv)
{
    vector<double> result;//represents a regular 1D vector
    int id_proc, tot_proc, root_proc = 0;
    int dim;//set to number of "columns" in A and B below
    int rows;//set to number of "rows" of A and B below
    vector<double> A(dim*rows), B(dim*rows);//represent matrices as 1D vectors

    MPI::Init(argc,argv);
    id_proc = MPI::COMM_WORLD.Get_rank();
    tot_proc = MPI::COMM_WORLD.Get_size();

    /*
    initialize A and B here on root_proc with RNG and Bcast to everyone else
    */

    //allow all processors to call function() so they can each work on a portion of A
    result = function(A,B,dim,rows);

    //all processors do stuff with A
    //root_proc does stuff with result (doesn't matter if other processors have updated result)

    MPI::Finalize();
    return 0;
}

vector<double> function(vector<double> &foo, const vector<double> &bar, int dim, int rows)
{
    /*
    purpose of function() is two-fold:
    1. update foo because all processors need the updated "matrix"
    2. get the average of the "rows" of foo and return that to main (only root processor needs this)
    */

    vector<double> output(dim,0);

    //add matrices the way I would normally do it in serial
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < dim; j++)
        {
            foo[i*dim + j] += bar[i*dim + j];//perform "matrix" addition (+= ON PURPOSE)
        }
    }

    //obtain average of rows in foo in serial
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < dim; j++)
        {
            output[j] += foo[i*dim + j];//sum rows of A
        }
    }

    for (int j = 0; j < dim; j++)
    {
            output[j] /= rows;//divide to obtain average
    }

    return output;        
}

The code above is to illustrate the concept only. My main concern is to parallelize the matrix addition but what boggles my mind is this:

1) If each processor only works on a portion of that loop (naturally I'd have to modify the loop parameters per processor) what command do I use to merge all portions of A back into a single, updated A that all processors have in their memory. My guess is that I have to do some kind of Alltoall where each processor sends its portion of A to all other processors, but how do I guarantee that (for example) row 3 worked on by processor 3 overwrites row 3 of the other processors, and not row 1 by accident.

2) If I use an Alltoall inside function(), do all processors have to be allowed to step into function(), or can I isolate function() using...

if (id_proc == root_proc)
{
    result = function(A,B,dim,rows);
}

… and then inside function() handle all the parallelization. As silly as it sounds, I'm trying to do a lot of the work on one processor (with broadcasts), and just parallelize the big time-consuming for loops. Just trying to keep the code conceptually simple so I can get my results and move on.

3) For the averaging part, I'm sure I can just use a reducing command if I wanted to parallelize it, correct?

Also, as an aside: is there a way to call Bcast() such that it is blocking? I'd like to use it to synchronize all my processors (boost libraries are not an option). If not then I'll just go with Barrier(). Thank you for your answer to this question, and to the community of stackoverflow for learning me how to program over the past two years! :)

1) The function you are looking is MPI_Allgather. MPI_Allgather will let you send a row from each processor and receive the result on all processors.

2) Yes you can use some of the processors in your function. Since MPI functions work with communicators you have to create a separate communicator for this purpose. I don't know how this is implemented in the C++ bindings but C bindings use the MPI_Comm_create function.

3) Yes see MPI_Allreduce.

aside: Bcast blocks a process until send/receive operation assigned to that process is finished. If you want to wait for all processors to finish their work (I don't have any idea why you would want to do this) you should use Barrier().

extra note: I wouldn't recommend using the C++ bindings as they are depreciated and you won't find specific examples on how to use them. Boost MPI is the library to use if you want C++ bindings however it does not cover all of MPI functions.