Parallel assembly of a sparse matrix in python

I'm trying to use mpi4py to assemble a very large sparse matrix in parallel. Each rank produces a sparse sub matrix (in scipy's dok format) that needs to be put in place in the very large matrix. So far I have succeeded if each rank produces a numpy array containing the indices and the values of the nonzero values (mimicking the coo format). After the gather procedure I can assemble the large matrix from the numpy arrays. The final matrix is to be written to disk as an mtx format file.

What is most efficient way of gathering the sparse submatrices? perhaps, passing them directly as arguments to gather()? but how?

Here's a simplified example of what I do: It assembles a large diagonal matrix out of diagonal submatrices, in the real case the resulting large matrix is typically 500000x500000 in size and not diagonal.

from mpi4py import MPI
from numpy import *
import time
import scipy.sparse as ss
import scipy.io as sio

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    tic = time.clock()      

# each rank generates a sparse matrix with N entries on the diagonal
N = 10000
tmp = ss.eye(N, format = 'dok') * rank

# extract indices and values
i,j = tmp.nonzero()
val = tmp.values()

# create the output array of each rank   
out = zeros((size(val),3))

# fill the output numpy array, shifting the indices according to the rank
out[:,0] = val
out[:,1] = i + rank * N
out[:,2] = j + rank * N

# gather all the arrays representing the submatrices
full_array = comm.gather(out,root=0)

if rank == 0:

    sp = shape(full_array)
    f = reshape(full_array, (sp[0]*sp[1],sp[2]))

    # this is the final result
    final_result = ss.csr_matrix( ( f[:,0], (f[:,1], f[:,2]) ) )
    sio.mmwrite('final.mtx', final_result)
    toc = time.clock()
    print 'Matrix assembled and written in', toc-tic, 'seconds'

For what is worth, using three element lists work pretty well as suggested by hpaulj. Here's a working example:

from mpi4py import MPI
from numpy import *
import scipy.sparse as ss
from timeit import default_timer as timer

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    tic = timer()      

# each rank generates a sparse matrix with N entries on the diagonal
N = 100000
block = ss.eye(N, format = 'coo')

# extract indices and values
out = [ block.data, block.row , block.col]
out[1] = out[1] + rank * N
out[2] = out[2] + rank * N

# gather all the arrays representing the submatrices
full_list = comm.gather(out,root=0)

if rank == 0:
    dat = concatenate([x[0] for x in full_list])
    row = concatenate([x[1] for x in full_list])
    col = concatenate([x[2] for x in full_list])
    final_result = ss.csr_matrix( ( dat, (row, col) ) )
    toc = timer()
    print 'Matrix assembled in', toc-tic, 'seconds'

The assembly is definitely much faster using coo matrices rather than dok.