Probably a commonplace question, but how can I parallelize this loop in Python?
```python
Ku = np.zeros((Nx.shape[2], Nx.shape[2]))   # result accumulator
for i in range(Nx.shape[2]):
    for j in range(Nx.shape[2]):
        NI = Nx[:, :, i]
        NJ = Nx[:, :, j]
        # sum of products over the unmasked positions only
        # (~mask is the idiomatic form of mask != True)
        Ku[i, j] = (NI[~mask] * NJ[~mask]).sum()
```
So my question: what's the easiest way to parallelize this code?
---------- EDIT LATER ----------
An example of the data:
```python
import numpy as np

# my input
Nx = np.random.rand(5, 5, 5)

# mask creation: True where the first slice is below the threshold
# (list() is needed in Python 3, where zip returns an iterator)
mask_positions = list(zip(*np.where(Nx[:, :, 0] < 0.4)))
mask_array_positions = np.asarray(mask_positions)
i, j = mask_array_positions.T
mask = np.zeros(Nx[:, :, 0].shape, bool)
mask[i, j] = True
```
And I want to calculate `Ku` in parallel. My aim is to use the `Ku` array to solve a linear problem, so I have to set the masked values apart (they represent nearly half of my array).
I think you want to 'vectorize', to use `numpy` terminology, not parallelize in the multiprocess way.

Your calculation is essentially a dot (matrix) product. Apply the `mask` once to the whole array to get a 2d array, `NIJ`. Its shape will be `(N, 5)`, where `N` is the number of `True` values in `~mask`. Then it's just a `(5, N)` array 'dotted' with an `(N, 5)` one - i.e. a sum over the `N` dimension, leaving you with a `(5, 5)` array.
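A minimal sketch of that vectorized version, using the `Nx` and `mask` from the question:

```python
NIJ = Nx[~mask]        # boolean-index the first two axes at once -> shape (N, 5)
Ku = NIJ.T.dot(NIJ)    # (5, N) dot (N, 5) -> (5, 5), summing over N
```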
In quick tests it matches the `Ku` produced by your double loop. Depending on the underlying library used for `np.dot` there might be some multicore calculation, but that's usually not a priority issue for `numpy` users.

Applying the large boolean `mask` is the most time-consuming part of these calculations - in both the vectorized and iterative versions.

For a `mask` with 400,000 `True` values, compare these two indexing times; a sketch of such a comparison follows.
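The array sizes below are assumptions chosen for illustration:

```python
import timeit
import numpy as np

A = np.random.rand(800_000, 5)
mask1d = np.random.rand(800_000) < 0.5        # roughly 400,000 True values

# advanced (boolean) indexing: copies the selected rows
t_bool = timeit.timeit(lambda: A[mask1d], number=10)
# basic (slice) indexing: returns a view, no copy
t_slice = timeit.timeit(lambda: A[:400_000], number=10)

print(f'boolean: {t_bool:.4f} s   slice: {t_slice:.6f} s')
```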
Selecting the same number of items with basic (slice) indexing is several orders of magnitude faster than advanced indexing with the `mask`.

Substituting `np.dot(NI[mask], NJ[mask])` for `(NI[mask]*NJ[mask]).sum()` only saves a few ms.

I'd like to extend @hpaulj's excellent answer (great analysis of the problem too, by the way) for large matrices.
The operation - i.e. the `np.dot` step from the answer above -
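```python
# a sketch of the dot method (assumed from the discussion above):
# for NIJ of shape (N, 5), NIJ.T is (5, N), so the result is (5, 5)
Ku = NIJ.T.dot(NIJ)
```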
can be replaced by an equivalent `np.einsum` call along these lines:
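```python
# einsum equivalent (subscripts reconstructed from the shapes discussed
# below): sum over the long first axis i
Ku = np.einsum('ij,ik->jk', NIJ, NIJ)
```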
It should also be noted that `np.dot` can fall back to slower routines if numpy was not compiled to use BLAS; whether a given build links against BLAS can be checked as sketched below.
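```python
import numpy as np

# Prints the build configuration, including the BLAS/LAPACK
# libraries numpy was linked against.
np.show_config()
```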
For a test matrix `NIJ` of shape `(1250711, 50)`, I got 54.9 s with the `dot` method, while `einsum` does it in 1.67 s. On my system, numpy is compiled with BLAS support.

Remark: `np.einsum` does not always outperform `np.dot`, a situation that became apparent on my system when comparing plain matrix products (or even with a dtype of `np.float64`).
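A sketch of one such comparison; the shapes and dtype here are illustrative assumptions rather than the original snippets:

```python
import numpy as np

A = np.random.rand(1000, 1000).astype(np.float32)
B = np.random.rand(1000, 1000).astype(np.float32)

C_dot = A.dot(B)                       # plain matrix product, dispatched to BLAS
C_ein = np.einsum('ij,jk->ik', A, B)   # the same product via einsum

# Both give the same result up to float rounding; for a dense square
# product like this, the BLAS-backed dot is typically the faster route.
assert np.allclose(C_dot, C_ein, rtol=1e-3)
```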