Probably a commonplace question, but how can I parallelize this loop in Python?
Ku = np.zeros((Nx.shape[2], Nx.shape[2]))
for i in range(Nx.shape[2]):
    for j in range(Nx.shape[2]):
        NI = Nx[:, :, i]
        NJ = Nx[:, :, j]
        # sum the products over the unmasked positions only
        Ku[i, j] = (NI[~mask] * NJ[~mask]).sum()
So my question: what's the easiest way to parallelize this code?
---------- EDIT LATER ----------
An example of data
import numpy as np

# my input
Nx = np.random.rand(5, 5, 5)

# mask creation: True where the first slice is below 0.4
mask_positions = list(zip(*np.where(Nx[:, :, 0] < 0.4)))  # list() needed in Python 3
mask_array_positions = np.asarray(mask_positions)
i, j = mask_array_positions.T
mask = np.zeros(Nx[:, :, 0].shape, bool)
mask[i, j] = True
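(As an aside, the same mask can be built in one step, since an array comparison already returns a boolean array:)

mask = Nx[:, :, 0] < 0.4   # equivalent one-step construction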
And I want to calculate Ku in parallel. My aim is to use the Ku array to solve a linear problem, so I have to leave the masked values out (they represent nearly half of my array).
I think you want to 'vectorize', to use numpy terminology, not parallelize in the multiprocess sense. Your calculation is essentially a dot (matrix) product. Apply the mask once to the whole array to get a 2d array, NIJ. Its shape will be (N,5), where N is the number of True values in ~mask. Then it's just a (5,N) array 'dotted' with a (N,5) one - i.e. a sum over the N dimension, leaving you with a (5,5) array.
NIJ = Nx[~mask,:]
Ku = np.dot(NIJ.T,NIJ)
In quick tests it matches the Ku produced by your double loop. Depending on the underlying library used for np.dot, there may be some multicore calculation, but that's usually not a priority issue for numpy users.
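A minimal sketch of such a check, reusing Nx, mask, NIJ and Ku from above (Ku_loop is just an illustrative name):

# recompute the double-loop result and compare it with the dot product
n = Nx.shape[2]
Ku_loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        Ku_loop[i, j] = (Nx[~mask, i] * Nx[~mask, j]).sum()
print(np.allclose(Ku_loop, Ku))   # -> True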
Applying the large boolean mask is the most time consuming part of these calculations - in both the vectorized and iterative versions. For a mask with 400,000 True values, compare these two indexing times:
In [195]: timeit (NI[:400,:1000],NJ[:400,:1000])
100000 loops, best of 3: 4.87 us per loop
In [196]: timeit (NI[mask],NJ[mask])
10 loops, best of 3: 98.8 ms per loop
Selecting the same number of items with basic (slice) indexing is several orders of magnitude faster than advanced indexing with the mask. Substituting np.dot(NI[mask], NJ[mask]) for (NI[mask]*NJ[mask]).sum() only saves a few ms.
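(For reference, those two expressions compute the same scalar - a minimal sketch, assuming NI, NJ and mask as in the timings above:)

a, b = NI[mask], NJ[mask]   # 1-d arrays of the selected elements
s1 = (a * b).sum()          # elementwise multiply, then sum
s2 = np.dot(a, b)           # the same reduction in one call
assert np.allclose(s1, s2)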
I'd like to extend @hpaulj's excellent answer (great analysis of the problem too, by the way) to large matrices.
The operation
Ku = np.dot(NIJ.T,NIJ)
can be replaced by
Ku = np.einsum('ij,ik->jk', NIJ, NIJ)
Here 'ij,ik->jk' sums over the shared first axis i, producing the same result as NIJ.T dotted with NIJ. It should also be noted that np.dot can fall back to slower routines if numpy is not compiled against BLAS.
For a test matrix NIJ of shape (1250711, 50), I got 54.9 s with the dot method, while einsum does it in 1.67 s. On my system, numpy is compiled with BLAS support.
Remark: np.einsum does not always outperform np.dot, a situation that became apparent on my system when comparing either of the following:

Nx = np.random.rand(1400,1528,20).astype(np.float16)
Nx = np.random.rand(1400,1528,20).astype(np.float32)

(or even a dtype of np.float64).
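If you want to see where the crossover lies on your own machine, here is a minimal timing sketch (the shapes mirror the example above; all names are illustrative):

import numpy as np
from timeit import timeit

for dtype in (np.float16, np.float32, np.float64):
    Nx = np.random.rand(1400, 1528, 20).astype(dtype)
    mask = Nx[:, :, 0] < 0.4             # same style of mask as in the question
    NIJ = Nx[~mask, :]
    t_dot = timeit(lambda: np.dot(NIJ.T, NIJ), number=10)
    t_ein = timeit(lambda: np.einsum('ij,ik->jk', NIJ, NIJ), number=10)
    print(dtype.__name__, 'dot: %.2fs' % t_dot, 'einsum: %.2fs' % t_ein)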