Python - parallelize a python loop for 2D masked a

2019-09-06 10:17发布

问题:

Probably a commonplace question, but how can I parallelize this loop in Python?

for i in range(0,Nx.shape[2]):
  for j in range(0,Nx.shape[2]):
    NI=Nx[:,:,i]; NJ=Nx[:,:,j]
    Ku[i,j] = (NI[mask!=True]*NJ[mask!=True]).sum()

So my question: what's the easiest way to parallelize this code?

         ---------- EDIT LATER------------------

An example of data

import random
import numpy as np
import numpy.ma as ma
from numpy import unravel_index    

#my input
Nx = np.random.rand(5,5,5)  

#mask creation
mask_positions = zip(*np.where((Nx[:,:,0] < 0.4)))
mask_array_positions = np.asarray(mask_positions)
i, j = mask_array_positions.T
mask = np.zeros(Nx[:,:,0].shape, bool)
mask[i,j] = True

And i want to calculate Ku by parallelizing. My aim is to use the Ku array to solve a linear problem so i have to put the mask values apart (represent near the half of my array)

回答1:

I think you want to 'vectorize', to use numpy terminology, not parallelize in the multiprocess way.

Your calculation is essentially a dot (matrix) product. Apply the mask once to the whole array to get a 2d array, NIJ. Its shape will be (N,5), where N is the number of True values in ~mask. Then it's just a (5,N) array 'dotted' with a (N,5) - ie. sum over the N dimension, leaving you with a (5,5) array.

NIJ = Nx[~mask,:]
Ku = np.dot(NIJ.T,NIJ)

In quick tests it matches the Ku produced by your double loop. Depending on the underlying library used for np.dot there might be some multicore calculation, but that's usually not a priority issue for numpy users.


Applying the large boolean mask is the most time consuming part of these calculations - both the vectorized and iterative versions.

For a mask with 400,000 True values, compare these 2 indexing times:

In [195]: timeit (NI[:400,:1000],NJ[:400,:1000])
100000 loops, best of 3: 4.87 us per loop
In [196]: timeit (NI[mask],NJ[mask])
10 loops, best of 3: 98.8 ms per loop

Selecting the same number of items with basic (slice) indexing is several orders of magnitude faster than advanced indexing with the mask.

Substituting np.dot(NI[mask],NJ[mask]) for (NI[mask]*NJ[mask]).sum() only saves a few ms.



回答2:

I'd like to extend @hpaulj's excellent answer (great analysis of the problem too by the way) for large matrices.

The operation

Ku = np.dot(NIJ.T,NIJ)

can be replaced by

Ku = np.einsum('ij,ik->jk', NIJ, NIJ)

It should also be noted that np.dot could fall back to slower routines if numpy was not compiled to use BLAS.

For a test matrix NIJ of shape (1250711, 50), I got 54.9 s with the dot method, while the einsum does it in 1.67 s. On my system, numpy is compiled with BLAS support.

Remark: np.einsum does not always outperform np.dot, a situation that became apparent on my system when you would compare any of the following

Nx = np.random.rand(1400,1528,20).astype(np.float16)
Nx = np.random.rand(1400,1528,20).astype(np.float32)

(or even a dtype of np.float64).