I work with topographic data. For one particular problem, I have written a function in Python which uses a moving window of a particular size to zip through a matrix (grid of elevations). Then I have to perform an analysis on this window and set the cell at the center of the window to the resulting value.
My final output is a matrix the same size as my original matrix which has been altered according to my analysis. This problem takes 11 hours to run on a small area, so I thought parallelizing the inner loop would accelerate things. Alternatively, there may be a clever vectorization solution too...
See my function below; DEM is a 2D numpy array and w is the size of the window.
def RMSH_det(DEM, w):
    import numpy as np
    from scipy import signal

    [nrows, ncols] = np.shape(DEM)

    # create an empty array to store result
    rms = DEM * np.nan

    # nw = (w*2)**2
    # x = np.arange(0, nw)

    for i in np.arange(w+1, nrows-w):
        for j in np.arange(w+1, ncols-w):
            d1 = np.int64(np.arange(i-w, i+w))
            d2 = np.int64(np.arange(j-w, j+w))

            win = DEM[d1[0]:d1[-1], d2[0]:d2[-1]]

            if np.max(np.isnan(win)) == 1:
                rms[i, j] = np.nan
            else:
                win = signal.detrend(win, type='linear')
                z = np.reshape(win, -1)
                nz = np.size(z)
                rootms = np.sqrt(1 / (nz - 1) * np.sum((z - np.mean(z))**2))
                rms[i, j] = rootms

    return rms
I have scoured SO/SE for solutions and come across many examples of nested for loops and attempts to run them in parallel. I've struggled to adapt my code to match the examples and would appreciate some help. A solution to this problem would help me work with several other moving-window functions I have.
Thus far, I have moved the inner loop into its own function, which can be called from within the outer loop:
def inLoop(i, w, DEM, rms, ncols):
    import numpy as np
    from scipy import signal

    for j in np.arange(w+1, ncols-w):
        d1 = np.int64(np.arange(i-w, i+w))
        d2 = np.int64(np.arange(j-w, j+w))

        win = DEM[d1[0]:d1[-1], d2[0]:d2[-1]]

        if np.max(np.isnan(win)) == 1:
            rms[i, j] = np.nan
        else:
            win = signal.detrend(win, type='linear')
            z = np.reshape(win, -1)
            nz = np.size(z)
            rootms = np.sqrt(1 / (nz - 1) * np.sum((z - np.mean(z))**2))
            rms[i, j] = rootms

    return rms
But I wasn't sure of the correct way to code the call to Pool with the necessary variables that need to be passed into the inner loop. See the outer loop below:
for i in np.arange(w+1, nrows-w):
    number_of_workers = 8
    with Pool(number_of_workers) as p:
        # call the pool
        p.starmap(inLoop, [i, w, DEM, rms, ncols])
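For reference, a minimal sketch of the usual Pool.starmap calling convention (run_parallel is a hypothetical wrapper of my own, not a vetted solution): starmap expects an iterable of argument tuples, one tuple per call, and since every worker process receives its own copy of rms, the filled rows have to be collected from the returned arrays rather than from a shared array mutated in place:

import numpy as np
from multiprocessing import Pool

def run_parallel(DEM, w, number_of_workers=8):
    nrows, ncols = DEM.shape
    rows = list(np.arange(w+1, nrows-w))
    # blank result template; each task gets its own pickled copies of DEM and this array
    nan_template = DEM * np.nan
    args = [(i, w, DEM, nan_template, ncols) for i in rows]
    with Pool(number_of_workers) as p:
        # one call of inLoop per tuple: inLoop(i, w, DEM, rms, ncols)
        results = p.starmap(inLoop, args)
    # stitch the per-row results back into a single output matrix
    rms = DEM * np.nan
    for i, partial in zip(rows, results):
        rms[i, :] = partial[i, :]
    return rms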
Remaining questions:
Can this code even be optimized by parallelizing?
How do I successfully store the result of a parallelized nested-for loop?
Given that due explanations were provided, for which I thank the O/P author:

I tried to review the code and set up a mock-up of a bit more efficient code, before moving on to putting all the popular and ready-to-use numpy + numba steroids in, and the interim numpy-only result works on a sample [100,100] DEM-grid in about ~ 6 [s] at the said kernel window width w = 10.

The same, for a [200,200] DEM-grid, takes under ~ 36 [s] - obviously, the scaling is ~ O( N^2 ).

The same, for a [1000,1000] DEM-grid, took under ~ 1077 [s] ~ 17.6 [min] - wow!

A field .jit trial on the [1000,1000] DEM-grid is currently under test; I will update the post once it is finished and once the numba.jit() code gets to run the further-accelerated results. So far, quite promising, isn't it?
If you @morrismc test your as-is code now on a [100,100] matrix, we can already guess the achieved range of the principal speedup, even before the running tests are completed.

All this on scipy 1.2.1, thus without the benefits of the possible further speedups in 1.3.1.

A numba.jit() LLVM-compiled code. Ooops, slower?

numba.jit()-acceleration has shown about a 200 [ms] worse runtime on the [100,100] DEM-grid, with the signature having been specified (so no ad-hoc analysis costs were accrued here) and nogil = True ('0.43.1+0.g8dabe7abe.dirty' not being the most recent version, yet).

Guess there is nothing more to gain here without moving the game into compiled Cython territories, yet having about tens of minutes instead of tens of hours, the Alea Iacta Est - just the numpy smart-vectorised code rulez!

EPILOGUE:
If the original algorithm was correct (and some doubts were left in the source code for any further improvement work), any attempt to run some other form of [PARALLEL] code-execution-flow will not help here: the kernel windows [w,w] are very small and non-contiguous areas of the DEM-grid memory layout, so memory-I/O costs are the dominant part of the run-time budget, and while some nicer indexing may improve cache-line re-use, the overall effort would be well beyond the budget, as the target of going down from ~ 11 [hrs] to about ~ 6 [hrs] was more than successfully met, with about ~ 20 [min] runtimes achievable for [1300,1100] float32 DEM-grids.

The code was left as-is (non-PEP-8) because of all the add-on didactic value for the [DOC.me], [TEST.me] and [PERF.me] phases of QA, so all kind PEP-isto evangelisators will have to bear with the O/P author's view on the full-screen-width layout, left so as to allow one to understand WHY and to improve the code, which with its comments stripped off would leave one without a way forward in improving the code performance further on. Thx.

A solution using Numba
In some cases this is very easy to do, if all the functions you use are supported. In your code,

win = signal.detrend(win, type='linear')

is the part you have to implement in Numba, because this function isn't supported.

Implementing detrend in Numba
If you look at the source-code of detrend, and extract the relevant parts for your problem, it may look like this:
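A minimal sketch of that extraction, assuming a floating-point DEM (this is my reconstruction of the idea, not necessarily the answer author's exact code): the detrending axis is moved to the front, the two-column design matrix [x, 1] is built exactly as in scipy's source, and np.linalg.lstsq, which Numba supports in nopython mode, does the fit.

import numpy as np
import numba as nb

@nb.njit()
def detrend_linear(win):
    # scipy puts the axis to be detrended first; for axis=-1 on a 2D
    # window that is simply the transpose, copied into a float64 buffer
    nr, nc = win.shape
    data = np.empty((nc, nr), dtype=np.float64)
    for i in range(nr):
        for j in range(nc):
            data[j, i] = win[i, j]
    # straight-line design matrix [x, 1], as built inside scipy.signal.detrend
    A = np.empty((nc, 2), dtype=np.float64)
    for k in range(nc):
        A[k, 0] = (k + 1) / nc
        A[k, 1] = 1.0
    # least-squares fit of every row of the window at once
    coef, residuals, rank, s = np.linalg.lstsq(A, data)
    # remove the fitted trend and restore the original orientation
    return (data - A @ coef).T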
I also implemented a faster solution for np.max(np.isnan(win)) == 1.
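A sketch of the kind of early-exit check meant here (contains_nan is a hypothetical name): unlike np.max(np.isnan(win)), it allocates no temporary boolean array and stops at the first NaN it finds.

import numpy as np
import numba as nb

@nb.njit()
def contains_nan(win):
    # return True as soon as any NaN is seen
    for i in range(win.shape[0]):
        for j in range(win.shape[1]):
            if np.isnan(win[i, j]):
                return True
    return False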
Main function
As I used Numba here, the parallelization is very simple: just a prange on the outer loop and parallel=True in the njit decorator.
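A sketch of how the whole pass could then look (again my reconstruction: RMSH_det_nb is a hypothetical name, detrend_linear and contains_nan are the helpers sketched above, the loop bounds and window slice reproduce the question's code including its apparent off-by-one, and a floating-point DEM is assumed):

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def RMSH_det_nb(DEM, w):
    nrows, ncols = DEM.shape
    rms = np.full((nrows, ncols), np.nan)
    # nb.prange lets Numba distribute the outer loop over all cores;
    # every thread writes to disjoint rows of rms, so no locking is needed
    for i in nb.prange(w + 1, nrows - w):
        for j in range(w + 1, ncols - w):
            # same window as DEM[d1[0]:d1[-1], d2[0]:d2[-1]] in the question
            win = DEM[i - w:i + w - 1, j - w:j + w - 1]
            if contains_nan(win):
                rms[i, j] = np.nan
            else:
                z = detrend_linear(win)
                nz = z.size
                rms[i, j] = np.sqrt(np.sum((z - np.mean(z)) ** 2) / (nz - 1))
    return rms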
Timings (small example)
Timings for larger arrays
[Edit] Implementation using normal equations
Overdetermined system
This method has lower numerical precision, although it is quite a lot faster.
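A sketch of the normal-equations route described above (my reconstruction; compared with the lstsq version sketched earlier, only the solver changes):

import numpy as np
import numba as nb

@nb.njit()
def detrend_linear_ne(win):
    nr, nc = win.shape
    data = np.empty((nc, nr), dtype=np.float64)   # detrending axis first, as before
    for i in range(nr):
        for j in range(nc):
            data[j, i] = win[i, j]
    A = np.empty((nc, 2), dtype=np.float64)       # design matrix [x, 1]
    for k in range(nc):
        A[k, 0] = (k + 1) / nc
        A[k, 1] = 1.0
    # normal equations: solve the tiny 2x2 system (A.T A) coef = A.T data
    # directly instead of calling lstsq - faster, but numerically less robust
    AT = np.ascontiguousarray(A.T)
    coef = np.linalg.solve(AT @ A, AT @ data)
    return (data - A @ coef).T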
Timings
Optimized solution using normal equations
Temporary arrays are reused to avoid costly memory allocations, and a custom implementation of matrix multiplication is used. This is only advisable for very small matrices; in most other cases np.dot (sgemm) will be a lot faster.
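A sketch of the custom matrix multiplication part of that idea (matmul_small is a hypothetical name): plain loops, which for the tiny per-window products can beat the fixed overhead of dispatching to BLAS, with out supplied by the caller so that it can be allocated once and overwritten for every window.

import numba as nb

@nb.njit()
def matmul_small(a, b, out):
    # out = a @ b, written out explicitly; only sensible for very small shapes
    for i in range(a.shape[0]):
        for j in range(b.shape[1]):
            acc = 0.0
            for k in range(a.shape[1]):
                acc += a[i, k] * b[k, j]
            out[i, j] = acc
    return out

The buffer-reuse part is then a matter of allocating out and the other per-window temporaries once, outside the window loops, and overwriting them for every window.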
Timings
Timings for isnan
This function is a completely different implementation. It is much faster if a NaN occurs near the beginning of the array, but even if it does not, there is still some speedup. I benchmarked it with small arrays (approximately the window size) and a large size suggested by @user3666197.