Python: Running nested loop, 2D moving window, in

2020-03-04 08:25发布

问题:

I work with topographic data. For one particular problem, I have written a function in Python which uses a moving window of a particular size to zip through a matrix (grid of elevations). Then I have to perform an analysis on this window and set the cell at the center of the window a resulting value.

My final output is a matrix the same size as my original matrix which has been altered according to my analysis. This problem takes 11 hours to run on a small area, so I thought parallelizing the inner loop would accelerate things. Alternatively, there may be a clever vectorization solution too...

See my function below, DEM is a 2D numpy array, w is the size of the window.

def RMSH_det(DEM, w):
    import numpy as np
    from scipy import signal
    [nrows, ncols] = np.shape(DEM)

    #create an empty array to store result
    rms = DEM*np.nan

#    nw=(w*2)**2
#    x = np.arange(0,nw)

    for i in np.arange(w+1,nrows-w):


        for j in np.arange(w+1,ncols-w):

            d1 = np.int64(np.arange(i-w,i+w))
            d2 = np.int64(np.arange(j-w,j+w))

            win = DEM[d1[0]:d1[-1],d2[0]:d2[-1]]

            if np.max(np.isnan(win)) == 1:
                rms[i,j] = np.nan

            else:
                win = signal.detrend(win, type = 'linear')
                z = np.reshape(win,-1)
                nz = np.size(z)
                rootms = np.sqrt(1 / (nz - 1) * np.sum((z-np.mean(z))**2))
                rms[i,j] = rootms


    return(rms)

I have scoured SO/SE for solutions to my question and come across many examples of nested for loops and trying to run them in parallel. I've struggled to adapt my code to match the examples and would appreciate some help. A solution to this problem would help me work with several other moving window functions I have.

Thus far, I have moved the inner loop into it's own function, which can be called from within the outer loop:

def inLoop(i, w, DEM,rms,ncols):
        for j in np.arange(w+1,ncols-w):

            d1 = np.int64(np.arange(i-w,i+w))
            d2 = np.int64(np.arange(j-w,j+w))

            win = DEM[d1[0]:d1[-1],d2[0]:d2[-1]]

            if np.max(np.isnan(win)) == 1:
                rms[i,j] = np.nan

            else:
                win = signal.detrend(win, type = 'linear')
                z = np.reshape(win,-1)
                nz = np.size(z)
                rootms = np.sqrt(1 / (nz - 1) * np.sum((z-np.mean(z))**2))
                rms[i,j] = rootms


        return(rms)

But I wasn't sure of the correct way of coding the call to Pool with the necessary variables that need to be input into the inner loop. See the outerloop below:

 for i in np.arange(w+1,nrows-w):
        number_of_workers = 8

        with Pool(number_of_workers) as p:
            #call the pool
            p.starmap(inLoop, [i, w, DEM, rms, ncols])


Remaining questions:

  • Can this code even be optimized by parallelizing?

  • How do I successfully store the result of a parallelized nested-for loop?

回答1:

A solution using Numba

In some cases this is very easy to do, if all functions which you use are supported. In your code win = signal.detrend(win, type = 'linear') is the part you have to implement in Numba, because this function isn't supported.

Implementing detrend in Numba

If you look at the source-code of detrend, and extract the relevant parts for your problem, it may look like this:

@nb.njit()
def detrend(w):
    Npts=w.shape[0]
    A=np.empty((Npts,2),dtype=w.dtype)
    for i in range(Npts):
        A[i,0]=1.*(i+1) / Npts
        A[i,1]=1.

    coef, resids, rank, s = np.linalg.lstsq(A, w.T)
    out=w.T- np.dot(A, coef)
    return out.T

I also implemented a faster solution for np.max(np.isnan(win)) == 1

@nb.njit()
def isnan(win):
    for i in range(win.shape[0]):
        for j in range(win.shape[1]):
            if np.isnan(win[i,j]):
                return True
    return False

Main function

As I used Numba here, the parallelization is very simple, just a prange on the outer loop and

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def RMSH_det_nb(DEM, w):
    [nrows, ncols] = np.shape(DEM)

    #create an empty array to store result
    rms = DEM*np.nan

    for i in nb.prange(w+1,nrows-w):
        for j in range(w+1,ncols-w):
            win = DEM[i-w:i+w-1,j-w:j+w-1]

            if isnan(win):
                rms[i,j] = np.nan
            else:
                win = detrend(win)
                z = win.flatten()
                nz = z.size
                rootms = np.sqrt(1 / (nz - 1) * np.sum((z-np.mean(z))**2))
                rms[i,j] = rootms

    return rms

Timings (small example)

w = 10
DEM=np.random.rand(100, 100).astype(np.float32)

res1=RMSH_det(DEM, w)
res2=RMSH_det_nb(DEM, w)
print(np.allclose(res1,res2,equal_nan=True))
#True

%timeit res1=RMSH_det(DEM, w)
#1.59 s ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res2=RMSH_det_nb(DEM, w) #approx. 55 times faster
#29 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timings for larger arrays

w = 10
DEM=np.random.rand(1355, 1165).astype(np.float32)
%timeit res2=RMSH_det_nb(DEM, w)
#6.63 s ± 21.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

[Edit] Implemenation using normal Equations

Overdetermined system

This method has a lower numerical precision. Although this solution is quite a lot faster.

@nb.njit()
def isnan(win):
    for i in range(win.shape[0]):
        for j in range(win.shape[1]):
            if win[i,j]==np.nan:
                return True
    return False

@nb.njit()
def detrend(w):
    Npts=w.shape[0]
    A=np.empty((Npts,2),dtype=w.dtype)
    for i in range(Npts):
        A[i,0]=1.*(i+1) / Npts
        A[i,1]=1.

    coef, resids, rank, s = np.linalg.lstsq(A, w.T)
    out=w.T- np.dot(A, coef)
    return out.T

@nb.njit()
def detrend_2(w,T1,A):
    T2=np.dot(A.T,w.T)
    coef=np.linalg.solve(T1,T2)

    out=w.T- np.dot(A, coef)

    return out.T

@nb.njit(parallel=True)
def RMSH_det_nb_normal_eq(DEM,w):
    [nrows, ncols] = np.shape(DEM)

    #create an empty array to store result
    rms = DEM*np.nan

    Npts=w*2-1
    A=np.empty((Npts,2),dtype=DEM.dtype)
    for i in range(Npts):
        A[i,0]=1.*(i+1) / Npts
        A[i,1]=1.

    T1=np.dot(A.T,A)

    nz = Npts**2
    for i in nb.prange(w+1,nrows-w):
        for j in range(w+1,ncols-w):
            win = DEM[i-w:i+w-1,j-w:j+w-1]

            if isnan(win):
                rms[i,j] = np.nan
            else:
                win = detrend_2(win,T1,A)
                rootms = np.sqrt(1 / (nz - 1) * np.sum((win-np.mean(win))**2))
                rms[i,j] = rootms

    return rms

Timings

w = 10
DEM=np.random.rand(100, 100).astype(np.float32)

res1=RMSH_det(DEM, w)
res2=RMSH_det_nb(DEM, w)
print(np.allclose(res1,res2,equal_nan=True))
#True

%timeit res1=RMSH_det(DEM, w)
#1.59 s ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res2=RMSH_det_nb_normal_eq(DEM,w)
#7.97 ms ± 89.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Optimized solution using normal equations

Temporary arrays are reused to avoid costly memory allocations and a custom implementation for matrix multiplication is used. This is only recommendable for very small matrices, in most other cases np.dot (sgeemm) will be a lot faster.

@nb.njit()
def matmult_2(A,B,out):
    for j in range(B.shape[1]):
        acc1=nb.float32(0)
        acc2=nb.float32(0)
        for k in range(B.shape[0]):
            acc1+=A[0,k]*B[k,j]
            acc2+=A[1,k]*B[k,j]
        out[0,j]=acc1
        out[1,j]=acc2
    return out

@nb.njit(fastmath=True)
def matmult_mod(A,B,w,out):
    for j in range(B.shape[1]):
        for i in range(A.shape[0]):
            acc=nb.float32(0)
            acc+=A[i,0]*B[0,j]+A[i,1]*B[1,j]
            out[j,i]=acc-w[j,i]
    return out

@nb.njit()
def detrend_2_opt(w,T1,A,Tempvar_1,Tempvar_2):
    T2=matmult_2(A.T,w.T,Tempvar_1)
    coef=np.linalg.solve(T1,T2)
    return matmult_mod(A, coef,w,Tempvar_2)

@nb.njit(parallel=True)
def RMSH_det_nb_normal_eq_opt(DEM,w):
    [nrows, ncols] = np.shape(DEM)

    #create an empty array to store result
    rms = DEM*np.nan

    Npts=w*2-1
    A=np.empty((Npts,2),dtype=DEM.dtype)
    for i in range(Npts):
        A[i,0]=1.*(i+1) / Npts
        A[i,1]=1.

    T1=np.dot(A.T,A)

    nz = Npts**2
    for i in nb.prange(w+1,nrows-w):
        Tempvar_1=np.empty((2,Npts),dtype=DEM.dtype)
        Tempvar_2=np.empty((Npts,Npts),dtype=DEM.dtype)
        for j in range(w+1,ncols-w):
            win = DEM[i-w:i+w-1,j-w:j+w-1]

            if isnan(win):
                rms[i,j] = np.nan
            else:
                win = detrend_2_opt(win,T1,A,Tempvar_1,Tempvar_2)
                rootms = np.sqrt(1 / (nz - 1) * np.sum((win-np.mean(win))**2))
                rms[i,j] = rootms

    return rms

Timings

w = 10
DEM=np.random.rand(100, 100).astype(np.float32)

res1=RMSH_det(DEM, w)
res2=RMSH_det_nb_normal_eq_opt(DEM, w)
print(np.allclose(res1,res2,equal_nan=True))
#True

%timeit res1=RMSH_det(DEM, w)
#1.59 s ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res2=RMSH_det_nb_normal_eq_opt(DEM,w)
#4.66 ms ± 87.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Timings for isnan

This function is a completely other implementation. It is much faster if a NaN is quite at the beginning of the array, but anyway even if not there is some speedup. I benchmarked it with small arrays (approx. window size) and a large size suggested by @user3666197.

case_1=np.full((20,20),np.nan)
case_2=np.full((20,20),0.)
case_2[10,10]=np.nan
case_3=np.full((20,20),0.)

case_4 = np.full( ( int( 1E4 ), int( 1E4 ) ),np.nan)
case_5 = np.ones( ( int( 1E4 ), int( 1E4 ) ) )

%timeit np.any(np.isnan(case_1))
%timeit np.any(np.isnan(case_2))
%timeit np.any(np.isnan(case_3))
%timeit np.any(np.isnan(case_4))
%timeit np.any(np.isnan(case_5))
#2.75 µs ± 73.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#2.75 µs ± 46.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#2.76 µs ± 32.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#81.3 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#86.7 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit isnan(case_1)
%timeit isnan(case_2)
%timeit isnan(case_3)
%timeit isnan(case_4)
%timeit isnan(case_5)
#244 ns ± 5.02 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#357 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#475 ns ± 9.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#235 ns ± 0.933 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#58.8 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


回答2:

Q : This problem takes 11 hours to run on a small area,... stay tuned, we can and we'll get under 20 [min] !!

given due explanations were provided, for which I thank the O/P author:

# DEM.shape = [nrows, ncols] = [ 1355, 1165 ]
# DEM.dtype = float32 
#    .flags = C_CONTIGUOUS    : True
#             F_CONTIGUOUS    : False
#             OWNDATA         : True
#             WRITEABLE       : True
#             ALIGNED         : True
#             WRITEBACKIFCOPY : False
#             UPDATEIFCOPY    : False

I tried to review the code and setup a mock-up of a bit more efficient code, before moving into putting all popular and ready-to-use numpy + numba steroids in, and the interim numpy-only result works
on a sample of [100,100] DEM-grid for about ~ 6 [s] at the said kernel window width w = 10

The same, for [200,200] DEM-grid, takes under ~ 36 [s] - obviously, the scaling is ~ O( N^2 )

The same, for [1000,1000] DEM-grid, took under ~ 1077 [s] ~ 17.6 [min] wow !

A field .jit trial on [1000,1000] DEM-grid is currently under the test and will update the post once finished + once the numba.jit() code will enjoy to run the further accelerated results


So far, quite promising, isn't it?

If you @morrismc test your as-is code now, on a [100,100]-matrix, we can already guess the achieved range of the principal speedup, even before running tests are completed.

>>> pass;    import numpy as np
>>> from zmq import Stopwatch; clk = Stopwatch()
>>>
>>> size =  100; demF32 = np.random.random( ( size, size ) ).astype( np.float32 ); resF32 = demF32.copy(); clk.start(); _ = RMSH_det( demF32, 10, resF32 ); t = clk.stop(); print( "{1:>13d} [us]\nNumOf_np.nan-s was {0:d}".format( _, t ) )
      6492192 [us]
NumOf_np.nan-s was 0

>>> size =  200; demF32 = np.random.random( ( size, size ) ).astype( np.float32 ); resF32 = demF32.copy(); clk.start(); _ = RMSH_det( demF32, 10, resF32 ); t = clk.stop(); print( "{1:>13d} [us]\nNumOf_np.nan-s was {0:d}".format( _, t ) )
     35650629 [us]
NumOf_np.nan-s was 0

>>> size = 1000; demF32 = np.random.random( ( size, size ) ).astype( np.float32 ); resF32 = demF32.copy(); clk.start(); _ = RMSH_det( demF32, 10, resF32 ); t = clk.stop(); print( "{1:>13d} [us]\nNumOf_np.nan-s was {0:d}".format( _, t ) )
   1058702889 [us]
NumOf_np.nan-s was 0

All this on scipy 1.2.1, thus without the benefits from 1.3.1 possible further speedups


A numba.jit() LLVM-compiled code. Ooops, slower ?

numba.jit()-acceleration has shown about 200 [ms] worse runtime on [100,100] DEM-grid, with signature having been specified ( so no ad-hoc analysis costs were accrued here ) and nogil = True ( '0.43.1+0.g8dabe7abe.dirty' not being the most recent, yet )

Guess there is nothing more to gain here, without moving the game into compiled Cython territories, yet having about tens of minutes instead of tens of hours, the Alea Iacta Est - just the numpy smart-vectorised code rulez!


EPILOGUE :

If the original algorithm was correct (and some doubts were left in the source-code for any further improvement work), any attempt to run some other form of [PARALLEL] code-execution-flow will not help here ( kernel-windows[w,w] are very-small and non-contiguous areas of the DEM-grid memory-layout, memory-I/O costs are dominant part of the run-time budget here, and some nicer indexing may improve the cache-line re-use, yet the overall efforts are well beyond the budget, as the target of going down from ~ 11 [hrs] to about ~ 6 [hrs] was more than successfully met with about ~ 20 [min] runtimes achievable for [1300,1100] float32 DEM-grids

The code was left as-is ( non-PEP-8 ), because of all the add-on didactic value for the [DOC.me], [TEST.me] and [PERF.me] phases of QA, so all kind PEP-isto-evangelisators do bear with the O/P author's view onto a full-scree-width layout left, so as to allow to understand WHY and to improve the code, which with stripped-off-comments would lose her/his way forwards in improving the code performance further on. Thx.

@jit( [ "int32( float32[:,:], int32, float32[:,:] )", ], nogil    = True )                  # numba.__version__ '0.43.1+0.g8dabe7abe.dirty'
def RMSH_det_jit( DEMf32, w, rmsRESULTf32 ):                            # pre-allocate rmsRESULTf32[:,:] externally
    #import numpy as np
    #from scipy import signal
    #
    # [nrows, ncols] = np.shape( DEM )                                  # avoid ~ [ 1355, 1165 ]
    #                                                                   # DEM.dtype = float32 
    #                                                                   #    .flags = C_CONTIGUOUS    : True
    #                                                                   #             F_CONTIGUOUS    : False
    #                                                                   #             OWNDATA         : True
    #                                                                   #             WRITEABLE       : True
    #                                                                   #             ALIGNED         : True
    #                                                                   #             WRITEBACKIFCOPY : False
    #                                                                   #             UPDATEIFCOPY    : False
    #
    rmsRESULTf32[:,:] = np.nan                                          #        .STO[:,:] np.nan-s, using in-place assignment into the by-ref passed, externally pre-allocated np.ndarray
    dtdWIN            = np.ones( ( 2 * w - 1,                           #        .ALLOC once, re-use 1M+ times
                                   2 * w - 1 ) )
    a_div_by_nz_minus1 = 1. / ( dtdWIN.size - 1  )                      #        .SET float CONST with about a ~1M+ re-use
    a_num_of_NaNs      = 0                                              #        .SET i4 bonus value, ret'd as a side-effect of the signature ... 
    # rms = DEM*np.nan                                                  # avoid ( pre-alloc rmsRESULTf32 ) externally create and pass a right-sized, empty array to store all results
    # nw  = ( w * 2 )**2
    # x   = np.arange( 0, nw )

    #                        11..1344
    #or     i in np.arange( w+1,           nrows-w ):                   # w ~ 10 -> [11:1344, 11:1154]
    for     i in np.arange( w+1, DEMf32.shape[0]-w ):                   #         ??? never touches DEM-row/column[0]?? or off-by-one indexing error ???
        fromI = i - w                                                   #        .UPD ALAP
        tillI = i + w - 1                                               #        .UPD ALAP upper bound index excluded ( this is how a code in [ np.arange(...)[0]:np.arange(...)[-1] ] works )
        #                    11..1154
        #or j in np.arange( w+1,           ncols-w ):
        for j in np.arange( w+1, DEMf32.shape[1]-w ):
            fromJ = j - w                                               #        .UPD ALAP
            tillJ = j + w - 1                                           #        .UPD ALAP upper bound index excluded ( this is how a code in [ np.arange(...)[0]:np.arange(...)[-1] ] works )
            #                       1..1334:21..1354                    #         ??? never touches first/last DEM-row/column??
            # d1 = np.int64( np.arange( i-w, i+w ) )                    # AVOID: 1M+ times allocated, yet never consumed, but their edge values
            # d2 = np.int64( np.arange( j-w, j+w ) )                    # AVOID: 1M+ times allocated, yet never consumed, but their edge values

            # win = DEM[ d1[0]:d1[-1],                                  # AVOID: while a .view-only, no need to 1M+ times instantiate a "kernel"-win(dow] ( this will create a np.view into the original DEM, not a copy ! )
            #            d2[0]:d2[-1]                                   # ?.or.?   NOT a .view-only, but a new .copy() instantiated, so as to call .detrend() w/o in-place modifying DEMf32 ???
            #            ]                                              # ?.or.?   NOT a .view-only, but a new .copy() instantiated, so as to call .detrend() w/o in-place modifying DEMf32 ???
            dtdWIN[:,:] = DEMf32[fromI:tillI, fromJ:tillJ]              #          NOT a .view-only, but a     .copy() re-populated into a just once and only once pre-allocated dtdWIN, via an in-place copy
            #f np.max( np.isnan(    win ) ) == 1:                       # AVOID: 1M+ times full-range scan, while any first np.nan decides the game and no need to scan "the rest"
            if np.any( np.isnan( dtdWIN ) ):                            #        "density" of np.nan-s determine, if this is a good idea to pre-store
               a_num_of_NaNs += 1                                       # .INC
               continue                                                 #        .NOP/LOOP from here, already pre-stored np.nan-s for this case
               # rms[i,j] = np.nan                                      # DUP ( already stored in initialisation ... )
            else:
               #in    = signal.detrend(    win, type = 'linear' )       # REALLY?: in-place modification of DEM-matrix ???
               dtdWIN = signal.detrend( dtdWIN, type = 'linear'   )     #    in scipy-v1.3.1+ can mod in-place,   overwrite_data = True ) # REMOVE OLS-fit-linear trend
               dtdWIN = signal.detrend( dtdWIN, type = 'constant' )     #    in scipy-v1.3.1+ can mod in-place,   overwrite_data = True ) # REMOVE mean
               #z  = np.reshape( win, -1 )                              # AVOID:~1M+ re-counting constant value, known from w directly
               #nz = np.size( z )                                       # AVOID:~1M+ re-counting constant value, known from w directly
               #rootms    = np.sqrt( 1 / ( nz - 1 ) * np.sum( ( z - np.mean( z ) )**2 ) )
               #rms[i,j]  = rootms
               rmsRESULTf32[i,j] = np.sqrt( a_div_by_nz_minus1          # .STO a "scaled"
                                          * np.dot(   dtdWIN,
                                                      dtdWIN.T
                                                      ).sum()
                                          # np.sum( ( dtdWIN            #         SUM of
                                          #       # - dtdWIN.mean()     #               mean-removed ( ALREADY done via scipy.signal.detrend( 'const' ) above )
                                          #           )**2              #               SQUARES
                                          #         )
                                            )                           #      ROOT
    return( a_num_of_NaNs )                                             # ret i4