I'm working on calculating convolutions (cross-correlations) of 3D images. Due to the nature of the problem, FFT-based approximations of convolution (e.g. scipy's fftconvolve) are not desired, and the "direct sum" is the way to go. The images are ~(150, 150, 150) in size, and the largest kernels are ~(40, 40, 40). The images are periodic (they have periodic boundary conditions, i.e. each image needs to be padded with copies of itself). Since ~100 such convolutions have to be done for one analysis, the speed of the convolution function is critical.
I have implemented and tested several functions, including the scipy implementation of convolve with method="direct", and the results are shown below. I used a (100, 100, 100) image and a (7, 7, 7) kernel to benchmark the methods here:
import numpy as np
import time
from scipy import signal
Nx, Ny, Nz = 100, 100, 100
image = np.random.rand(Nx, Ny, Nz)
kernel = np.random.rand(7, 7, 7)
signal.convolve(image, kernel, mode='same', method='direct')
took: 8.198s
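(For completeness: signal.convolve does not know about periodic boundaries by itself, so to use it for this problem I would have to wrap-pad the image and keep the valid part, something like the sketch below; the padding widths are my attempt to match the mode='same' centring.)
import numpy as np
from scipy import signal

def scipy_convolve_periodic(image, kernel):
    # Wrap-pad so that mode='valid' on the padded image gives a periodic, 'same'-sized result
    nx, ny, nz = kernel.shape
    padded = np.pad(image, ((nx // 2, nx - 1 - nx // 2),
                            (ny // 2, ny - 1 - ny // 2),
                            (nz // 2, nz - 1 - nz // 2)), mode='wrap')
    return signal.convolve(padded, kernel, mode='valid', method='direct')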
I then wrote my own function based on shifted array addition:
def shift_array(array, a, b, c):
    # Periodically shift the array by (a, b, c) along axes (0, 1, 2)
    A = np.roll(array, a, axis=0)
    B = np.roll(A, b, axis=1)
    C = np.roll(B, c, axis=2)
    return C
def matrix_convolve2(image, kernel, mode="periodic"):
    if mode not in ["periodic"]:
        raise NotImplementedError
    if mode == "periodic":
        Nx, Ny, Nz = image.shape
        nx, ny, nz = kernel.shape
        rx = nx//2
        ry = ny//2
        rz = nz//2
        result = np.zeros((Nx, Ny, Nz))
        for i in range(nx):
            for j in range(ny):
                for k in range(nz):
                    result += kernel[i,j,k] * shift_array(image, rx-i, ry-j, rz-k)
        return result
matrix_convolve2(image,kernel)
took: 6.324s
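(Side note: the three rolls in shift_array can be fused into a single np.roll call with tuple arguments; this is just an equivalent, slightly tidier form, not a claim about speed.)
def shift_array(array, a, b, c):
    # Equivalent to the three separate rolls above, done in one call
    return np.roll(array, (a, b, c), axis=(0, 1, 2))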
It seems the limiting factor here is the np.roll function used for the periodic boundary condition, so I tried to circumvent this by tiling up the input image:
def matrix_convolve_center(image, kernel, Nx, Ny, Nz):
    # Only get the convolution result for the "central" (Nx, Ny, Nz) block
    nx, ny, nz = kernel.shape
    rx = nx//2
    ry = ny//2
    rz = nz//2
    result = np.zeros((Nx, Ny, Nz))
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                result += kernel[i,j,k] * image[Nx+i-rx:2*Nx+i-rx,Ny+j-ry:2*Ny+j-ry,Nz+k-rz:2*Nz+k-rz]
    return result
def matrix_convolve3(image, kernel):
    Nx, Ny, Nz = image.shape
    nx, ny, nz = kernel.shape
    extended_image = np.tile(image, (3, 3, 3))
    result = matrix_convolve_center(extended_image, kernel, Nx, Ny, Nz)
    return result
matrix_convolve3(image,kernel)
took: 2.639s
This approach gives the best performance so far, but it is still far too slow for the actual application.
I did some research, and it seems that using Numba could significantly improve the performance, or maybe writing the same function in a parallel way could help too, but I'm not familiar with Numba or with Python parallelization (I had some bad experience with the multiprocess library... it seemed to skip iterations or suddenly stop sometimes).
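For concreteness, this is roughly the kind of Numba kernel I have in mind from the examples I've found (an untested sketch on my side; it assumes numba.njit with parallel=True and prange, and it mirrors the indexing of matrix_convolve2 above):
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def convolve_periodic_numba(image, kernel):
    # Direct sum with periodic boundaries; same indexing as matrix_convolve2
    Nx, Ny, Nz = image.shape
    nx, ny, nz = kernel.shape
    rx, ry, rz = nx // 2, ny // 2, nz // 2
    result = np.zeros((Nx, Ny, Nz))
    for x in prange(Nx):
        for y in range(Ny):
            for z in range(Nz):
                acc = 0.0
                for i in range(nx):
                    for j in range(ny):
                        for k in range(nz):
                            acc += kernel[i, j, k] * image[(x + i - rx) % Nx,
                                                           (y + j - ry) % Ny,
                                                           (z + k - rz) % Nz]
                result[x, y, z] = acc
    return result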
Could you guys help me here? Any improvement would be greatly appreciated. Thanks a lot!
This is far from conclusive, but for the examples I checked fft is indeed more accurate than naive (sequential) summation. So, unless you have good reason to believe that your data are somehow different, my recommendation would be: save yourself the trouble and use fft.
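In case it is useful, this is roughly what "use fft" looks like for the periodic case (a minimal sketch, assuming an odd-sized kernel; for cross-correlation, flip the kernel with kernel[::-1, ::-1, ::-1] first, as the test script below does):
import numpy as np

def fft_convolve_periodic(image, kernel):
    # Circular (periodic) convolution: pad the kernel to the image shape,
    # move its centre to index (0, 0, 0), and multiply the spectra.
    nx, ny, nz = kernel.shape
    kerpad = np.zeros_like(image)
    kerpad[:nx, :ny, :nz] = kernel
    kerpad = np.roll(kerpad, (-(nx // 2), -(ny // 2), -(nz // 2)), axis=(0, 1, 2))
    return np.fft.ifftn(np.fft.fftn(image) * np.fft.fftn(kerpad)).real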
UPDATE: Added my own direct method, taking care to ensure it uses pairwise summation. This manages to be a bit more accurate than fft, but is still very slow.
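(To see why the summation order matters at all, here is a toy illustration, separate from the benchmark below: accumulating a long float32 array left to right typically drifts further from a float64 reference than numpy's pairwise sum does.)
import numpy as np

x = np.random.rand(10**6).astype(np.float32)
naive = np.float32(0.0)
for v in x:                 # sequential left-to-right accumulation
    naive += v
print("sequential        :", naive)
print("pairwise (np.sum) :", x.sum())                    # numpy's sum uses pairwise summation
print("float64 reference :", x.astype(np.float64).sum())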
Test script:
import numpy as np
from scipy import stats, signal, fftpack
def matrix_convolve_center(image,kernel,Nx,Ny,Nz):
    # Only get convolve result for the "central" block
    nx, ny, nz = kernel.shape
    rx = nx//2
    ry = ny//2
    rz = nz//2
    result = np.zeros((Nx, Ny, Nz))
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                result += kernel[i,j,k] * image[Nx+i-rx:2*Nx+i-rx,Ny+j-ry:2*Ny+j-ry,Nz+k-rz:2*Nz+k-rz]
    return result
def matrix_convolve3(image,kernel):
    Nx, Ny, Nz = image.shape
    nx, ny, nz = kernel.shape
    extended_image = np.tile(image,(3,3,3))
    result = matrix_convolve_center(extended_image,kernel,Nx, Ny, Nz)
    return result
P=0 # parity
CH=10 # chunk size
# make integer example, so exact soln is readily available
image = np.random.randint(0,100,(8*CH+P,8*CH+P,8*CH+P))
kernel = np.random.randint(0,100,(2*CH+P,2*CH+P,2*CH+P))
kerpad = np.zeros_like(image)
kerpad[3*CH:-3*CH,3*CH:-3*CH,3*CH:-3*CH]=kernel[::-1,::-1,::-1]
cexa = np.round(fftpack.fftshift(fftpack.ifftn(fftpack.fftn(fftpack.ifftshift(image))*fftpack.fftn(fftpack.ifftshift(kerpad)))).real).astype(int)
# sanity check
assert cexa.sum() == kernel.sum() * image.sum()
# normalize to preclude integer arithmetic during the actual test
image = image / image.sum()
kernel = kernel / kernel.sum()
cexa = cexa / cexa.sum()
# fft method
kerpad = np.zeros_like(image)
kerpad[3*CH:-3*CH,3*CH:-3*CH,3*CH:-3*CH]=kernel[::-1,::-1,::-1]
cfft = fftpack.fftshift(fftpack.ifftn(fftpack.fftn(fftpack.ifftshift(image))*fftpack.fftn(fftpack.ifftshift(kerpad))))
def direct_pp(image,kernel):
    nx,ny,nz = image.shape
    kx,ky,kz = kernel.shape
    out = np.zeros_like(image)
    # wrap-pad the image so the periodic boundaries become a plain sliding window
    image = np.concatenate([image[...,-kz//2+1:],image,image[...,:kz//2+P]],axis=2)
    image = np.concatenate([image[:,-ky//2+1:],image,image[:,:ky//2+P]],axis=1)
    image = np.concatenate([image[-kx//2+1:],image,image[:kx//2+P]],axis=0)
    mx,my,mz = image.shape
    ox,oy,oz = 2*mx-nx,2*my-ny,2*mz-nz
    # aux2 is a sheared view into aux: writing the einsum result through aux2
    # aligns the partial products so that summing aux over its last two axes
    # completes the x/y part of the correlation via np.sum's pairwise summation
    aux = np.empty((ox,oy,kx,ky),image.dtype)
    s0,s1,s2,s3 = aux.strides
    aux2 = np.lib.stride_tricks.as_strided(aux[kx-1:,ky-1:],(mx,my,kx,ky),(s0,s1,s2-s0,s3-s1))
    for z in range(nz):
        # contract the kernel's z-axis against the corresponding z-window of the image
        aux2[...] = np.einsum('ijm,klm',image[...,z:z+kz],kernel)
        out[...,z] = aux[kx-1:kx-1+nx,ky-1:ky-1+ny].sum((2,3))
    return out
# direct methods
print("How about a coffee? (This may take some time...)")
from time import perf_counter as pc
T = []
T.append(pc())
cdirpp = direct_pp(image,kernel)
T.append(pc())
cdir = np.roll(matrix_convolve3(image,kernel),P-1,(0,1,2))
T.append(pc())
# compare squared error
nrm = (cexa**2).sum()
print('accuracy')
print('fft ',((cexa-cfft)*(cexa-cfft.conj())).real.sum()/nrm)
print('direct',((cexa-cdir)**2).sum()/nrm)
print('dir pp',((cexa-cdirpp)**2).sum()/nrm)
print('duration direct methods')
print('pp {} OP {}'.format(*np.diff(T)))
Sample run:
How about a coffee? (This may take some time...)
accuracy
fft 5.690597572945596e-32
direct 8.518853759493871e-30
dir pp 1.3317651721034386e-32
duration direct methods
pp 5.817311848048121 OP 20.05021938495338