I'm working on calculating convolutions (cross-correlations) of 3D images. Due to the nature of the problem, FFT-based approximations of convolution (e.g. scipy's fftconvolve) are not desired, and the "direct sum" is the way to go. The images are ~(150, 150, 150) in size, and the largest kernels are ~(40, 40, 40). The images are periodic (they have periodic boundary conditions, i.e. each image needs to be padded with copies of itself). Since ~100 such convolutions have to be done for one analysis, the speed of the convolution function is critical.
I have implemented and tested several functions, including the scipy implementation of convolve with method="direct", and the results are shown below. I used a (100, 100, 100) image and a (7, 7, 7) kernel to benchmark the methods here:
import numpy as np
import time
from scipy import signal
Nx, Ny, Nz = 100, 100, 100
image = np.random.rand(Nx, Ny, Nz)
kernel = np.random.rand(7, 7, 7)
signal.convolve(image, kernel, mode='same', method='direct')
took: 8.198s
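(For completeness: signal.convolve does not know about periodic boundaries by itself, so to use it for this problem I would have to wrap-pad the image and keep the valid part, something like the sketch below; the padding widths are my attempt to match the mode='same' centring.)
import numpy as np
from scipy import signal

def scipy_convolve_periodic(image, kernel):
    # Wrap-pad so that mode='valid' on the padded image gives a periodic, 'same'-sized result
    nx, ny, nz = kernel.shape
    padded = np.pad(image, ((nx // 2, nx - 1 - nx // 2),
                            (ny // 2, ny - 1 - ny // 2),
                            (nz // 2, nz - 1 - nz // 2)), mode='wrap')
    return signal.convolve(padded, kernel, mode='valid', method='direct')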
I then wrote my own function based on shifted array addition:
def shift_array(array, a, b, c):
    # Periodically shift the array by (a, b, c) along axes (0, 1, 2)
    A = np.roll(array, a, axis=0)
    B = np.roll(A, b, axis=1)
    C = np.roll(B, c, axis=2)
    return C
def matrix_convolve2(image, kernel, mode="periodic"):
    if mode not in ["periodic"]:
        raise NotImplementedError
    if mode == "periodic":
        Nx, Ny, Nz = image.shape
        nx, ny, nz = kernel.shape
        rx = nx//2
        ry = ny//2
        rz = nz//2
        result = np.zeros((Nx, Ny, Nz))
        for i in range(nx):
            for j in range(ny):
                for k in range(nz):
                    result += kernel[i,j,k] * shift_array(image, rx-i, ry-j, rz-k)
        return result
matrix_convolve2(image,kernel)
took: 6.324s
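(Side note: the three rolls in shift_array can be fused into a single np.roll call with tuple arguments; this is just an equivalent, slightly tidier form, not a claim about speed.)
def shift_array(array, a, b, c):
    # Equivalent to the three separate rolls above, done in one call
    return np.roll(array, (a, b, c), axis=(0, 1, 2))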
It seems the limiting factor here is the np.roll function used for the periodic boundary condition, so I tried to circumvent this by tiling up the input image:
def matrix_convolve_center(image, kernel, Nx, Ny, Nz):
    # Only get the convolution result for the "central" (Nx, Ny, Nz) block
    nx, ny, nz = kernel.shape
    rx = nx//2
    ry = ny//2
    rz = nz//2
    result = np.zeros((Nx, Ny, Nz))
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                result += kernel[i,j,k] * image[Nx+i-rx:2*Nx+i-rx,Ny+j-ry:2*Ny+j-ry,Nz+k-rz:2*Nz+k-rz]
    return result
def matrix_convolve3(image, kernel):
    Nx, Ny, Nz = image.shape
    nx, ny, nz = kernel.shape
    extended_image = np.tile(image, (3, 3, 3))
    result = matrix_convolve_center(extended_image, kernel, Nx, Ny, Nz)
    return result
matrix_convolve3(image,kernel)
took: 2.639s
This approach gives the best performance so far, but it is still far too slow for the actual application.
I did some research, and it seems that using Numba could significantly improve the performance, or maybe writing the same function in a parallel way could help too, but I'm not familiar with Numba or with Python parallelization (I had some bad experience with the multiprocess library... it seemed to skip iterations or suddenly stop sometimes).
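For concreteness, this is roughly the kind of Numba kernel I have in mind from the examples I've found (an untested sketch on my side; it assumes numba.njit with parallel=True and prange, and it mirrors the indexing of matrix_convolve2 above):
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def convolve_periodic_numba(image, kernel):
    # Direct sum with periodic boundaries; same indexing as matrix_convolve2
    Nx, Ny, Nz = image.shape
    nx, ny, nz = kernel.shape
    rx, ry, rz = nx // 2, ny // 2, nz // 2
    result = np.zeros((Nx, Ny, Nz))
    for x in prange(Nx):
        for y in range(Ny):
            for z in range(Nz):
                acc = 0.0
                for i in range(nx):
                    for j in range(ny):
                        for k in range(nz):
                            acc += kernel[i, j, k] * image[(x + i - rx) % Nx,
                                                           (y + j - ry) % Ny,
                                                           (z + k - rz) % Nz]
                result[x, y, z] = acc
    return result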
Could you guys help me here? Any improvement would be greatly appreciated. Thanks a lot!
This is far from conclusive, but for the examples I checked fft is indeed more accurate than naive (sequential) summation. So, unless you have good reason to believe that your data are somehow different, my recommendation would be: save yourself the trouble and use fft.
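In case it is useful, this is roughly what "use fft" looks like for the periodic case (a minimal sketch, assuming an odd-sized kernel; for cross-correlation, flip the kernel with kernel[::-1, ::-1, ::-1] first, as the test script below does):
import numpy as np

def fft_convolve_periodic(image, kernel):
    # Circular (periodic) convolution: pad the kernel to the image shape,
    # move its centre to index (0, 0, 0), and multiply the spectra.
    nx, ny, nz = kernel.shape
    kerpad = np.zeros_like(image)
    kerpad[:nx, :ny, :nz] = kernel
    kerpad = np.roll(kerpad, (-(nx // 2), -(ny // 2), -(nz // 2)), axis=(0, 1, 2))
    return np.fft.ifftn(np.fft.fftn(image) * np.fft.fftn(kerpad)).real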
UPDATE: Added my own direct method, taking care to ensure it uses pairwise summation. This manages to be a bit more accurate than fft, but is still very slow.
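(To see why the summation order matters at all, here is a toy illustration, separate from the benchmark below: accumulating a long float32 array left to right typically drifts further from a float64 reference than numpy's pairwise sum does.)
import numpy as np

x = np.random.rand(10**6).astype(np.float32)
naive = np.float32(0.0)
for v in x:                 # sequential left-to-right accumulation
    naive += v
print("sequential        :", naive)
print("pairwise (np.sum) :", x.sum())                    # numpy's sum uses pairwise summation
print("float64 reference :", x.astype(np.float64).sum())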
Test script:
import numpy as np
from scipy import stats, signal, fftpack
def matrix_convolve_center(image,kernel,Nx,Ny,Nz):
    # Only get convolve result for the "central" block
    nx, ny, nz = kernel.shape
    rx = nx//2
    ry = ny//2
    rz = nz//2
    result = np.zeros((Nx, Ny, Nz))
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                result += kernel[i,j,k] * image[Nx+i-rx:2*Nx+i-rx,Ny+j-ry:2*Ny+j-ry,Nz+k-rz:2*Nz+k-rz]
    return result
def matrix_convolve3(image,kernel):
    Nx, Ny, Nz = image.shape
    nx, ny, nz = kernel.shape
    extended_image = np.tile(image,(3,3,3))
    result = matrix_convolve_center(extended_image,kernel,Nx, Ny, Nz)
    return result
P=0 # parity
CH=10 # chunk size
# make integer example, so exact soln is readily available
image = np.random.randint(0,100,(8*CH+P,8*CH+P,8*CH+P))
kernel = np.random.randint(0,100,(2*CH+P,2*CH+P,2*CH+P))
kerpad = np.zeros_like(image)
kerpad[3*CH:-3*CH,3*CH:-3*CH,3*CH:-3*CH]=kernel[::-1,::-1,::-1]
cexa = np.round(fftpack.fftshift(fftpack.ifftn(fftpack.fftn(fftpack.ifftshift(image))*fftpack.fftn(fftpack.ifftshift(kerpad)))).real).astype(int)
# sanity check
assert cexa.sum() == kernel.sum() * image.sum()
# normalize to preclude integer arithmetic during the actual test
image = image / image.sum()
kernel = kernel / kernel.sum()
cexa = cexa / cexa.sum()
# fft method
kerpad = np.zeros_like(image)
kerpad[3*CH:-3*CH,3*CH:-3*CH,3*CH:-3*CH]=kernel[::-1,::-1,::-1]
cfft = fftpack.fftshift(fftpack.ifftn(fftpack.fftn(fftpack.ifftshift(image))*fftpack.fftn(fftpack.ifftshift(kerpad))))
def direct_pp(image,kernel):
    nx,ny,nz = image.shape
    kx,ky,kz = kernel.shape
    out = np.zeros_like(image)
    # wrap-pad the image so the periodic boundaries become a plain sliding window
    image = np.concatenate([image[...,-kz//2+1:],image,image[...,:kz//2+P]],axis=2)
    image = np.concatenate([image[:,-ky//2+1:],image,image[:,:ky//2+P]],axis=1)
    image = np.concatenate([image[-kx//2+1:],image,image[:kx//2+P]],axis=0)
    mx,my,mz = image.shape
    ox,oy,oz = 2*mx-nx,2*my-ny,2*mz-nz
    # aux2 is a sheared view into aux: writing the einsum result through aux2
    # aligns the partial products so that summing aux over its last two axes
    # completes the x/y part of the correlation via np.sum's pairwise summation
    aux = np.empty((ox,oy,kx,ky),image.dtype)
    s0,s1,s2,s3 = aux.strides
    aux2 = np.lib.stride_tricks.as_strided(aux[kx-1:,ky-1:],(mx,my,kx,ky),(s0,s1,s2-s0,s3-s1))
    for z in range(nz):
        # contract the kernel's z-axis against the corresponding z-window of the image
        aux2[...] = np.einsum('ijm,klm',image[...,z:z+kz],kernel)
        out[...,z] = aux[kx-1:kx-1+nx,ky-1:ky-1+ny].sum((2,3))
    return out
# direct methods
print("How about a coffee? (This may take some time...)")
from time import perf_counter as pc
T = []
T.append(pc())
cdirpp = direct_pp(image,kernel)
T.append(pc())
cdir = np.roll(matrix_convolve3(image,kernel),P-1,(0,1,2))
T.append(pc())
# compare squared error
nrm = (cexa**2).sum()
print('accuracy')
print('fft ',((cexa-cfft)*(cexa-cfft.conj())).real.sum()/nrm)
print('direct',((cexa-cdir)**2).sum()/nrm)
print('dir pp',((cexa-cdirpp)**2).sum()/nrm)
print('duration direct methods')
print('pp {} OP {}'.format(*np.diff(T)))
Sample run:
How about a coffee? (This may take some time...)
accuracy
fft 5.690597572945596e-32
direct 8.518853759493871e-30
dir pp 1.3317651721034386e-32
duration direct methods
pp 5.817311848048121 OP 20.05021938495338