I am implementing an algorithm for texture synthesis as outlined here. For this I need to calculate the sum of squared differences (SSD), a metric that estimates the error between the template and different positions across the image. I have a slow working implementation in place as follows:
total_weight = valid_mask.sum()
for i in xrange(input_image.shape[0]):
    for j in xrange(input_image.shape[1]):
        sample = input_image[i:i + window, j:j + window]
        dist = (template - sample) ** 2
        ssd[i, j] = (dist * valid_mask).sum() / total_weight
Here, total_weight is just for normalisation. Some pixels have unknown intensities, so I use valid_mask to mask them out. This nested loop sits inside two more loops, so that's 4 nested loops, which is obviously a performance killer!

Is there a way I can make it faster in NumPy or Python, a replacement for this nested loop? Is vectorization possible? I'll need to work on a (3, 3) patch of the image against the (3, 3) template.

I am subsequently going to implement this in Cython, so the faster I can get it to work using just NumPy, the better.
You can find the complete code here; lines 62-67 are quoted above.
Thanks,
Chintak
This is basically an improvement over Warren Weckesser's answer. The way to go is clearly a multidimensional windowed view of the original array, but you want to keep that view from triggering a copy. If you expand sum((a - b)**2), you can turn it into sum(a**2) + sum(b**2) - 2*sum(a*b), and these multiply-then-sum reductions can be performed with linear algebra operations (here, np.einsum), with a substantial improvement in both performance and memory use:
import numpy as np
from numpy.lib.stride_tricks import as_strided

def sumsqdiff3(input_image, template):
    window_size = template.shape
    # 4-D view: y[i, j] is the window input_image[i:i+wh, j:j+ww], no copy made
    y = as_strided(input_image,
                   shape=(input_image.shape[0] - window_size[0] + 1,
                          input_image.shape[1] - window_size[1] + 1,) +
                         window_size,
                   strides=input_image.strides * 2)
    ssd = np.einsum('ijkl,kl->ij', y, template)    # sum(a*b) per window
    ssd *= -2
    ssd += np.einsum('ijkl,ijkl->ij', y, y)        # sum(a**2) per window
    ssd += np.einsum('ij,ij', template, template)  # sum(b**2), a scalar
    return ssd
In [288]: img = np.random.rand(500, 500)
In [289]: template = np.random.rand(3, 3)
In [290]: %timeit a = sumsqdiff2(img, template) # Warren's function
10 loops, best of 3: 59.4 ms per loop
In [291]: %timeit b = sumsqdiff3(img, template)
100 loops, best of 3: 18.2 ms per loop
In [292]: np.allclose(a, b)
Out[292]: True
I have left the valid_mask parameter out on purpose, because I don't fully understand how you would use it. In principle, just zeroing the corresponding values in template and/or input_image should do the same trick.
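In case it helps, here is a hedged sketch of folding valid_mask into the same expansion (my own extension, untested against the original code; the name sumsqdiff3_masked is made up): with a mask m, sum(m*(a - b)**2) = sum(m*a**2) - 2*sum(m*a*b) + sum(m*b**2), and each term is again an einsum over the strided view:

def sumsqdiff3_masked(input_image, template, valid_mask):
    # Sketch only: same strided view as sumsqdiff3, with the mask folded in.
    window_size = template.shape
    y = as_strided(input_image,
                   shape=(input_image.shape[0] - window_size[0] + 1,
                          input_image.shape[1] - window_size[1] + 1,) +
                         window_size,
                   strides=input_image.strides * 2)
    masked_template = template * valid_mask
    ssd = np.einsum('ijkl,kl->ij', y, masked_template)      # sum(m*a*b) per window
    ssd *= -2
    ssd += np.einsum('ijkl,ijkl,kl->ij', y, y, valid_mask)  # sum(m*a**2) per window
    ssd += np.einsum('ij,ij', masked_template, template)    # sum(m*b**2), a scalar
    return ssd / valid_mask.sum()                           # normalise as in the question

A quick comparison against a masked version of the loop code would be worth doing before relying on this.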
You can do some amazing things with the as_strided function combined with NumPy's broadcasting. Here are two versions of your function:
import numpy as np
from numpy.lib.stride_tricks import as_strided

def sumsqdiff(input_image, template, valid_mask=None):
    if valid_mask is None:
        valid_mask = np.ones_like(template)
    total_weight = valid_mask.sum()
    window_size = template.shape
    ssd = np.empty((input_image.shape[0] - window_size[0] + 1,
                    input_image.shape[1] - window_size[1] + 1))
    for i in xrange(ssd.shape[0]):
        for j in xrange(ssd.shape[1]):
            sample = input_image[i:i + window_size[0], j:j + window_size[1]]
            dist = (template - sample) ** 2
            ssd[i, j] = (dist * valid_mask).sum()
    return ssd
def sumsqdiff2(input_image, template, valid_mask=None):
    if valid_mask is None:
        valid_mask = np.ones_like(template)
    total_weight = valid_mask.sum()
    window_size = template.shape
    # Create a 4-D array y, such that y[i,j,:,:] is the 2-D window
    #     input_image[i:i+window_size[0], j:j+window_size[1]]
    y = as_strided(input_image,
                   shape=(input_image.shape[0] - window_size[0] + 1,
                          input_image.shape[1] - window_size[1] + 1,) +
                         window_size,
                   strides=input_image.strides * 2)
    # Compute the sum of squared differences using broadcasting.
    ssd = ((y - template) ** 2 * valid_mask).sum(axis=-1).sum(axis=-1)
    return ssd
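As a quick sanity check of the strided view (my addition, not part of the original answer), each [i, j] entry of y should equal the corresponding explicit window of the image:

img = np.arange(35, dtype=float).reshape(5, 7)
y = as_strided(img,
               shape=(img.shape[0] - 2, img.shape[1] - 2, 3, 3),
               strides=img.strides * 2)
print(np.array_equal(y[1, 2], img[1:4, 2:5]))  # True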
Here's an ipython session to compare them.
The template that I'll use for the demo:
In [72]: template
Out[72]:
array([[-1,  1, -1],
       [ 1,  2,  1],
       [-1,  1, -1]])
A small input so we can inspect the result:
In [73]: x
Out[73]:
array([[  0.,   1.,   2.,   3.,   4.,   5.,   6.],
       [  7.,   8.,   9.,  10.,  11.,  12.,  13.],
       [ 14.,  15.,  16.,  17.,  18.,  19.,  20.],
       [ 21.,  22.,  23.,  24.,  25.,  26.,  27.],
       [ 28.,  29.,  30.,  31.,  32.,  33.,  34.]])
Apply the two functions to x
and check that we get the same result:
In [74]: sumsqdiff(x, template)
Out[74]:
array([[  856.,  1005.,  1172.,  1357.,  1560.],
       [ 2277.,  2552.,  2845.,  3156.,  3485.],
       [ 4580.,  4981.,  5400.,  5837.,  6292.]])
In [75]: sumsqdiff2(x, template)
Out[75]:
array([[  856.,  1005.,  1172.,  1357.,  1560.],
       [ 2277.,  2552.,  2845.,  3156.,  3485.],
       [ 4580.,  4981.,  5400.,  5837.,  6292.]])
Now make a much bigger input "image":
In [76]: z = np.random.randn(500, 500)
and check the performance:
In [77]: %timeit sumsqdiff(z, template)
1 loops, best of 3: 3.55 s per loop
In [78]: %timeit sumsqdiff2(z, template)
10 loops, best of 3: 33 ms per loop
Not too shabby. :)
Two drawbacks:

- The calculation in sumsqdiff2 will generate a temporary array that, for a 3x3 template, will be 9 times the size of input_image. (In general it will be template.size times the size of input_image; a rough estimate follows the list.)
- These "stride tricks" will not help you when you Cythonize the code. When converting to Cython, you often end up putting back in the loops you got rid of when vectorizing with numpy.
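To put a rough number on the first drawback (my own back-of-the-envelope estimate, assuming float64 and the 500x500 image from the timings above): the (y - template) temporary has shape (498, 498, 3, 3), which is about 18 MB versus roughly 2 MB for the image itself:

temp_bytes = 498 * 498 * 3 * 3 * 8  # ~17.9 MB for the temporary
img_bytes = 500 * 500 * 8           # ~2.0 MB for the image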
I think you did a good job implementing your algorithm. Vectorization is an option, but I encourage you to try the Numba optimizing compiler, which compiles Python syntax to machine code. The effect is pretty impressive. Numba vs. Cython: Take 2 is a very brief introduction to Numba and a performance comparison.
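For illustration, a minimal sketch of what the Numba route could look like (my own sketch, not from the linked post; it assumes the numba package is installed, and the function name sumsqdiff_numba is made up). It keeps your explicit loops and simply lets the JIT compile them:

import numpy as np
from numba import jit

@jit(nopython=True)
def sumsqdiff_numba(input_image, template, valid_mask):
    # Same nested-loop SSD as the original Python version, compiled by Numba.
    wh, ww = template.shape
    out_h = input_image.shape[0] - wh + 1
    out_w = input_image.shape[1] - ww + 1
    ssd = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0
            for k in range(wh):
                for l in range(ww):
                    d = template[k, l] - input_image[i + k, j + l]
                    acc += d * d * valid_mask[k, l]
            ssd[i, j] = acc
    return ssd

The loops stay essentially as in your Python version; only the decorator is added, which is the main appeal of Numba here.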
It might be worth your while to check how it performs if you rearrange the algorithm to do the computations row-wise. The idea is that you might use the CPU cache better by reading memory consecutively.
Pseudo code:
for template_row in template:
    for row in image:
        for col in image:
            # find distance from template_row to the sample row
            # add the sum to ssd[row - template_row, col]
Actual code (after Warren's):
def sumsqdiffr(input_image, template, valid_mask=None):
    if valid_mask is None:
        valid_mask = np.ones_like(template)
    total_weight = valid_mask.sum()
    window_size = template.shape
    ssd = np.zeros((input_image.shape[0] - window_size[0] + 1,
                    input_image.shape[1] - window_size[1] + 1))
    for tr in xrange(template.shape[0]):
        for i in xrange(tr, ssd.shape[0] + tr):
            for j in xrange(ssd.shape[1]):
                sample = input_image[i, j:j + window_size[1]]
                dist = (template[tr] - sample) ** 2
                ssd[i - tr, j] += (dist * valid_mask[tr]).sum()
    return ssd
It's more than two times slower than the original implementation.

(If somebody would like to enlighten me as to whether the whole idea was wrong or what's causing this, I'd be glad to gain some understanding of it.)