I am implementing an algorithm for texture synthesis as outlined here. For this I need to calculate the sum of squared differences (SSD), a metric that estimates the error between the template and different positions across the image. I have a slow working implementation in place as follows:
total_weight = valid_mask.sum()
for i in xrange(input_image.shape[0]):
    for j in xrange(input_image.shape[1]):
        sample = input_image[i:i + window, j:j + window]
        dist = (template - sample) ** 2
        ssd[i, j] = (dist * valid_mask).sum() / total_weight
Here, total_weight is just for normalisation. Some pixels have unknown intensities, so I use valid_mask to mask them out. This nested loop sits inside two more loops, so that's 4 nested loops, which is obviously a performance killer!

Is there a way I can make it faster in NumPy or Python, a replacement for this nested loop? Is vectorization possible? I'll need to work on a (3, 3) patch of the image against the (3, 3) template.

I am subsequently going to implement this in Cython, so the faster I can get it to work using just NumPy, the better.
You can find the complete code here; lines 62-67 are quoted above.
Thanks,
Chintak
This is basically an improvement over Warren Weckesser's answer. The way to go is clearly a multidimensional windowed view of the original array, but you want to keep that view from triggering a copy. If you expand sum((a - b)**2), you can turn it into sum(a**2) + sum(b**2) - 2*sum(a*b), and these multiply-then-sum reductions can be performed with linear algebra operations (here, np.einsum), with a substantial improvement in both performance and memory use:
import numpy as np
from numpy.lib.stride_tricks import as_strided

def sumsqdiff3(input_image, template):
    window_size = template.shape
    # 4-D view: y[i, j] is the window input_image[i:i+wh, j:j+ww], no copy made
    y = as_strided(input_image,
                   shape=(input_image.shape[0] - window_size[0] + 1,
                          input_image.shape[1] - window_size[1] + 1,) +
                         window_size,
                   strides=input_image.strides * 2)
    ssd = np.einsum('ijkl,kl->ij', y, template)    # sum(a*b) per window
    ssd *= -2
    ssd += np.einsum('ijkl,ijkl->ij', y, y)        # sum(a**2) per window
    ssd += np.einsum('ij,ij', template, template)  # sum(b**2), a scalar
    return ssd
In [288]: img = np.random.rand(500, 500)
In [289]: template = np.random.rand(3, 3)
In [290]: %timeit a = sumsqdiff2(img, template) # Warren's function
10 loops, best of 3: 59.4 ms per loop
In [291]: %timeit b = sumsqdiff3(img, template)
100 loops, best of 3: 18.2 ms per loop
In [292]: np.allclose(a, b)
Out[292]: True
I have left the valid_mask parameter out on purpose, because I don't fully understand how you would use it. In principle, just zeroing the corresponding values in template and/or input_image should do the same trick.
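In case it helps, here is a hedged sketch of folding valid_mask into the same expansion (my own extension, untested against the original code; the name sumsqdiff3_masked is made up): with a mask m, sum(m*(a - b)**2) = sum(m*a**2) - 2*sum(m*a*b) + sum(m*b**2), and each term is again an einsum over the strided view:

def sumsqdiff3_masked(input_image, template, valid_mask):
    # Sketch only: same strided view as sumsqdiff3, with the mask folded in.
    window_size = template.shape
    y = as_strided(input_image,
                   shape=(input_image.shape[0] - window_size[0] + 1,
                          input_image.shape[1] - window_size[1] + 1,) +
                         window_size,
                   strides=input_image.strides * 2)
    masked_template = template * valid_mask
    ssd = np.einsum('ijkl,kl->ij', y, masked_template)      # sum(m*a*b) per window
    ssd *= -2
    ssd += np.einsum('ijkl,ijkl,kl->ij', y, y, valid_mask)  # sum(m*a**2) per window
    ssd += np.einsum('ij,ij', masked_template, template)    # sum(m*b**2), a scalar
    return ssd / valid_mask.sum()                           # normalise as in the question

A quick comparison against a masked version of the loop code would be worth doing before relying on this.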
You can do some amazing things with the as_strided function combined with NumPy's broadcasting. Here are two versions of your function:
import numpy as np
from numpy.lib.stride_tricks import as_strided

def sumsqdiff(input_image, template, valid_mask=None):
    if valid_mask is None:
        valid_mask = np.ones_like(template)
    total_weight = valid_mask.sum()
    window_size = template.shape
    ssd = np.empty((input_image.shape[0] - window_size[0] + 1,
                    input_image.shape[1] - window_size[1] + 1))
    for i in xrange(ssd.shape[0]):
        for j in xrange(ssd.shape[1]):
            sample = input_image[i:i + window_size[0], j:j + window_size[1]]
            dist = (template - sample) ** 2
            ssd[i, j] = (dist * valid_mask).sum()
    return ssd
def sumsqdiff2(input_image, template, valid_mask=None):
    if valid_mask is None:
        valid_mask = np.ones_like(template)
    total_weight = valid_mask.sum()
    window_size = template.shape
    # Create a 4-D array y, such that y[i,j,:,:] is the 2-D window
    #     input_image[i:i+window_size[0], j:j+window_size[1]]
    y = as_strided(input_image,
                   shape=(input_image.shape[0] - window_size[0] + 1,
                          input_image.shape[1] - window_size[1] + 1,) +
                         window_size,
                   strides=input_image.strides * 2)
    # Compute the sum of squared differences using broadcasting.
    ssd = ((y - template) ** 2 * valid_mask).sum(axis=-1).sum(axis=-1)
    return ssd
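As a quick sanity check of the strided view (my addition, not part of the original answer), each [i, j] entry of y should equal the corresponding explicit window of the image:

img = np.arange(35, dtype=float).reshape(5, 7)
y = as_strided(img,
               shape=(img.shape[0] - 2, img.shape[1] - 2, 3, 3),
               strides=img.strides * 2)
print(np.array_equal(y[1, 2], img[1:4, 2:5]))  # True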
Here's an ipython session to compare them.
The template that I'll use for the demo:
In [72]: template
Out[72]:
array([[-1,  1, -1],
       [ 1,  2,  1],
       [-1,  1, -1]])
A small input so we can inspect the result:
In [73]: x
Out[73]:
array([[  0.,   1.,   2.,   3.,   4.,   5.,   6.],
       [  7.,   8.,   9.,  10.,  11.,  12.,  13.],
       [ 14.,  15.,  16.,  17.,  18.,  19.,  20.],
       [ 21.,  22.,  23.,  24.,  25.,  26.,  27.],
       [ 28.,  29.,  30.,  31.,  32.,  33.,  34.]])
Apply the two functions to x
and check that we get the same result:
In [74]: sumsqdiff(x, template)
Out[74]:
array([[  856.,  1005.,  1172.,  1357.,  1560.],
       [ 2277.,  2552.,  2845.,  3156.,  3485.],
       [ 4580.,  4981.,  5400.,  5837.,  6292.]])
In [75]: sumsqdiff2(x, template)
Out[75]:
array([[  856.,  1005.,  1172.,  1357.,  1560.],
       [ 2277.,  2552.,  2845.,  3156.,  3485.],
       [ 4580.,  4981.,  5400.,  5837.,  6292.]])
Now make a much bigger input "image":
In [76]: z = np.random.randn(500, 500)
and check the performance:
In [77]: %timeit sumsqdiff(z, template)
1 loops, best of 3: 3.55 s per loop
In [78]: %timeit sumsqdiff2(z, template)
10 loops, best of 3: 33 ms per loop
Not too shabby. :)
Two drawbacks:

- The calculation in sumsqdiff2 will generate a temporary array that, for a 3x3 template, will be 9 times the size of input_image. (In general it will be template.size times the size of input_image; a rough estimate follows the list.)
- These "stride tricks" will not help you when you Cythonize the code. When converting to Cython, you often end up putting back in the loops you got rid of when vectorizing with numpy.
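To put a rough number on the first drawback (my own back-of-the-envelope estimate, assuming float64 and the 500x500 image from the timings above): the (y - template) temporary has shape (498, 498, 3, 3), which is about 18 MB versus roughly 2 MB for the image itself:

temp_bytes = 498 * 498 * 3 * 3 * 8  # ~17.9 MB for the temporary
img_bytes = 500 * 500 * 8           # ~2.0 MB for the image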
I think you did a good job implementing your algorithm. Vectorization is an option, but I encourage you to try the Numba optimizing compiler, which compiles Python syntax to machine code. The effect is pretty impressive. Numba vs. Cython: Take 2 is a very brief introduction to Numba and a performance comparison.
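For illustration, a minimal sketch of what the Numba route could look like (my own sketch, not from the linked post; it assumes the numba package is installed, and the function name sumsqdiff_numba is made up). It keeps your explicit loops and simply lets the JIT compile them:

import numpy as np
from numba import jit

@jit(nopython=True)
def sumsqdiff_numba(input_image, template, valid_mask):
    # Same nested-loop SSD as the original Python version, compiled by Numba.
    wh, ww = template.shape
    out_h = input_image.shape[0] - wh + 1
    out_w = input_image.shape[1] - ww + 1
    ssd = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0
            for k in range(wh):
                for l in range(ww):
                    d = template[k, l] - input_image[i + k, j + l]
                    acc += d * d * valid_mask[k, l]
            ssd[i, j] = acc
    return ssd

The loops stay essentially as in your Python version; only the decorator is added, which is the main appeal of Numba here.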
It might be worth your while to check how it performs if you rearrange the algorithm to do the computations row-wise. The idea is that you might use the CPU cache better by reading memory consecutively.
Pseudo code:
for template_row in template:
    for row in image:
        for col in image:
            # find distance from template_row to the sample row
            # add the sum to ssd[row - template_row, col]
Actual code (after Warren's):
def sumsqdiffr(input_image, template, valid_mask=None):
    if valid_mask is None:
        valid_mask = np.ones_like(template)
    total_weight = valid_mask.sum()
    window_size = template.shape
    ssd = np.zeros((input_image.shape[0] - window_size[0] + 1,
                    input_image.shape[1] - window_size[1] + 1))
    for tr in xrange(template.shape[0]):
        for i in xrange(tr, ssd.shape[0] + tr):
            for j in xrange(ssd.shape[1]):
                sample = input_image[i, j:j + window_size[1]]
                dist = (template[tr] - sample) ** 2
                ssd[i - tr, j] += (dist * valid_mask[tr]).sum()
    return ssd
It's more than two times slower than the original implementation.

(If somebody would like to enlighten me as to whether the whole idea was wrong or what's causing this, I'd be glad to gain some understanding of it.)