pdist for theano tensor

2019-01-24 12:30发布

I have a theano symbolic matrix

x = T.fmatrix('input')

x will be later on populated by n vectors of dim d (at train time).

I would like to have the theano equivalent of pdist (scipy.spatial.distance.pdist of pdist), something like

D = theano.pdist( x )

How can I achieve this?

Calling scipy.spatial.distance.pdist on x directly does not work as x at this stage is only symbolic...

Update: I would very much like to be able to mimic pdist "compact" behavior: that is, computing only ~1/2 of the nxn entries of the distance matrix.

2条回答
做自己的国王
2楼-- · 2019-01-24 13:16

I haven't worked with Theano before, but here is a solution based on pure Numpy functions (perhaps you convert it to the equivalent theano functions. Note that I'm using automatic broadcasting in the expression below, so you might have to rewrite that explicitly if Theano doesn't support it):

# X is an m-by-n matrix (rows are examples, columns are dimensions)
# D is an m-by-m symmetric matrix of pairwise Euclidean distances
a = np.sum(X**2, axis=1)
D = np.sqrt((a + a[np.newaxis].T) - 2*np.dot(X, X.T))

It is based on the fact that: ||u-v||^2 = ||u||^2 + ||v||^2 - 2*u.v. (I showed this in previous answers of mine using MATLAB)

Here is a comparison against Scipy existing functions:

import numpy as np
from scipy.spatial.distance import pdist, squareform

def my_pdist(X):
    a = np.sum(X**2, axis=1)
    D = np.sqrt((a + a[np.newaxis].T) - 2*np.dot(X, X.T))
    return D

def scipy_pdist(X):
    D = squareform(pdist(X, metric='euclidean'))
    return D    

X = np.random.rand(5, 3)
D1 = my_pdist(X)
D2 = scipy_pdist(X)

The difference should be negligible, close to machine epsilon (np.spacing(1)):

>>> np.linalg.norm(D1-D2)
8.5368137554718277e-16

HTH


EDIT:

Here is another implementation with a single loop:

def my_pdist_compact(X):
    D = np.empty(shape=[0,0], dtype=X.dtype)
    for i in range(X.shape[0]-1):
        D = np.append(D, np.sqrt(np.sum((X[i,] - X[i+1:,])**2, axis=1)))
    return D

Somewhat equivalent MATLAB code:

function D = my_pdist_compact(X)
    n = size(X,1);
    D = cell(n-1,1);
    for i=1:n-1
        D{i} = sqrt(sum(bsxfun(@minus, X(i,:), X(i+1:end,:)).^2, 2));
    end
    D = vertcat(D{:});
end

This returns the pairwise-distances in compact form (upper triangular part of the symmetric matrix). This is the same output as pdist. Use squareform to convert it to full matrix.

>>> d1 = my_pdist_compact(X)
>>> d2 = pdist(X)    # from scipy.spatial.distance
>>> (d1 == d2).all()
True

I will leave it to you to see if it's possible to write the equivalent loop using Theano (see theano.scan)!

查看更多
唯我独甜
3楼-- · 2019-01-24 13:30

pdist from scipy is a collection of different functions - there doesn't exist a Theano equivalent for all of them at once. However, each specific distance, being a closed form mathematical expression, can be written down in Theano as such and then compiled.

Take as a example the minkowski p norm distance (copy+pasteable):

import theano
import theano.tensor as T
X = T.fmatrix('X')
Y = T.fmatrix('Y')
P = T.scalar('P')
translation_vectors = X.reshape((X.shape[0], 1, -1)) - Y.reshape((1, Y.shape[0], -1))
minkowski_distances = (abs(translation_vectors) ** P).sum(2) ** (1. / P)
f_minkowski = theano.function([X, Y, P], minkowski_distances)

Note that abs calls the built-in __abs__, so abs is also a theano function. We can now compare this to pdist:

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(42)
d = 20 # dimension
nX = 10
nY = 30
x = rng.randn(nX, d).astype(np.float32)
y = rng.randn(nY, d).astype(np.float32)

ps = [1., 3., 2.]

for p in ps:
    d_theano = f_minkowski(x, x, p)[np.triu_indices(nX, 1)]
    d_scipy = pdist(x, p=p, metric='minkowski')
    print "Testing p=%1.2f, discrepancy %1.3e" % (p, np.sqrt(((d_theano - d_scipy) ** 2).sum()))

This yields

Testing p=1.00, discrepancy 1.322e-06
Testing p=3.00, discrepancy 4.277e-07
Testing p=2.00, discrepancy 4.789e-07

As you can see, the correspondence is there, but the function f_minkowski is slightly more general, since it compares the lines of two possibly different arrays. If twice the same array is passed as input, f_minkowski returns a matrix, whereas pdist returns a list without redundancy. If this behaviour is desired, it can also be implemented fully dynamically, but I will stick to the general case here.

One possibility of specialization should be noted though: In the case of p=2, the calculations become simpler through the binomial formula, and this can be used to save precious space in memory: Whereas the general Minkowski distance, as implemented above, creates a 3D array (due to avoidance of for-loops and summing cumulatively), which is prohibitive, depending on the dimension d (and nX, nY), for p=2 we can write

squared_euclidean_distances = (X ** 2).sum(1).reshape((X.shape[0], 1)) + (Y ** 2).sum(1).reshape((1, Y.shape[0])) - 2 * X.dot(Y.T)
f_euclidean = theano.function([X, Y], T.sqrt(squared_euclidean_distances))

which only uses O(nX * nY) space instead of O(nX * nY * d) We check for correspondence, this time on the general problem:

d_eucl = f_euclidean(x, y)
d_minkowski2 = f_minkowski(x, y, 2.)
print "Comparing f_minkowski, p=2 and f_euclidean: l2-discrepancy %1.3e" % ((d_eucl - d_minkowski2) ** 2).sum()

yielding

Comparing f_minkowski, p=2 and f_euclidean: l2-discrepancy 1.464e-11
查看更多
登录 后发表回答