I'm a bit stumped by how scipy.spatial.distance.pdist handles missing (nan) values.
So just in case I messed up the dimensions of my matrix, let's get that out of the way. From the docs:
The points are arranged as m n-dimensional row vectors in the matrix X.
So let's generate three points in 10-dimensional space with missing values:
import numpy
from scipy.spatial.distance import pdist
numpy.random.seed(123456789)
data = numpy.random.rand(3, 10) * 5
data[data < 1.0] = numpy.nan
If I compute the Euclidean distance of these three observations:
pdist(data, "euclidean")
I get:
array([ nan, nan, nan])
However, if I filter out all the columns that contain missing values, I do get proper distance values:
valid = [i for (i, col) in enumerate(data.T) if ~numpy.isnan(col).any()]
pdist(data[:, valid], "euclidean")
I get:
array([ 3.35518662, 2.35481185, 3.10323893])
This way I throw away more data than I'd like, since I don't need to filter the whole matrix, only the pair of vectors currently being compared. Can I make pdist or a similar function perform such pairwise masking somehow?
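To make concrete what I mean by pairwise masking: pdist accepts a user-supplied two-argument metric, so the behaviour I'm after is equivalent to the sketch below (the name nan_euclidean is just mine for illustration), only hopefully faster than a Python-level callable:
def nan_euclidean(u, v):
    # use only the dimensions that are finite in both vectors
    both = numpy.isfinite(u) & numpy.isfinite(v)
    return numpy.sqrt(((u[both] - v[both]) ** 2).sum())
pdist(data, nan_euclidean)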
Edit:
Since my full matrix is rather large, I did some timing tests on the small data set provided here.
1.) The scipy function.
%timeit pdist(data, "euclidean")
10000 loops, best of 3: 24.4 µs per loop
2.) Unfortunately, the solution provided so far is roughly 10 times slower.
import itertools
%timeit numpy.array([pdist(data[s][:, ~numpy.isnan(data[s]).any(axis=0)], "euclidean") for s in map(list, itertools.combinations(range(data.shape[0]), 2))]).ravel()
1000 loops, best of 3: 231 µs per loop
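Unrolled for readability, that one-liner (my paraphrase of it) loops over all row pairs, drops every column that is nan in either of the two rows, and calls pdist on the remaining two-row submatrix:
dists = []
for s in map(list, itertools.combinations(range(data.shape[0]), 2)):
    sub = data[s]                          # the two rows being compared
    cols = ~numpy.isnan(sub).any(axis=0)   # columns that are finite in both rows
    dists.append(pdist(sub[:, cols], "euclidean"))
result = numpy.array(dists).ravel()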
3.) Then I did a test of "pure" Python and was pleasantly surprised:
from scipy.linalg import norm
%%timeit
m = data.shape[0]
dm = numpy.zeros(m * (m - 1) // 2, dtype=float)
mask = numpy.isfinite(data)
k = 0
for i in range(m - 1):
    for j in range(i + 1, m):
        # use only the dimensions that are finite in both vectors
        curr = numpy.logical_and(mask[i], mask[j])
        u = data[i][curr]
        v = data[j][curr]
        dm[k] = norm(u - v)
        k += 1
10000 loops, best of 3: 98.9 µs per loop
So I think the way forward is to wrap the above code in a function and Cythonize it.
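For reference, this is roughly the function I would then translate (the name masked_pdist is mine; in the .pyx version I would add static C types for m, i, j and k and a typed memoryview for data):
def masked_pdist(data):
    # condensed distance vector, using only the dimensions that are
    # finite in both vectors of each pair
    m = data.shape[0]
    dm = numpy.zeros(m * (m - 1) // 2, dtype=float)
    mask = numpy.isfinite(data)
    k = 0
    for i in range(m - 1):
        for j in range(i + 1, m):
            curr = numpy.logical_and(mask[i], mask[j])
            dm[k] = norm(data[i][curr] - data[j][curr])
            k += 1
    return dm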