Consider a numpy array A of shape NxM. The goal is to compute the Euclidean distance matrix D, where each element D[i,j] is the Euclidean distance between rows i and j. What is the fastest way of doing this? This is not exactly the problem I need to solve, but it's a good example of what I'm trying to do (in general, other distance metrics could be used).
This is the fastest I could come up with so far:
```python
n = A.shape[0]
D = np.empty((n, n))
for i in range(n):
    D[i] = np.sqrt(np.square(A - A[i]).sum(1))
```
But is it the fastest way? I'm mainly concerned about the for loop. Can we beat this with, say, Cython?
To avoid looping, I tried to use broadcasting, and do something like this:
```python
D = np.sqrt(np.square(A[np.newaxis, :, :] - A[:, np.newaxis, :]).sum(2))
```
But it turned out to be a bad idea, because there's some overhead in constructing an intermediate 3D array of shape NxNxM, so the performance is worse.
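For reference, the loop-based and broadcasting approaches above compute the same matrix; a quick sanity check (array sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))  # N=50 rows, M=8 columns (arbitrary)

# Loop-based version: fills one row of D per iteration.
n = A.shape[0]
D_loop = np.empty((n, n))
for i in range(n):
    D_loop[i] = np.sqrt(np.square(A - A[i]).sum(1))

# Broadcasting version: builds an intermediate (N, N, M) array.
D_bcast = np.sqrt(np.square(A[np.newaxis, :, :] - A[:, np.newaxis, :]).sum(2))

print(np.allclose(D_loop, D_bcast))
```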
I tried Cython. But I am a newbie in Cython, so I don't know how good my attempt is:
```cython
import numpy as np
cimport numpy as np

def dist(np.ndarray[np.int32_t, ndim=2] A):
    cdef int n = A.shape[0]
    cdef np.ndarray[np.float64_t, ndim=2] dm = np.empty((n, n), dtype=np.float64)
    cdef int i = 0
    for i in range(n):
        dm[i] = np.sqrt(np.square(A - A[i]).sum(1)).astype(np.float64)
    return dm
```
The above code was a bit slower than Python's for loop. I don't know much about Cython, but I assume I could at least match the performance of the for loop + numpy. I am wondering whether a noticeable performance improvement is possible when it's done the right way, or whether there's some other way to speed this up (not involving parallel computation)?
The key thing with Cython is to avoid using Python objects and function calls as much as possible, including vectorized operations on numpy arrays. This usually means writing out all of the loops by hand and operating on single array elements at a time.
There's a very useful tutorial here that covers the process of converting numpy code to Cython and optimizing it.
Here's a quick stab at a more optimized Cython version of your distance function:
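The original code block was lost here; the following is a reconstruction in the spirit described above (explicit loops over single array elements, no NumPy calls inside the loop), a sketch rather than the author's exact code:

```cython
# fastdist.pyx -- optimized sketch: hand-written loops, C-level math only
import numpy as np
cimport numpy as np
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def dist(np.ndarray[np.int32_t, ndim=2] A):
    cdef int n = A.shape[0]
    cdef int m = A.shape[1]
    cdef np.ndarray[np.float64_t, ndim=2] D = np.empty((n, n), dtype=np.float64)
    cdef int i, j, k
    cdef double d, tmp
    for i in range(n):
        for j in range(n):
            d = 0.0
            for k in range(m):
                tmp = A[i, k] - A[j, k]
                d += tmp * tmp
            D[i, j] = sqrt(d)
    return D
```

Disabling bounds checking and wraparound lets the inner loop compile down to plain C array arithmetic, which is where the speed-up comes from.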
I saved this in a file called `fastdist.pyx`. We can use `pyximport` to simplify the build process:

So it works, at least. Let's do some benchmarking using the `%timeit` magic:

A ~9x speed-up is nice, but not really a game-changer. As you said, though, the big problem with the broadcasting approach is the memory requirements of constructing the intermediate array.
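The `pyximport` step mentioned above might look like the following sketch (the module name `fastdist` comes from the file above; the session itself is an assumption, not the author's original):

```python
import numpy as np
import pyximport

# Compile .pyx files on import; NumPy headers are needed for the buffer syntax.
pyximport.install(setup_args={"include_dirs": np.get_include()})

import fastdist  # builds fastdist.pyx on first import

A = np.random.randint(0, 100, size=(1000, 50)).astype(np.int32)
D = fastdist.dist(A)
```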
I wouldn't recommend trying that using broadcasting...
Another thing we could do is parallelize this over the outermost loop, using the `prange` function:

In order to compile the parallel version you'll need to tell the compiler to enable OpenMP. I haven't figured out how to do this using `pyximport`, but if you're using `gcc` you could compile it manually like this:

With parallelism, using 8 threads:
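The `prange` version mentioned above might look like this sketch (typed memoryviews are used so the loop can run `nogil`; the function name `dist_parallel` and the schedule are assumptions):

```cython
# fastdist.pyx (parallel sketch) -- requires OpenMP at compile time
import numpy as np
cimport cython
from cython.parallel cimport prange
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def dist_parallel(int[:, :] A):
    cdef Py_ssize_t n = A.shape[0]
    cdef Py_ssize_t m = A.shape[1]
    D_np = np.empty((n, n), dtype=np.float64)
    cdef double[:, :] D = D_np
    cdef Py_ssize_t i, j, k
    cdef double d, tmp
    # Each i-iteration is independent, so the outer loop can run in parallel.
    for i in prange(n, nogil=True, schedule='guided'):
        for j in range(n):
            d = 0.0
            for k in range(m):
                tmp = A[i, k] - A[j, k]
                d = d + tmp * tmp  # plain assignment keeps d thread-private
            D[i, j] = sqrt(d)
    return D_np
```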
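The manual `gcc` compile mentioned above might look like this (the include paths are assumptions; adjust for your system):

```shell
cython fastdist.pyx
gcc -shared -fPIC -O3 -fopenmp \
    -I"$(python -c 'import sysconfig; print(sysconfig.get_paths()["include"])')" \
    -I"$(python -c 'import numpy; print(numpy.get_include())')" \
    fastdist.c -o fastdist.so
```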