Given two matrices, I want to compute the pairwise differences between all rows. Each matrix has 1000 rows and 100 columns so they are fairly large. I tried using a for loop and pure broadcasting but the for loop seem to be working faster. Am I doing something wrong? Here is the code:
from numpy import *
A = random.randn(1000,100)
B = random.randn(1000,100)
start = time.time()
for a in A:
sum((a - B)**2,1)
print time.time() - start
# pure broadcasting
start = time.time()
((A[:,newaxis,:] - B)**2).sum(-1)
print time.time() - start
The broadcasting method takes about 1 second longer and it's even longer for large matrices. Any idea how to speed this up purely using numpy?
Here's another way to perform :
with
np.einsum
for the first two terms anddot-product
for the third one -Runtime test
Approaches -
Timings -
Another job for
np.einsum
Here is a solution which avoids both the loop and the large intermediates:
Prints:
Similar to @paul-panzer, a general way to compute pairwise differences of arrays of arbitrary dimension is broadcasting as follows:
Let v be a NumPy array of size (n, d):
For a better picture what's happening, imagine each d-dimensional row of v being propped up like a flagpole, and copied across and down. Now when you do component-wise subtraction, you're getting each pairwise combination.
__
There's also scipy.spatial.distance.pdist, which computes a metric in a pairwise fashion.