How to find the pairwise differences between rows

2020-01-29 18:09发布

问题:

Given two matrices, I want to compute the pairwise differences between all rows. Each matrix has 1000 rows and 100 columns so they are fairly large. I tried using a for loop and pure broadcasting but the for loop seem to be working faster. Am I doing something wrong? Here is the code:

from numpy import *
A = random.randn(1000,100)
B = random.randn(1000,100)

start = time.time()
for a in A:
   sum((a - B)**2,1)
print time.time() - start

# pure broadcasting
start = time.time()
((A[:,newaxis,:] - B)**2).sum(-1)
print time.time() - start

The broadcasting method takes about 1 second longer and it's even longer for large matrices. Any idea how to speed this up purely using numpy?

回答1:

Here's another way to perform :

(a-b)^2 = a^2 + b^2 - 2ab

with np.einsum for the first two terms and dot-product for the third one -

import numpy as np

np.einsum('ij,ij->i',A,A)[:,None] + np.einsum('ij,ij->i',B,B) - 2*np.dot(A,B.T)

Runtime test

Approaches -

def loopy_app(A,B):
    m,n = A.shape[0], B.shape[0]
    out = np.empty((m,n))
    for i,a in enumerate(A):
       out[i] = np.sum((a - B)**2,1)
    return out

def broadcasting_app(A,B):
    return ((A[:,np.newaxis,:] - B)**2).sum(-1)

# @Paul Panzer's soln
def outer_sum_dot_app(A,B):
    return np.add.outer((A*A).sum(axis=-1), (B*B).sum(axis=-1)) - 2*np.dot(A,B.T)

# @Daniel Forsman's soln
def einsum_all_app(A,B):
    return np.einsum('ijk,ijk->ij', A[:,None,:] - B[None,:,:], \
                                        A[:,None,:] - B[None,:,:])

# Proposed in this post
def outer_einsum_dot_app(A,B):
    return np.einsum('ij,ij->i',A,A)[:,None] + np.einsum('ij,ij->i',B,B) - \
                                                            2*np.dot(A,B.T)

Timings -

In [51]: A = np.random.randn(1000,100)
    ...: B = np.random.randn(1000,100)
    ...: 

In [52]: %timeit loopy_app(A,B)
    ...: %timeit broadcasting_app(A,B)
    ...: %timeit outer_sum_dot_app(A,B)
    ...: %timeit einsum_all_app(A,B)
    ...: %timeit outer_einsum_dot_app(A,B)
    ...: 
10 loops, best of 3: 136 ms per loop
1 loops, best of 3: 302 ms per loop
100 loops, best of 3: 8.51 ms per loop
1 loops, best of 3: 341 ms per loop
100 loops, best of 3: 8.38 ms per loop


回答2:

Here is a solution which avoids both the loop and the large intermediates:

from numpy import *
import time

A = random.randn(1000,100)
B = random.randn(1000,100)

start = time.time()
for a in A:
   sum((a - B)**2,1)
print time.time() - start

# pure broadcasting
start = time.time()
((A[:,newaxis,:] - B)**2).sum(-1)
print time.time() - start

#matmul
start = time.time()
add.outer((A*A).sum(axis=-1), (B*B).sum(axis=-1)) - 2*dot(A,B.T)
print time.time() - start

Prints:

0.546781778336
0.674743175507
0.10723400116


回答3:

Another job for np.einsum

np.einsum('ijk,ijk->ij', A[:,None,:] - B[None,:,:], A[:,None,:] - B[None,:,:])


回答4:

Similar to @paul-panzer, a general way to compute pairwise differences of arrays of arbitrary dimension is broadcasting as follows:

Let v be a NumPy array of size (n, d):

import numpy as np
v_tiled_across = np.tile(v[:, np.newaxis, :], (1, v.shape[0], 1))
v_tiled_down = np.tile(v[np.newaxis, :, :], (v.shape[0], 1, 1))
result = v_tiled_across - v_tiled_down

For a better picture what's happening, imagine each d-dimensional row of v being propped up like a flagpole, and copied across and down. Now when you do component-wise subtraction, you're getting each pairwise combination.

__

There's also scipy.spatial.distance.pdist, which computes a metric in a pairwise fashion.

from scipy.spatial.distance import pdist, squareform
pairwise_L2_norms = squareform(pdist(v))