Write double (triple) sum as inner product?

Posted 2019-01-26 10:49

Since my np.dot is accelerated by OpenBLAS and OpenMPI, I am wondering whether there is a way to write the double sum

for i in range(N):
    for j in range(N):
        B[k, l] += A[i, j, k, l] * X[i, j]

as an inner product. Right at the moment I am using

B = np.einsum("ijkl,ij->kl",A,X)

but unfortunately it is quite slow and only uses one processor. Any ideas?

Edit: I benchmarked the answers given so far with a simple example; they all seem to be in the same order of magnitude:

A = np.random.random([200,200,100,100])
X = np.random.random([200,200])
def B1():
    return np.einsum("ijkl,ij->kl", A, X)
def B2():
    return np.tensordot(A, X, [[0, 1], [0, 1]])
def B3():
    shp = A.shape
    return np.dot(X.ravel(), A.reshape(shp[0]*shp[1], -1)).reshape(shp[2], shp[3])

%timeit B1()
%timeit B2()
%timeit B3()

1 loops, best of 3: 300 ms per loop
10 loops, best of 3: 149 ms per loop
10 loops, best of 3: 150 ms per loop

Concluding from these results, I would choose np.einsum, since its syntax is still the most readable and the improvement from the other two is only a factor of ~2x. I guess the next step would be to move the code into C or Fortran.
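For completeness, a quick sanity check (a sketch on smaller arrays, with the reshape target written as `-1`) confirms that all three benchmarked formulations compute the same result:

```python
import numpy as np

# Smaller arrays than in the benchmark, just to verify correctness quickly.
A = np.random.random([20, 20, 10, 10])
X = np.random.random([20, 20])

res_es = np.einsum("ijkl,ij->kl", A, X)
res_td = np.tensordot(A, X, [[0, 1], [0, 1]])
shp = A.shape
res_dot = np.dot(X.ravel(), A.reshape(shp[0] * shp[1], -1)).reshape(shp[2], shp[3])

print(np.allclose(res_es, res_td), np.allclose(res_es, res_dot))  # True True
```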

3 Answers

甜甜的少女心 · 2019-01-26 11:30

You can use np.dot like so -

shp = A.shape
B_dot = np.dot(X.ravel(),A.reshape(shp[0]*shp[1],-1)).reshape(shp[2],shp[3])
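As a sketch of why this works: with NumPy's default C-order layout, `A.reshape(shp[0]*shp[1], -1)` stacks the (i, j) pairs as rows, so the matrix-vector product with `X.ravel()` performs exactly the double sum from the question:

```python
import numpy as np

A = np.random.random([3, 4, 2, 2])
X = np.random.random([3, 4])

shp = A.shape
B_dot = np.dot(X.ravel(), A.reshape(shp[0] * shp[1], -1)).reshape(shp[2], shp[3])

# Explicit double loop from the question, for comparison.
B_loop = np.zeros((shp[2], shp[3]))
for i in range(shp[0]):
    for j in range(shp[1]):
        B_loop += A[i, j] * X[i, j]

print(np.allclose(B_dot, B_loop))  # True
```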
爷的心禁止访问 · 2019-01-26 11:34

You can use np.tensordot():

np.tensordot(A, X, [[0,1], [0, 1]])

which does use multiple cores.
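Under the hood, `np.tensordot` transposes and reshapes the operands into 2-D arrays and then calls `np.dot`, which is presumably why it inherits the multithreaded BLAS. A small sketch of the axes argument (the two lists pair axis 0 of A with axis 0 of X, and axis 1 of A with axis 1 of X):

```python
import numpy as np

A = np.random.random([5, 6, 3, 4])
X = np.random.random([5, 6])

# [[0, 1], [0, 1]]: contract A's axes 0 and 1 against X's axes 0 and 1;
# the remaining axes of A (sizes 3 and 4) form the output.
B_td = np.tensordot(A, X, [[0, 1], [0, 1]])
B_es = np.einsum("ijkl,ij->kl", A, X)
print(B_td.shape, np.allclose(B_td, B_es))  # (3, 4) True
```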


EDIT: it is interesting to see how np.einsum and np.tensordot scale as the input arrays grow:

In [18]: for n in range(1, 31):
   ....:     A = np.random.rand(n, n+1, n+2, n+3)
   ....:     X = np.random.rand(n, n+1)
   ....:     print(n)
   ....:     %timeit np.einsum('ijkl,ij->kl', A, X)
   ....:     %timeit np.tensordot(A, X, [[0, 1], [0, 1]])
   ....:
1
1000000 loops, best of 3: 1.55 µs per loop
100000 loops, best of 3: 8.36 µs per loop
...
11
100000 loops, best of 3: 15.9 µs per loop
100000 loops, best of 3: 17.2 µs per loop
12
10000 loops, best of 3: 23.6 µs per loop
100000 loops, best of 3: 18.9 µs per loop
...
21
10000 loops, best of 3: 153 µs per loop
10000 loops, best of 3: 44.4 µs per loop

and the advantage of using tensordot for larger arrays becomes clear.

贪生不怕死 · 2019-01-26 11:44

I find that tensordot outperforms einsum by a lot for some operations. I am using NumPy from Anaconda with Accelerate, and Intel's Math Kernel Library (MKL) installed. Consider what happens when the second array has an extra dimension that is not summed over:

In [39]: A = np.random.random([200, 200, 100, 100])

In [40]: X = np.random.random([200, 200])

In [41]: Y = np.random.random([200, 200, 100])

A : (200, 200, 100, 100)
X : (200, 200)
Y : (200, 200, 100)

The operation I'm doing here is as follows:

A X ---> (100, 100)

A Y ---> (100, 100, 100)

The A Y operation basically has to do 100 of the A X operations and store each of them. Here is how tensordot performs in this setting:

In [42]: %timeit tensordot(A, X, [(0,1), (0,1)])
1 loops, best of 3: 477 ms per loop

In [43]: %timeit tensordot(A, Y, [(0,1), (0,1)])
1 loops, best of 3: 1.35 s per loop

Stop and think about this for a second: in In [43] we did 100 times the number of operations, yet it only took about 3 times longer. I know MKL makes clever use of the CPU cache to avoid transfers from RAM; I suspect it is reusing blocks of A across the extra 100 slices of Y.

Using einsum gives timings more in line with what you would expect, given that there are 100x more operations to do:

In [44]: %timeit einsum('ijkl,ij->kl', A, X)
1 loops, best of 3: 962 ms per loop

In [45]: %timeit einsum('ijkl,ijm->klm', A, Y)
1 loops, best of 3: 1min 45s per loop

It seems that einsum does very well when one of the argument arrays has all of its dimensions summed over. Using tensordot gives huge performance gains when some dimensions are not summed over (analogous to np.outer, but with multi-dimensional arrays).
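The A Y case above can be checked on small arrays (a sketch; the two formulations produce identical results, just at very different speeds):

```python
import numpy as np

# Small shapes so the einsum version also finishes instantly.
A = np.random.random([10, 10, 6, 7])
Y = np.random.random([10, 10, 5])

# tensordot keeps A's free axes (6, 7) first, then Y's free axis (5).
out_td = np.tensordot(A, Y, [(0, 1), (0, 1)])
out_es = np.einsum("ijkl,ijm->klm", A, Y)
print(out_td.shape, np.allclose(out_td, out_es))  # (6, 7, 5) True
```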

Here is another example:


For the array operation:

50x1000x1000 X 50x1000x1000 -> 50x50

Using tensordot gives me 6 GFLOPS compared to 0.2 GFLOPS with einsum.

I think an important point is that modern machines should be able to reach the 5-50 GFLOP/s range for large arrays. If you count the operations and get less than that, check which library you're using.
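Counting operations yourself is straightforward (a rough sketch; each multiply-add pair counts as 2 FLOPs, and I use a 50x200x200 contraction here, smaller than the 50x1000x1000 one above, so it runs quickly):

```python
import time
import numpy as np

A = np.random.random([50, 200, 200])
B = np.random.random([50, 200, 200])

# Contraction (50, 200, 200) x (50, 200, 200) -> (50, 50): each of the
# 50*50 output entries needs 200*200 multiply-adds, i.e. 2 FLOPs each.
flops = 2 * 50 * 50 * 200 * 200

t0 = time.perf_counter()
C = np.tensordot(A, B, [(1, 2), (1, 2)])
dt = time.perf_counter() - t0
print(C.shape, flops / dt / 1e9, "GFLOP/s")
```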
