Since my np.dot
is accelerated by OpenBlas and Openmpi I am wondering if there was a possibility to write the double sum
for i in range(N):
for j in range(N):
B[k,l] += A[i,j,k,l] * X[i,j]
as an inner product. Right at the moment I am using
B = np.einsum("ijkl,ij->kl",A,X)
but unfortunately it is quite slow and only uses one processor.
Any ideas?
Edit:
I benchmarked the answers given until now with a simple example, seems like they are all in the same order of magnitude:
A = np.random.random([200,200,100,100])
X = np.random.random([200,200])
def B1():
return es("ijkl,ij->kl",A,X)
def B2():
return np.tensordot(A, X, [[0,1], [0, 1]])
def B3():
shp = A.shape
return np.dot(X.ravel(),A.reshape(shp[0]*shp[1],1)).reshape(shp[2],shp[3])
%timeit B1()
%timeit B2()
%timeit B3()
1 loops, best of 3: 300 ms per loop
10 loops, best of 3: 149 ms per loop
10 loops, best of 3: 150 ms per loop
Concluding from these results I would choose np.einsum, since its syntax is still the most readable and the improvement with the other two are only a factor 2x. I guess the next step would be to externalize the code into C or fortran.
You can use np.tensordot()
:
np.tensordot(A, X, [[0,1], [0, 1]])
which does use multiple cores.
EDIT: it is insteresting to see how np.einsum
and np.tensordot
scale when increasing the size of the input arrays:
In [18]: for n in range(1, 31):
....: A = np.random.rand(n, n+1, n+2, n+3)
....: X = np.random.rand(n, n+1)
....: print(n)
....: %timeit np.einsum('ijkl,ij->kl', A, X)
....: %timeit np.tensordot(A, X, [[0, 1], [0, 1]])
....:
1
1000000 loops, best of 3: 1.55 µs per loop
100000 loops, best of 3: 8.36 µs per loop
...
11
100000 loops, best of 3: 15.9 µs per loop
100000 loops, best of 3: 17.2 µs per loop
12
10000 loops, best of 3: 23.6 µs per loop
100000 loops, best of 3: 18.9 µs per loop
...
21
10000 loops, best of 3: 153 µs per loop
10000 loops, best of 3: 44.4 µs per loop
and it becomes clear the advantage of using tensordot
for larger arrays.
You can use np.dot
like so -
shp = A.shape
B_dot = np.dot(X.ravel(),A.reshape(shp[0]*shp[1],-1)).reshape(shp[2],shp[3])
I find that tensordot outperforms einsum by a lot for some operations. I am using numpy from Anaconda with accelerate, and Intel's math kernel library (MKL) installed. Consider what happens when the second matrix has an extra dimension that is not summed over:
In [39]: A = np.random.random([200, 200, 100, 100])
In [40]: X = np.random.random([200, 200])
In [41]: Y = np.random.random([200, 200, 100])
A : (200, 200, 100, 100)
X : (200, 200)
Y : (200, 200, 100)
The operation I'm doing here is as follows:
A X ---> (100, 100)
A Y ---> (100, 100, 100)
The A Y
operation basically has to do 100 of the A X
operations and store each of them. Here is how tensor dot performs in this setting:
In [42]: %timeit tensordot(A, X, [(0,1), (0,1)])
1 loops, best of 3: 477 ms per loop
In [43]: %timeit tensordot(A, Y, [(0,1), (0,1)])
1 loops, best of 3: 1.35 s per loop
Stop and think about this for a second. In line [43] we just did 100 times the number of operations and it only took 3 times longer. I know MKL does some fancy use of CPU cache to avoid transferring from RAM. I suspect its reusing blocks of A for the extra 100 arrays of Y.
Using Einsum results in something more expected given that we have 100x more operations to do:
In [44]: %timeit einsum('ijkl,ij->kl', A, X)
1 loops, best of 3: 962 ms per loop
In [45]: %timeit einsum('ijkl,ijm->klm', A, Y)
1 loops, best of 3: 1min 45s per loop
It seems that einsum does very well when one of the argument arrays has all of its dimensions summed over. Using tensordot
has huge performance gains when some dimensions are not summed over (analogous to np.outer
but with multi-dimensional arrays).
Here is another example:
For the array operation:
50x1000x1000 X 50x1000x1000 -> 50x50
Using tensordot gives me 6 GFLOPS compared to 0.2 GFLOPS with einsum.
I think an important point is that Modern machines should be able to get in the 5-50 GFLOP range for large arrays. If you count operations and get less than that, check to see what library your using.