Since my np.dot is accelerated by OpenBLAS and OpenMPI, I am wondering whether it is possible to write the double sum
# for each fixed (k, l):
for i in range(N):
    for j in range(N):
        B[k, l] += A[i, j, k, l] * X[i, j]
as an inner product. At the moment I am using
B = np.einsum("ijkl,ij->kl",A,X)
but unfortunately it is quite slow and only uses one processor. Any ideas?
Edit: I benchmarked the answers given so far with a simple example; it seems they are all within the same order of magnitude:
A = np.random.random([200,200,100,100])
X = np.random.random([200,200])
def B1():
    return np.einsum("ijkl,ij->kl", A, X)

def B2():
    return np.tensordot(A, X, [[0, 1], [0, 1]])

def B3():
    shp = A.shape
    return np.dot(X.ravel(), A.reshape(shp[0]*shp[1], -1)).reshape(shp[2], shp[3])
%timeit B1()
%timeit B2()
%timeit B3()
1 loops, best of 3: 300 ms per loop
10 loops, best of 3: 149 ms per loop
10 loops, best of 3: 150 ms per loop
Concluding from these results, I would choose np.einsum, since its syntax is still the most readable and the improvement with the other two is only a factor of about 2x. I guess the next step would be to externalize the code into C or Fortran.
You can use np.dot like so:
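A minimal sketch of this approach, assuming the same flatten-and-reshape trick as B3 above (so the contraction is dispatched to the multithreaded BLAS behind np.dot):

shp = A.shape
# flatten the summed axes (i, j) of both arrays, do a 1D x 2D dot, then restore (k, l)
B = np.dot(X.ravel(), A.reshape(shp[0] * shp[1], -1)).reshape(shp[2], shp[3])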
You can use np.tensordot(), which does use multiple cores:
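A minimal sketch, assuming the same contraction over the first two axes as in the question:

# same result as np.einsum("ijkl,ij->kl", A, X), but performed via a BLAS matrix product
B = np.tensordot(A, X, axes=[[0, 1], [0, 1]])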
EDIT: it is interesting to see how np.einsum and np.tensordot scale when increasing the size of the input arrays; the advantage of using tensordot for larger arrays becomes clear.

I find that tensordot outperforms einsum by a lot for some operations. I am using numpy from Anaconda with Accelerate and Intel's Math Kernel Library (MKL) installed. Consider what happens when the second matrix has an extra dimension that is not summed over:
A : (200, 200, 100, 100)
X : (200, 200)
Y : (200, 200, 100)
The operation I'm doing here is as follows:
A X ---> (100, 100)
A Y ---> (100, 100, 100)
The A Y operation basically has to do 100 of the A X operations and store each of them. Tensordot handles this setting remarkably well (the two calls are sketched below): doing 100 times the number of operations only took about 3 times longer in my timings. Stop and think about this for a second. I know MKL does some fancy use of CPU cache to avoid transferring data from RAM. I suspect it is reusing blocks of A for the extra 100 arrays of Y.
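For reference, a minimal sketch of the two calls being compared (shapes as listed above):

import numpy as np

A = np.random.random((200, 200, 100, 100))
X = np.random.random((200, 200))
Y = np.random.random((200, 200, 100))

# A X: both axes of X are summed over -> (100, 100)
BX = np.tensordot(A, X, axes=[[0, 1], [0, 1]])

# A Y: the last axis of Y is not summed over -> (100, 100, 100)
BY = np.tensordot(A, Y, axes=[[0, 1], [0, 1]])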
Using einsum results in something more expected, given that we have 100x more operations to do:
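The einsum equivalents, with the same A, X, Y as in the sketch above (the subscripts are assumed from the shapes):

# A X -> (100, 100): both axes of X are contracted away
BX = np.einsum("ijkl,ij->kl", A, X)

# A Y -> (100, 100, 100): the extra axis m of Y survives in the output
BY = np.einsum("ijkl,ijm->klm", A, Y)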
It seems that einsum does very well when one of the argument arrays has all of its dimensions summed over. Using tensordot has huge performance gains when some dimensions are not summed over (analogous to np.outer but with multi-dimensional arrays).

Here is another example:
For the array operation:
50x1000x1000 X 50x1000x1000 -> 50x50
Using tensordot gives me 6 GFLOPS compared to 0.2 GFLOPS with einsum.
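A sketch of that comparison, including the operation count I use to get a FLOP rate (which axes are contracted is an assumption, chosen to match the 50x50 output):

import time
import numpy as np

A = np.random.random((50, 1000, 1000))
B = np.random.random((50, 1000, 1000))

# contract the two 1000-length axes of each array -> (50, 50)
t0 = time.time()
C = np.tensordot(A, B, axes=[[1, 2], [1, 2]])
dt = time.time() - t0

# einsum equivalent of the same contraction:
# C = np.einsum("ijk,ljk->il", A, B)

# each of the 50*50 output elements needs 1000*1000 multiply-adds
flops = 2.0 * 50 * 50 * 1000 * 1000
print("GFLOP/s: %.2f" % (flops / dt / 1e9))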
I think an important point is that modern machines should be able to get into the 5-50 GFLOP/s range for large arrays. If you count operations and get less than that, check to see what library you're using.