Sometime back this question
(now deleted, but 10K+ rep users can still view it) was posted. It looked interesting to me, I learnt something new while trying to solve it, and I thought that was worth sharing. I would like to post those ideas/solutions and would love to see people post other possible ways to solve it. The gist of the question follows.
So, we have two NumPy ndarrays a and b of shapes:

    a : (m, n, N)
    b : (n, m, N)

Let's assume we are dealing with cases where m, n and N are comparable.
The problem is to solve the following multiplication and summation with focus on performance :
    import numpy as np

    def all_loopy(a, b):
        P, Q, N = a.shape
        d = np.zeros(N)
        for i in range(N):
            for j in range(i):
                for k in range(P):
                    for n in range(Q):
                        d[i] += a[k, n, i] * b[n, k, j]
        return d
I learnt a few things along the way while trying to find vectorized and faster ways to solve it.
1) First off, there is a dependency of iterators at "for j in range(i)". From my previous experience, especially with trying to solve such problems on MATLAB, it appeared that such a dependency could be taken care of with a lower triangular matrix, so np.tril should work there. Thus, a fully vectorized but not so memory-efficient solution (as it creates an intermediate (N,N) shaped array before finally reducing to an (N,) shaped array) would be -

2) The next trick/idea was to keep one loop for the iterator
i in "for i in range(N)", but insert that dependency with indexing and use np.einsum to perform all those multiplications and summations. The advantage would be memory efficiency. The implementation would look like this -

There are two more obvious ways to solve it. So, if we start working from the original
all_loopy
solution, one could keep the outer two loops and use np.einsum or np.tensordot to perform those operations and thus remove the inner two loops, like so -

Runtime test
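The code blocks for the approaches described above were not preserved here, so the following is a hedged reconstruction: the bodies are my own sketches of each described idea, and the two-loop function names are my guesses (einsum_oneloop and fully_vectorized are the names used in the discussion below).

```python
import numpy as np

def fully_vectorized(a, b):
    # Idea 1: compute all pairwise sums S[i, j] = sum over k, n of
    # a[k, n, i] * b[n, k, j] at once, then apply the j < i dependency with a
    # strictly lower triangular mask. Builds an intermediate (N, N) array,
    # hence the higher memory footprint.
    S = np.einsum('ija,jib->ab', a, b)
    return np.tril(S, k=-1).sum(axis=1)

def einsum_oneloop(a, b):
    # Idea 2: keep the loop over i, fold the j < i dependency into a slice,
    # and let einsum do the multiply-sum over the first two axes.
    N = a.shape[2]
    d = np.zeros(N)
    for i in range(N):
        d[i] = np.einsum('ij,ji->', a[:, :, i], b[:, :, :i].sum(axis=2))
    return d

def einsum_twoloop(a, b):
    # Idea 3a: keep the outer two loops, remove the inner two with einsum.
    N = a.shape[2]
    d = np.zeros(N)
    for i in range(N):
        for j in range(i):
            d[i] += np.einsum('ij,ji->', a[:, :, i], b[:, :, j])
    return d

def tensordot_twoloop(a, b):
    # Idea 3b: same as 3a, but with tensordot for the inner contraction.
    N = a.shape[2]
    d = np.zeros(N)
    for i in range(N):
        for j in range(i):
            d[i] += np.tensordot(a[:, :, i], b[:, :, j], axes=([0, 1], [1, 0]))
    return d
```

All four agree with the quadruple-loop reference on random inputs; they differ only in how much of the work is pushed into compiled NumPy routines versus Python-level loops.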
Let's compare all five approaches posted thus far to solve the problem, including the one posted in the question.
Case #1 :
Case #2 :
Case #3 (Leaving out all_loopy for being very slow) :
Going by the numbers, einsum_oneloop looks pretty good to me, whereas fully_vectorized could be used when dealing with small to decent-sized arrays!

I'm not sure if you want it to be all-NumPy, but I've always used numba for slow but easy-to-implement loop-based functions. The speedup for loop-intensive tasks is amazing. First I just numba.njit-ed your all_loopy variant, which already gave me competitive results. Then I tested it against your 100, 100, 100 case:
Apart from noticing that my computer is much slower than yours, the numba version is getting slower. What happened?

NumPy uses compiled routines and, depending on the compiler options, will probably optimize the looping, while numba doesn't. So the next logical step is to optimize the looping. Assuming C-contiguous arrays, the innermost loops should run over the last axis of the arrays. It's the fastest-changing axis, so cache locality will be better.
So what are the timings of this "optimized" numba function? Can it compete with the others or even beat them?

So it's slightly faster for small matrices. What about big ones:
So we have an amazing speedup for large arrays. This function is now 4-5 times faster than your vectorized approaches and the result is the same.
But, amazingly, the ordering seems somehow dependent on the computer, because fully_vectorized is fastest here whereas the einsum options are faster on @Divakar's machine. So it might be an open question whether these results are really that much faster.

Just for fun, I tried it with n=m=N=200: