Outer product in python seems quite slow when we have to deal with vectors of dimension of order 10k. Could someone please give me some idea how could I speed up this operation in python?
Code is as follows:
In [8]: a.shape
Out[8]: (128,)
In [9]: b.shape
Out[9]: (32000,)
In [10]: %timeit np.outer(b,a)
100 loops, best of 3: 15.4 ms per loop
Since I have to do this operation several times, my code is getting slower.
It doesn't really get any faster than that, these are your options:
numpy.outer
>>> %timeit np.outer(a,b)
100 loops, best of 3: 9.79 ms per loop
numpy.einsum
>>> %timeit np.einsum('i,j->ij', a, b)
100 loops, best of 3: 16.6 ms per loop
numba
from numba.decorators import autojit
@autojit
def outer_numba(a, b):
m = a.shape[0]
n = b.shape[0]
result = np.empty((m, n), dtype=np.float)
for i in range(m):
for j in range(n):
result[i, j] = a[i]*b[j]
return result
>>> %timeit outer_numba(a,b)
100 loops, best of 3: 9.77 ms per loop
parakeet
from parakeet import jit
@jit
def outer_parakeet(a, b):
... same as numba
>>> %timeit outer_parakeet(a, b)
100 loops, best of 3: 11.6 ms per loop
cython
cimport numpy as np
import numpy as np
cimport cython
ctypedef np.float64_t DTYPE_t
@cython.boundscheck(False)
@cython.wraparound(False)
def outer_cython(np.ndarray[DTYPE_t, ndim=1] a, np.ndarray[DTYPE_t, ndim=1] b):
cdef int m = a.shape[0]
cdef int n = b.shape[0]
cdef np.ndarray[DTYPE_t, ndim=2] result = np.empty((m, n), dtype=np.float64)
for i in range(m):
for j in range(n):
result[i, j] = a[i]*b[j]
return result
>>> %timeit outer_cython(a, b)
100 loops, best of 3: 10.1 ms per loop
theano
from theano import tensor as T
from theano import function
x = T.vector()
y = T.vector()
outer_theano = function([x, y], T.outer(x, y))
>>> %timeit outer_theano(a, b)
100 loops, best of 3: 17.4 ms per loop
pypy
# Same code as the `outer_numba` function
>>> timeit.timeit("outer_pypy(a,b)", number=100, setup="import numpy as np;a = np.random.rand(128,);b = np.random.rand(32000,);from test import outer_pypy;outer_pypy(a,b)")*1000 / 100.0
16.36 # ms
Conclusions:
╔═══════════╦═══════════╦═════════╗
║ method ║ time(ms)* ║ version ║
╠═══════════╬═══════════╬═════════╣
║ numba ║ 9.77 ║ 0.16.0 ║
║ np.outer ║ 9.79 ║ 1.9.1 ║
║ cython ║ 10.1 ║ 0.21.2 ║
║ parakeet ║ 11.6 ║ 0.23.2 ║
║ pypy ║ 16.36 ║ 2.4.0 ║
║ np.einsum ║ 16.6 ║ 1.9.1 ║
║ theano ║ 17.4 ║ 0.6.0 ║
╚═══════════╩═══════════╩═════════╝
* less time = faster
@elyase's answer is great, and rightly accepted. Here's an additional suggestion that, if you can use it, might make the call to np.outer
even faster.
You say "I have to do this operation several times", so it is possible that you can reuse the array that holds the outer product, instead of allocating a new one each time. That can give a nice boost in performance.
First, some random data to work with:
In [32]: a = np.random.randn(128)
In [33]: b = np.random.randn(32000)
Here's the baseline timing for np.outer(a, b) on my computer:
In [34]: %timeit np.outer(a, b)
100 loops, best of 3: 5.52 ms per loop
Suppose we're going to repeat that operation several times, with arrays of the same shape. Create an out
array to hold the result:
In [35]: out = np.empty((128, 32000))
Now use out
as the third argument of np.outer
:
In [36]: %timeit np.outer(a, b, out)
100 loops, best of 3: 2.38 ms per loop
So you get a nice performance boost if you can reuse the array that holds the outer product.
You get a similar benefit if you use the out
argument of einsum
, and in the cython function if you add a third argument for the output instead of allocating it in the function with np.empty
. (The other compiled/jitted codes in @elyase's answer will probably benefit from this, too, but I only tried the cython version.)
Nota bene! The benefit shown above might not be realized in practice. The out
array fits in the L3 cache of my CPU, and when it is used in the loop performed by the timeit
command, it likely remains in the cache. In practice, the array might be moved out of the cache between calls to np.outer
. In that case, the improvement isn't so dramatic, but it should still be at least the cost of a call to np.empty()
, i.e.
In [53]: %timeit np.empty((128, 32000))
1000 loops, best of 3: 1.29 ms per loop
It should be as simple as using numpy.outer()
: a single function call which will be implemented in C for high performance.