Question:
I've usually gotten good performance out of numpy's einsum function (and I like its syntax). @Ophion's answer to this question shows that - for the cases tested - einsum consistently outperforms the "built-in" functions (sometimes by a little, sometimes by a lot). But I just encountered a case where einsum is much slower. Consider the following equivalent functions:
(M, K) = (1000000, 20)
C = np.random.rand(K, K)
X = np.random.rand(M, K)

def func_dot(C, X):
    Y = X.dot(C)
    return np.sum(Y * X, axis=1)

def func_einsum(C, X):
    return np.einsum('ik,km,im->i', X, C, X)

def func_einsum2(C, X):
    # Like func_einsum but break it into two steps.
    A = np.einsum('ik,km', X, C)
    return np.einsum('ik,ik->i', A, X)
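(For reference, a quick check that the three functions agree numerically; this is my addition, not part of the original question, and uses a smaller M so it runs fast.)

```python
import numpy as np

# Smaller M than the question's (1000000, 20) so the check is quick.
M, K = 1000, 20
C = np.random.rand(K, K)
X = np.random.rand(M, K)

def func_dot(C, X):
    Y = X.dot(C)
    return np.sum(Y * X, axis=1)

def func_einsum(C, X):
    return np.einsum('ik,km,im->i', X, C, X)

def func_einsum2(C, X):
    # Like func_einsum but broken into two steps.
    A = np.einsum('ik,km', X, C)
    return np.einsum('ik,ik->i', A, X)

print(np.allclose(func_dot(C, X), func_einsum(C, X)))   # True
print(np.allclose(func_dot(C, X), func_einsum2(C, X)))  # True
```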
I expected func_einsum to run fastest, but that is not what I see. Running on a quad-core CPU with hyperthreading, numpy version 1.9.0.dev-7ae0206, and multithreading with OpenBLAS, I get the following results:
In [2]: %time y1 = func_dot(C, X)
CPU times: user 320 ms, sys: 312 ms, total: 632 ms
Wall time: 209 ms
In [3]: %time y2 = func_einsum(C, X)
CPU times: user 844 ms, sys: 0 ns, total: 844 ms
Wall time: 842 ms
In [4]: %time y3 = func_einsum2(C, X)
CPU times: user 292 ms, sys: 44 ms, total: 336 ms
Wall time: 334 ms
When I increase K to 200, the differences are more extreme:
In [2]: %time y1= func_dot(C, X)
CPU times: user 4.5 s, sys: 1.02 s, total: 5.52 s
Wall time: 2.3 s
In [3]: %time y2= func_einsum(C, X)
CPU times: user 1min 16s, sys: 44 ms, total: 1min 16s
Wall time: 1min 16s
In [4]: %time y3 = func_einsum2(C, X)
CPU times: user 15.3 s, sys: 312 ms, total: 15.6 s
Wall time: 15.6 s
Can someone explain why einsum is so much slower here?
If it matters, here is my numpy config:
In [6]: np.show_config()
lapack_info:
libraries = ['openblas']
library_dirs = ['/usr/local/lib']
language = f77
atlas_threads_info:
libraries = ['openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('ATLAS_WITHOUT_LAPACK', None)]
language = c
include_dirs = ['/usr/local/include']
blas_opt_info:
libraries = ['openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('ATLAS_INFO', '"\\"None\\""')]
language = c
include_dirs = ['/usr/local/include']
atlas_blas_threads_info:
libraries = ['openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('ATLAS_INFO', '"\\"None\\""')]
language = c
include_dirs = ['/usr/local/include']
lapack_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('ATLAS_WITHOUT_LAPACK', None)]
language = f77
include_dirs = ['/usr/local/include']
lapack_mkl_info:
NOT AVAILABLE
blas_mkl_info:
NOT AVAILABLE
mkl_info:
NOT AVAILABLE
Answer 1:
You can have the best of both worlds:
def func_dot_einsum(C, X):
    Y = X.dot(C)
    return np.einsum('ij,ij->i', Y, X)
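(A quick check, my addition rather than part of the original answer, that this hybrid matches the question's func_dot:)

```python
import numpy as np

# Smaller M than the question's setup so this runs quickly.
M, K = 1000, 20
C = np.random.rand(K, K)
X = np.random.rand(M, K)

def func_dot(C, X):
    Y = X.dot(C)
    return np.sum(Y * X, axis=1)

def func_dot_einsum(C, X):
    # BLAS-backed matmul does the heavy lifting; einsum does only the
    # row-wise reduction, avoiding the Y * X temporary array.
    Y = X.dot(C)
    return np.einsum('ij,ij->i', Y, X)

print(np.allclose(func_dot(C, X), func_dot_einsum(C, X)))  # True
```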
On my system:
In [7]: %timeit func_dot(C, X)
10 loops, best of 3: 31.1 ms per loop
In [8]: %timeit func_einsum(C, X)
10 loops, best of 3: 105 ms per loop
In [9]: %timeit func_einsum2(C, X)
10 loops, best of 3: 43.5 ms per loop
In [10]: %timeit func_dot_einsum(C, X)
10 loops, best of 3: 21 ms per loop
When available, np.dot uses BLAS, MKL, or whatever library you have. So the call to np.dot is almost certainly being multithreaded. np.einsum has its own loops, so it doesn't use any of those optimizations, apart from its own use of SIMD to speed things up over a vanilla C implementation.
Then there's the multi-input einsum call that runs much slower... The numpy source for einsum is very complex and I don't fully understand it. So be advised that the following is speculative at best, but here's what I think is going on...
When you run something like np.einsum('ij,ij->i', a, b), the benefit over doing np.sum(a * b, axis=1) comes from avoiding having to instantiate the intermediate array with all the products, and looping twice over it. So at the low level what goes on is something like:
for i in range(I):
    out[i] = 0
    for j in range(J):
        out[i] += a[i, j] * b[i, j]
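(To make that concrete, here is the loop as runnable Python, checked against einsum itself; small sizes, my addition:)

```python
import numpy as np

I, J = 4, 6
a = np.random.rand(I, J)
b = np.random.rand(I, J)

# The explicit sum-of-products loop from above.
out = np.zeros(I)
for i in range(I):
    for j in range(J):
        out[i] += a[i, j] * b[i, j]

print(np.allclose(out, np.einsum('ij,ij->i', a, b)))  # True
```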
Say now that you are after something like:
np.einsum('ij,jk,ik->i', a, b, c)
You could do the same operation as
np.sum(a[:, :, None] * b[None, :, :] * c[:, None, :], axis=(1, 2))
And what I think einsum does is to run this last code without having to instantiate the huge intermediate array, which certainly makes things faster:
In [29]: a, b, c = np.random.rand(3, 100, 100)
In [30]: %timeit np.einsum('ij,jk,ik->i', a, b, c)
100 loops, best of 3: 2.41 ms per loop
In [31]: %timeit np.sum(a[:, :, None] * b[None, :, :] * c[:, None, :], axis=(1, 2))
100 loops, best of 3: 12.3 ms per loop
But if you look at it carefully, getting rid of intermediate storage can be a terrible thing. This is what I think einsum is doing at the low level:
for i in range(I):
    out[i] = 0
    for j in range(J):
        for k in range(K):
            out[i] += a[i, j] * b[j, k] * c[i, k]
But you are repeating a ton of operations! If you instead did:
for i in range(I):
    out[i] = 0
    for j in range(J):
        temp = 0
        for k in range(K):
            temp += b[j, k] * c[i, k]
        out[i] += a[i, j] * temp
you would be doing I * J * (K-1) fewer multiplications (and I * J extra additions), and save yourself a ton of time. My guess is that einsum is not smart enough to optimize things at this level. In the source code there is a hint that it only optimizes operations with 1 or 2 operands, not 3. In any case, automating this for general inputs seems like anything but simple...
Answer 2:
einsum has a specialized case for '2 operands, ndim=2'. In this case there are 3 operands and a total of 3 dimensions, so it has to use a general nditer.
While trying to understand how the string input is parsed, I wrote a pure Python einsum simulator, https://github.com/hpaulj/numpy-einsum/blob/master/einsum_py.py
The (stripped down) einsum and sum-of-products functions are:
def myeinsum(subscripts, *ops, **kwargs):
    # drop-in replacement for np.einsum (more or less)
    <parse subscript strings>
    <prepare op_axes>
    x = sum_of_prod(ops, op_axes, **kwargs)
    return x

def sum_of_prod(ops, op_axes, ...):
    ...
    it = np.nditer(ops, flags, op_flags, op_axes)
    it.operands[nop][...] = 0
    it.reset()
    for (x, y, z, w) in it:
        w[...] += x * y * z
    return it.operands[nop]
Debugging output for myeinsum('ik,km,im->i', X, C, X, debug=True) with (M, K) = (10, 5):
{'max_label': 109,
'min_label': 105,
'nop': 3,
'shapes': [(10, 5), (5, 5), (10, 5)],
....}
...
iter labels: [105, 107, 109],'ikm'
op_axes [[0, 1, -1], [-1, 0, 1], [0, -1, 1], [0, -1, -1]]
If you write a sum-of-prod function like this in cython, you should get something close to the generalized einsum.
With the full (M, K), this simulated einsum is 6-7x slower.
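(Here is a runnable sketch, my own reconstruction rather than code from the answer, of that nditer sum-of-products for this specific problem; the op_axes rows follow the debug output above, with -1 marking an axis an operand doesn't have:)

```python
import numpy as np

M, K = 10, 5
X = np.random.rand(M, K)
C = np.random.rand(K, K)

# Iteration space is labelled 'ikm'. Each op_axes row maps one operand's
# axes onto (i, k, m); the output keeps only 'i', so k and m are reduced.
op_axes = [[0, 1, -1],    # X viewed as 'ik'
           [-1, 0, 1],    # C viewed as 'km'
           [0, -1, 1],    # X viewed as 'im'
           [0, -1, -1]]   # allocated output 'i'

with np.nditer([X, C, X, None], flags=['reduce_ok'],
               op_flags=[['readonly']] * 3 + [['readwrite', 'allocate']],
               op_axes=op_axes) as it:
    it.operands[3][...] = 0          # initialize the reduction target
    for x, y, z, w in it:
        w[...] += x * y * z          # accumulate the triple product
    result = it.operands[3].copy()

print(np.allclose(result, np.einsum('ik,km,im->i', X, C, X)))
```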
Some timings building on the other answers:
In [84]: timeit np.dot(X,C)
1 loops, best of 3: 781 ms per loop
In [85]: timeit np.einsum('ik,km->im',X,C)
1 loops, best of 3: 1.28 s per loop
In [86]: timeit np.einsum('im,im->i',A,X)
10 loops, best of 3: 163 ms per loop
This 'im,im->i' step is substantially faster than the others. The sum dimension, m, is only 20. I suspect einsum is treating this as a special case.
In [87]: timeit np.einsum('im,im->i',np.dot(X,C),X)
1 loops, best of 3: 950 ms per loop
In [88]: timeit np.einsum('im,im->i',np.einsum('ik,km->im',X,C),X)
1 loops, best of 3: 1.45 s per loop
The times for these composite calculations are simply sums of the corresponding pieces.