Edit: See this question where I learned how to parallelize sparse matrix-vector multiplication in Python using Numba, and was able to tie with Matlab.
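The core of that solution is a row-parallel CSR matvec JIT-compiled with Numba. Roughly sketched from memory (the function names are mine, and it assumes CSR storage with float64 data):

import numpy as np
import scipy.sparse as sp
from numba import njit, prange

@njit(parallel=True)
def csr_matvec_kernel(indptr, indices, data, x, y):
    # Each row of a CSR matrix is an independent dot product,
    # so prange spreads the rows across all available cores.
    for i in prange(y.shape[0]):
        s = 0.0
        for j in range(indptr[i], indptr[i + 1]):
            s += data[j] * x[indices[j]]
        y[i] = s

def parallel_spmv(A, x):
    A = sp.csr_matrix(A)                       # kernel assumes CSR storage
    x = np.asarray(x, dtype=np.float64).ravel()
    y = np.empty(A.shape[0], dtype=np.float64)
    csr_matvec_kernel(A.indptr, A.indices, A.data, x, y)
    return y.reshape(-1, 1)                    # match the (m, 1) vector shape used below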
Original question:
I'm observing that sparse matrix-vector multiplication is about 4 or 5 times faster in Matlab than in Python (using scipy sparse matrices). Here are some details from the Matlab command line:
>> whos A
  Name        Size                 Bytes  Class     Attributes
  A           47166x113954     610732376  double    sparse
>> whos ATrans
  Name        Size                 Bytes  Class     Attributes
  ATrans      113954x47166     610198072  double    sparse
>> nnz(A)/numel(A)
ans =
0.0071
>> whos x
  Name        Size                 Bytes  Class     Attributes
  x           113954x1            911632  double
>> myFun = @() A*x; timeit(myFun)
ans =
0.0601
>> myFun = @() ATrans'*x; timeit(myFun)
ans =
0.0120
The matrix ATrans is the transpose of A. Notice that computing A times x in Matlab takes about 0.06 seconds, but if I use the weird "transpose trick" and compute ATrans' times x (which yields the same result as A times x), the computation takes only 0.012 seconds. (I do not understand why this trick works.)
Here are some timing results from the Python command line:
In[50]: type(A)
Out[50]:
scipy.sparse.csc.csc_matrix
In[51]: A.shape
Out[51]:
(47166, 113954)
In[52]: type(x)
Out[52]:
numpy.ndarray
In[53]: x.shape
Out[53]:
(113954L, 1L)
In[54]: timeit.timeit('A.dot(x)',setup="from __main__ import A,x",number=100)/100.0
Out[54]:
0.054835451461831324
So the Python runtime to do A times x is about 4.5 times longer than the Matlab runtime, for the same matrix A. If I store A in csr format, the Python runtime is a little bit worse:
In[63]: A = sp.sparse.csr_matrix(A)
In[64]: timeit.timeit('A.dot(x)',setup="from __main__ import A,x",number=100)/100.0
Out[64]:
0.0722580226496575
Here's some info about which Python and Anaconda versions I'm using:
In[2]: import sys; print('Python %s on %s' % (sys.version, sys.platform))
Python 2.7.12 |Anaconda 4.2.0 (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)] on win32
Question: Why is this sparse matrix-vector multiplication faster in Matlab than in Python? And how can I make it equally fast in Python?
Edit 1: Here is a clue. In Python, if I set the number of threads to 1, the runtime for dense matrix-vector multiplication increases severely, but the runtime for sparse matrix-vector multiplication is nearly unchanged.
In[48]: M = np.random.rand(1000,1000)
In[49]: y = np.random.rand(1000,1)
In[50]: import mkl
In[51]: mkl.get_max_threads()
Out[51]:
20
In[52]: timeit.timeit('M.dot(y)', setup = "from __main__ import M,y", number=100) / 100.0
Out[52]:
7.232593519574948e-05
In[53]: mkl.set_num_threads(1)
In[54]: timeit.timeit('M.dot(y)', setup = "from __main__ import M,y", number=100) / 100.0
Out[54]:
0.00044465965093536396
In[56]: type(A)
Out[56]:
scipy.sparse.csc.csc_matrix
In[57]: timeit.timeit('A.dot(x)', setup = "from __main__ import A,x", number=100) / 100.0
Out[57]:
0.055780856886028685
In[58]: mkl.set_num_threads(20)
In[59]: timeit.timeit('A.dot(x)', setup = "from __main__ import A,x", number=100) / 100.0
Out[59]:
0.05550840215802509
So for dense matrix-vector products, setting the number of threads to 1 increased the runtime by a factor of about 6. But for the sparse matrix-vector product, reducing the number of threads to 1 did not change the runtime.
I think this suggests that in Python the sparse matrix-vector multiplication is not being performed in parallel, whereas dense matrix-vector multiplication is making use of all available cores. Do you agree with this conclusion? And if so, is there a way to make use of all available cores for sparse matrix-vector multiplication in Python?
Note that Anaconda is supposed to use Intel MKL optimizations by default: https://www.continuum.io/blog/developer-blog/anaconda-25-release-now-mkl-optimizations
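(To double-check that numpy really is linked against MKL on a given install:)

import numpy as np
np.show_config()   # the BLAS/LAPACK sections should mention mkl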
Edit 2: I read here that "For sparse matrices, all level 2 [BLAS] operations except for the sparse triangular solvers are threaded" in the Intel MKL. This suggests to me that scipy is not using Intel MKL to perform sparse matrix-vector multiplication. It seems that @hpaulj (in an answer posted below) has confirmed this conclusion by inspecting the code for the function csr_matvec. So, can I just call the Intel MKL sparse matrix-vector multiplication function directly? How would I do that?
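One route I'm considering is calling MKL's mkl_dcsrmv through ctypes. An untested sketch, assuming Anaconda's single mkl_rt library is on the loader path and that A has float64 values with int32 indices (which is what MKL's default LP64 interface expects):

import ctypes
import numpy as np
import scipy.sparse as sp

mkl = ctypes.CDLL('mkl_rt.dll')  # 'libmkl_rt.so' on Linux

def mkl_spmv(A, x):
    # y := A*x via mkl_dcsrmv, which accepts rectangular CSR matrices.
    A = sp.csr_matrix(A)
    x = np.ascontiguousarray(x, dtype=np.float64).ravel()
    y = np.empty(A.shape[0], dtype=np.float64)
    dp = ctypes.POINTER(ctypes.c_double)
    ip = ctypes.POINTER(ctypes.c_int)
    mkl.mkl_dcsrmv(
        ctypes.c_char_p(b'N'),                   # no transpose
        ctypes.byref(ctypes.c_int(A.shape[0])),  # number of rows
        ctypes.byref(ctypes.c_int(A.shape[1])),  # number of columns
        ctypes.byref(ctypes.c_double(1.0)),      # alpha
        ctypes.c_char_p(b'GXXCXX'),              # general matrix, zero-based indexing
        A.data.ctypes.data_as(dp),
        A.indices.ctypes.data_as(ip),            # int32 column indices
        A.indptr[:-1].ctypes.data_as(ip),        # row starts
        A.indptr[1:].ctypes.data_as(ip),         # row ends
        x.ctypes.data_as(dp),
        ctypes.byref(ctypes.c_double(0.0)),      # beta
        y.ctypes.data_as(dp))
    return y.reshape(-1, 1)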
Edit 3: Here is an additional piece of evidence. The Matlab sparse matrix-vector multiplication runtime appears to be unchanged when I set the maximum number of threads equal to 1.
>> maxNumCompThreads
ans =
20
>> myFun = @() ATrans'*x; timeit(myFun)
ans =
0.012545604076342
>> maxNumCompThreads(1) % set number of threads to 1 (returns the previous value)
ans =
20
>> maxNumCompThreads % Check that max number of threads is 1
ans =
1
>> myFun = @() ATrans'*x; timeit(myFun)
ans =
0.012164191957568
This makes me question my previous theory that Matlab's advantage is due to multithreading.