I'm using the Anaconda distribution of Python, together with Numba, and I've written the following Python function that multiplies a sparse matrix A
(stored in a CSR format) by a dense vector x
:
@jit
def csrMult( x, Adata, Aindices, Aindptr, Ashape ):
numRowsA = Ashape[0]
Ax = numpy.zeros( numRowsA )
for i in range( numRowsA ):
Ax_i = 0.0
for dataIdx in range( Aindptr[i], Aindptr[i+1] ):
j = Aindices[dataIdx]
Ax_i += Adata[dataIdx] * x[j]
Ax[i] = Ax_i
return Ax
Here A
is a large scipy
sparse matrix,
>>> A.shape
( 56469, 39279 )
# having ~ 142,258,302 nonzero entries (so about 6.4% )
>>> type( A[0,0] )
dtype( 'float32' )
and x
is a numpy
array. Here is a snippet of code that calls the above function:
x = numpy.random.randn( A.shape[1] )
Ax = A.dot( x )
AxCheck = csrMult( x, A.data, A.indices, A.indptr, A.shape )
Notice the @jit
-decorator that tells Numba to do a just-in-time compilation for the csrMult()
function.
In my experiments, my function csrMult()
is about twice as fast as the scipy
.dot()
method. That is a pretty impressive result for Numba.
However, MATLAB still performs this matrix-vector multiplication about 6 times faster than csrMult()
. I believe that is because MATLAB uses multithreading when performing sparse matrix-vector multiplication.
Question:
How can I parallelize the outer for
-loop when using Numba?
Numba used to have a prange()
function, that made it simple to parallelize embarassingly parallel for
-loops. Unfortunately, Numba no longer has prange()
[actually, that is false, see the edit below]. So what is the correct way to parallelize this for
-loop now, that Numba's prange()
function is gone?
When prange()
was removed from Numba, what alternative did the developers of Numba have in mind?
Edit 1:
I updated to the latest version of Numba, which is .35, and prange()
is back! It was not included in version .33, the version I had been using.
That is good news, but unfortunately I am getting an error message when I attempt to parallelize my for loop using prange()
. Here is a parallel for loop example from the Numba documentation (see section 1.9.2 "Explicit Parallel Loops"), and below is my new code:
from numba import njit, prange
@njit( parallel=True )
def csrMult_numba( x, Adata, Aindices, Aindptr, Ashape):
numRowsA = Ashape[0]
Ax = np.zeros( numRowsA )
for i in prange( numRowsA ):
Ax_i = 0.0
for dataIdx in range( Aindptr[i],Aindptr[i+1] ):
j = Aindices[dataIdx]
Ax_i += Adata[dataIdx] * x[j]
Ax[i] = Ax_i
return Ax
When I call this function, using the code snippet given above, I receive the following error:
AttributeError: Failed at nopython (convert to parfors) 'SetItem'
object has no attribute 'get_targets'
Given
the above attempt to use prange
crashes, my question stands:
What is the correct way ( using prange
or an alternative method ) to parallelize this Python for
-loop?
As noted below, it was trivial to parallelize a similar for loop in C++ and obtain an 8x speedup, having been run on 20-omp-threads. There must be a way to do it using Numba, since the for loop is embarrassingly parallel (and since sparse matrix-vector multiplication is a fundamental operation in scientific computing).
Edit 2:
Here is my C++ version of csrMult()
. Parallelizing the for()
loop in the C++ version makes the code about 8x faster in my tests. This suggests to me that a similar speedup should be possible for the Python version when using Numba.
void csrMult(VectorXd& Ax, VectorXd& x, vector<double>& Adata, vector<int>& Aindices, vector<int>& Aindptr)
{
// This code assumes that the size of Ax is numRowsA.
#pragma omp parallel num_threads(20)
{
#pragma omp for schedule(dynamic,590)
for (int i = 0; i < Ax.size(); i++)
{
double Ax_i = 0.0;
for (int dataIdx = Aindptr[i]; dataIdx < Aindptr[i + 1]; dataIdx++)
{
Ax_i += Adata[dataIdx] * x[Aindices[dataIdx]];
}
Ax[i] = Ax_i;
}
}
}
Thanks for your quant updates, Daniel.
The following lines might be hard to swallow, but kindly believe me, there are more things to take into account. I have worked on HPC / parallel computing problems having matrices in scales ~ N [TB]; N > 10
and their sparse accompanions, so some pieces of experience may be useful for your further views.
WARNING: Do not expect any dinner to be served for free
A wish to parallelise a piece of code sounds like a more and more often contemporary re-articulated mana. The problem is not the code, but the cost of such move.
The economy is the number one problem. Amdahl's Law, as it was originally formulated by Gene Amdahl, did not take into account the very costs of [PAR]
-processes-setups + [PAR]
-processes-finalisations & terminations, which indeed have to be paid in every real-world implementation.
The overhead-strict Amdahl's Law depicts the scale of these un-avoidable adverse effects and helps understand a few new aspects that have to be evaluated before one opts to introduce parallelisation ( at an acceptable cost of doing so, as it is very, indeed VERY EASY to pay MUCH more than one may gain from -- where a naive disappointment from a degraded processing performance is the easier part of the story ).
Feel free to read more posts on overhead-strict Amdahl's Law re-formulation, if willing to better understand this topic and to pre-calculate the actual "minimum"-subProblem-"size", for which the sum-of-[PAR]
-overheads will get at least justified from real-world tools for introducing the parallel-split of the subProblem onto N_trully_[PAR]_processes
( not any "just"-[CONCURRENT]
, but true-[PARALLEL]
-- these are way not equal ).
Python may get a dose of steroids for increased performance:
Python is a great prototyping eco-system, whereas numba
, numpy
and other compiled-extensions help a lot to boost performance way farther than a native, GIL-stepped python (co-)-processing typically delivers.
Here, you try to enforce numba.jit()
to arrange the job almost-for-free, just by its automated jit()
-time lexical-analyser ( that you throw your code on ), which ought both "understand" your global goal ( What to do ), and also propose some vectorisation-tricks ( How best assemble a heap of CPU-instructions for maximum efficiency of such code-execution ).
This sounds easy, but it is not.
Travis Oliphant's team has made immense progress on numba
tools, but let's be realistic and fair not to expect any form of automated-wizardry to get implemented inside a .jit()
-lexer + code-analysis, when trying to transform a code and assemble a more efficient flow of machine instructions to implement the high-level task's goal.
@guvectorize
? Here? Seriously?
Due to [PSPACE]
sizing, you may immediately forget to ask numba
to somehow efficiently "stuff" the GPU-engine with data, a memory-footprint of which is way behind the GPU-GDDR sizings ( not speaking at all about too-"shallow" GPU-kernel sizes for such mathematically-"tiny" processing to just multiply, potentially in [PAR]
, but to later sum in [SEQ]
).
(Re-)-Loading GPU with data takes heaps of time. If having paid that, the In-GPU memory-latencies are not very friendly for "tiny"-GPU-kernels economy either -- your GPU-SMX code-execution will have to pay ~ 350-700 [ns]
just to fetch a number ( most probably not automatically re-aligned for the best coalesced SM-cache-friendly re-use in next steps and you may notice, that you never, let me repeat it, NEVER re-use a single matrix cell at all, so caching per-se will not deliver anything under those 350~700 [ns]
per matrix cell ), while a smart pure numpy
-vectorised code can process matrix-vector product in less than 1 [ns]
per cell on even the largest [PSPACE]
-footprints.
That is a yardstick to compare to.
( Profiling would better show here the hard-facts, but the principle is well known beforehand, without testing how to move a few TB
of data onto GPU-fabric just to realise this on one's own. )
The worst of the bad news:
Given the memory scales of the matrix A
, the worse effect to be expected is, that the sparse-organisation of the storage of the matrix representation will most probably devastate most, if not all, possible performance gains achievable by numba
-vectorised tricks on dense matrix representations, as there will likely be almost zero chance for efficient memory fetched cache-line re-uses and sparsity will also break any easy way to achieve a compact mapping of vectorised operations and these will hardly remain able to get easily translated into advanced CPU-hardware vector-processing resources.
Inventory of solvable problems:
- always better pre-allocate the vector
Ax = np.zeros_like( A[:,0] )
and pass it as another parameter into the numba.jit()
-compiled parts of the code, so as to avoid repetitive paying additional [PTIME,PSPACE]
-costs for creating ( again ) new memory-allocations ( the more if the vector is suspect for being used inside an externally orchestrated iterative optimisation process )
- always better specify ( to narrow the universality, for the sake of resulting code performance )
at least the numba.jit( "f8[:]( f4[:], f4[:,:], ... )" )
-calling interface directives
- always review all
numba.jit()
-options available and their respective default values ( may change version to version ) for your specific situation ( disabling GIL and better aligning the goals with numba
+ hardware capabilities will always help in numerically intensive parts of the code )
@jit( signature = [ numba.float32( numba.float32, numba.int32 ), # # [_v41] @decorator with a list of calling-signatures
numba.float64( numba.float64, numba.int64 ) #
], #__________________ a list of signatures for prepared alternative code-paths, to avoid a deferred lazy-compilation if undefined
nopython = False, #__________________ forces the function to be compiled in nopython mode. If not possible, compilation will raise an error.
nogil = False, #__________________ tries to release the global interpreter lock inside the compiled function. The GIL will only be released if Numba can compile the function in nopython mode, otherwise a compilation warning will be printed.
cache = False, #__________________ enables a file-based cache to shorten compilation times when the function was already compiled in a previous invocation. The cache is maintained in the __pycache__ subdirectory of the directory containing the source file.
forceobj = False, #__________________ forces the function to be compiled in object mode. Since object mode is slower than nopython mode, this is mostly useful for testing purposes.
locals = {} #__________________ a mapping of local variable names to Numba Types.
) #____________________# [_v41] ZERO <____ TEST *ALL* CALLED sub-func()-s to @.jit() too >>>>>>>>>>>>>>>>>>>>> [DONE]
def r...(...):
...
Numba has been updated and prange()
works now! (I'm answering my own question.)
The improvements to Numba's parallel computing capabilities are discussed in this blog post, dated December 12, 2017. Here is a relevant snippet from the blog:
Long ago (more than 20 releases!), Numba used to have support for an
idiom to write parallel for loops called prange()
. After a major
refactoring of the code base in 2014, this feature had to be removed,
but it has been one of the most frequently requested Numba features
since that time. After the Intel developers parallelized array
expressions, they realized that bringing back prange
would be fairly
easy
Using Numba version 0.36.1, I can parallelize my embarrassingly parallel for
-loop using the following simple code:
@numba.jit(nopython=True, parallel=True)
def csrMult_parallel(x,Adata,Aindices,Aindptr,Ashape):
numRowsA = Ashape[0]
Ax = np.zeros(numRowsA)
for i in numba.prange(numRowsA):
Ax_i = 0.0
for dataIdx in range(Aindptr[i],Aindptr[i+1]):
j = Aindices[dataIdx]
Ax_i += Adata[dataIdx]*x[j]
Ax[i] = Ax_i
return Ax
In my experiments, parallelizing the for
-loop made the function execute about eight times faster than the version I posted at the beginning of my question, which was already using Numba, but which was not parallelized. Moreover, in my experiments the parallelized version is about 5x faster than the command Ax = A.dot(x)
which uses scipy's sparse matrix-vector multiplication function. Numba has crushed scipy and I finally have a python sparse matrix-vector multiplication routine that is as fast as MATLAB.