Scipy.sparse.csr_matrix: How to get top ten values

2020-02-01 01:51发布

问题:

I have a large csr_matrix and I am interested in the top ten values and their indices each row. But I did not find a decent way to manipulate the matrix.

Here is my current solution and the main idea is to process them row by row:

row = csr_matrix.getrow(row_number).toarray()[0].ravel()
top_ten_indicies = row.argsort()[-10:]
top_ten_values = row[row.argsort()[-10:]]

By doing this, the advantages of csr_matrix is not fully used. It's more like a brute force solution.

回答1:

I don't see what the advantages of csr format are in this case. Sure, all the nonzero values are collected in one .data array, with the corresponding column indexes in .indices. But they are in blocks of varying length. And that means they can't be processed in parallel or with numpy array strides.

One solution is the pad those blocks into common length blocks. That's what .toarray() does. Then you can find the maximum values with argsort(axis=1) or withargpartition`.

Another is to break them into row sized blocks, and process each of those. That's what you are doing with the .getrow. Another way of breaking them up is convert to lil format, and process the sublists of the .data and .rows arrays.

A possible third option is to use the ufunc reduceat method. This lets you apply ufunc reduction methods to sequential blocks of an array. There are established ufunc like np.add that take advantage of this. argsort is not such a function. But there is a way of constructing a ufunc from a Python function, and gain some modest speed over regular Python iteration. [I need to look up a recent SO question that illustrates this.]

I'll illustrate some of this with a simpler function, sum over rows.

If A2 is a csr matrix.

A2.sum(axis=1)  # the fastest compile csr method
A2.A.sum(axis=1)  # same, but with a dense intermediary
[np.sum(l.data) for l in A2]  # iterate over the rows of A2
[np.sum(A2.getrow(i).data) for i in range(A2.shape[0])]  # iterate with index
[np.sum(l) for l in A2.tolil().data]  # sum the sublists of lil format
np.add.reduceat(A2.data, A2.indptr[:-1])  # with reduceat

A2.sum(axis=1) is implemented as a matrix multiplication. That's not relevant to the sort problem, but still an interesting way of looking at the summation problem. Remember csr format was developed for efficient multiplication.

For a my current sample matrix (created for another SO sparse question)

<8x47752 sparse matrix of type '<class 'numpy.float32'>'
     with 32 stored elements in Compressed Sparse Row format>

some comparative times are

In [694]: timeit np.add.reduceat(A2.data, A2.indptr[:-1])
100000 loops, best of 3: 7.41 µs per loop

In [695]: timeit A2.sum(axis=1)
10000 loops, best of 3: 71.6 µs per loop

In [696]: timeit [np.sum(l) for l in A2.tolil().data]
1000 loops, best of 3: 280 µs per loop

Everything else is 1ms or more.

I suggest focusing on developing your one-row function, something like:

def max_n(row_data, row_indices, n):
    i = row_data.argsort()[-n:]
    # i = row_data.argpartition(-n)[-n:]
    top_values = row_data[i]
    top_indices = row_indices[i]  # do the sparse indices matter?
    return top_values, top_indices, i

Then see how if fits in one of these iteration methods. tolil() looks most promising.

I haven't addressed the question of how to collect these results. Should they be lists of lists, array with 10 columns, another sparse matrix with 10 values per row, etc.?


sorting each row of a large sparse & saving top K values & column index - Similar question from several years back, but unanswered.

Argmax of each row or column in scipy sparse matrix - Recent question seeking argmax for rows of csr. I discuss some of the same issues.

how to speed up loop in numpy? - example of how to use np.frompyfunc to create a ufunc. I don't know if the resulting function has the .reduceat method.

Increasing value of top k elements in sparse matrix - get the top k elements of csr (not by row). Case for argpartition.


The row summation implemented with np.frompyfunc:

In [741]: def foo(a,b):
    return a+b  
In [742]: vfoo=np.frompyfunc(foo,2,1)
In [743]: timeit vfoo.reduceat(A2.data,A2.indptr[:-1],dtype=object).astype(float)
10000 loops, best of 3: 26.2 µs per loop

That's respectable speed. But I can't think of a way of writing a binary function (takes to 2 arguments) that would implement argsort via reduction. So this is probably a deadend for this problem.



回答2:

Just to answer the original question (for people like me who found this question looking for copy-pasta), here's a solution using multiprocessing based on @hpaulj's suggestion of converting to lil_matrix, and iterating over rows

from multiprocessing import Pool

def _top_k(args):
    """
    Helper function to process a single row of top_k
    """
    data, row = args
    data, row = zip(*sorted(zip(data, row), reverse=True)[:k])
    return data, row

def top_k(m, k):
    """
    Keep only the top k elements of each row in a csr_matrix
    """
    ml = m.tolil()
    with Pool() as p:
        ms = p.map(_top_k, zip(ml.data, ml.rows))
    ml.data, ml.rows = zip(*ms)
    return ml.tocsr()