Find all-zero columns in pandas sparse matrix

2019-05-24 11:23发布

站内文章 / Python

46 0

闹够了就滚

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

For example I have a coo_matrix A :

import scipy.sparse as sp
A = sp.coo_matrix([3,0,3,0],
                  [0,0,2,0],
                  [2,5,1,0],
                  [0,0,0,0])

How can I get result [0,0,0,1], which indicates that first 3 columns contain non-zero values, only the 4th column is all zeros.

PS : cannot convert A to other type.
PS2 : I tried using np.nonzeros but it seems that my implementation is not very elegant.

回答1:

Approach #1 We could do something like this -

# Get the columns indices of the input sparse matrix
C = sp.find(A)[1]

# Use np.in1d to create a mask of non-zero columns. 
# So, we invert it and convert to int dtype for desired output.
out = (~np.in1d(np.arange(A.shape[1]),C)).astype(int)

Alternatively, to make the code shorter, we can use subtraction -

out = 1-np.in1d(np.arange(A.shape[1]),C)

Step-by-step run -

1) Input array and sparse matrix from it :

In [137]: arr             # Regular dense array
Out[137]: 
array([[3, 0, 3, 0],
       [0, 0, 2, 0],
       [2, 5, 1, 0],
       [0, 0, 0, 0]])

In [138]: A = sp.coo_matrix(arr) # Convert to sparse matrix as input here on

2) Get non-zero column indices :

In [139]: C = sp.find(A)[1]

In [140]: C
Out[140]: array([0, 2, 2, 0, 1, 2], dtype=int32)

3) Use np.in1d to get mask of non-zero columns :

In [141]: np.in1d(np.arange(A.shape[1]),C)
Out[141]: array([ True,  True,  True, False], dtype=bool)

4) Invert it :

In [142]: ~np.in1d(np.arange(A.shape[1]),C)
Out[142]: array([False, False, False,  True], dtype=bool)

5) Finally convert to int dtype :

In [143]: (~np.in1d(np.arange(A.shape[1]),C)).astype(int)
Out[143]: array([0, 0, 0, 1])

Alternative subtraction approach :

In [145]: 1-np.in1d(np.arange(A.shape[1]),C)
Out[145]: array([0, 0, 0, 1])

Approach #2 Here's another way and possibly a faster one using matrix-multiplication -

out = 1-np.ones(A.shape[0],dtype=bool)*A.astype(bool)

Runtime test

Let's test out all the posted approaches on a big and really sparse matrix -

In [29]: A = sp.coo_matrix((np.random.rand(4000,4000)>0.998).astype(int))

In [30]: %timeit 1-np.in1d(np.arange(A.shape[1]),sp.find(A)[1])
100 loops, best of 3: 4.12 ms per loop # Approach1

In [31]: %timeit 1-np.ones(A.shape[0],dtype=bool)*A.astype(bool)
1000 loops, best of 3: 771 µs per loop # Approach2

In [32]: %timeit 1 - (A.col==np.arange(A.shape[1])[:,None]).any(axis=1)
1 loops, best of 3: 236 ms per loop # @hpaulj's soln

In [33]: %timeit (A!=0).sum(axis=0)==0
1000 loops, best of 3: 1.03 ms per loop  # @jez's soln

In [34]: %timeit (np.sum(np.absolute(A.toarray()), 0) == 0) * 1
10 loops, best of 3: 86.4 ms per loop  # @wwii's soln

回答2:

The actual logical operation can be performed like this:

b = (A!=0).sum(axis=0)==0
# matrix([[False, False, False,  True]], dtype=bool)

Now, to ensure that I'm answering your question exactly, I'd better tell you how you could convert from booleans to integers (although really, for most applications I can think of, you can do a lot more in numpy and friends if you stick with an array of bools):

b = b.astype(int)
#matrix([[0, 0, 0, 1]])

Either way, to then convert from a matrix to a list, you could do this:

c = list(b.flat)
# [0, 0, 0, 1]

...although again, I'm not sure this is the best thing to do: for most applications I can imagine, I would perhaps just convert to a one-dimensional numpy.array with c = b.A.flatten() instead.

回答3:

Recent

scipy.sparse.coo_matrix how to fast find all zeros column, fill with 1 and normalize

similar, except it wants to fill those columns with 1s and normalize them.

I immediately suggested the lil format of the transpose. All-0 columns will be empty lists in this format. But sticking with the coo format I suggested

np.nonzero(~(Mo.col==np.arange(Mo.shape[1])[:,None]).any(axis=1))[0]

or for this 1/0 format

1 - (Mo.col==np.arange(Mo.shape[1])[:,None]).any(axis=1)

which is functionally the same as:

1 - np.in1d(np.arange(Mo.shape[1]),Mo.col)

sparse.find converts the matrix to csr to sum duplicates and eliminate duplicates, and then back to coo to get the data, row, and col attributes (which it returns).

Mo.nonzero uses A.data != 0 to eliminate 0s before returning the col and row attributes.

The np.ones(A.shape[0],dtype=bool)*A.astype(bool) solution requires converting A to csr format for multiplication.

(A!=0).sum(axis=0) also converts to csr because column (or row) sum is done with a matrix multiplication.

So the no-convert requirement is unrealistic, at least within the bounds of sparse formats.

===============

For Divakar's test case my == version is quite slow; it's ok with small ones, but creates too large of test array with the 1000 columns.

Testing on a matrix that is sparse enough to have a number of 0 columns:

In [183]: Arr=sparse.random(1000,1000,.001)
In [184]: (1-np.in1d(np.arange(Arr.shape[1]),Arr.col)).any()
Out[184]: True
In [185]: (1-np.in1d(np.arange(Arr.shape[1]),Arr.col)).sum()
Out[185]: 367

In [186]: timeit 1-np.ones(Arr.shape[0],dtype=bool)*Arr.astype(bool)
1000 loops, best of 3: 334 µs per loop
In [187]: timeit 1-np.in1d(np.arange(Arr.shape[1]),Arr.col)
1000 loops, best of 3: 323 µs per loop
In [188]: timeit 1-(Arr.col==np.arange(Arr.shape[1])[:,None]).any(axis=1)
100 loops, best of 3: 3.9 ms per loop
In [189]: timeit (Arr!=0).sum(axis=0)==0
1000 loops, best of 3: 820 µs per loop

回答4:

Convert to an array or dense matrix, sum the absolute value along the first axis, test the result against zero, convert to int

>>> import numpy as np
>>> (np.sum(np.absolute(a.toarray()), 0) == 0) * 1
array([0, 0, 0, 1])
>>> (np.sum(np.absolute(a.todense()), 0) == 0) * 1
matrix([[0, 0, 0, 1]])
>>> 
>>> np.asarray((np.sum(np.absolute(a.todense()), 0) == 0), dtype = np.int32)
array([[0, 0, 0, 1]])
>>>

The first is the fastest - 24 uS for your example on my machine.

For a matrix made with np.random.randint(0,3,(1000,1000)), all are right at 25 mS on my machine.