How can I “sparsify” on two values?


Question:

Consider the pandas Series s:

import pandas as pd

n = 1000
s = pd.Series([0] * n + [1] * n, dtype=int)

s.memory_usage()

8080

I can "sparsify" this by using to_sparse

s.to_sparse(fill_value=0).memory_usage()

4080

But I only have two distinct integer values, so I'd think I could "sparsify" twice. Is there a way to do this?

Answer 1:

Since you tagged this with scipy, I'll show you what a scipy.sparse matrix looks like (the session below assumes import numpy as np and from scipy import sparse):

In [31]: n=100
In [32]: arr=np.array([[0]*n+[1]*n],int)
In [33]: M=sparse.csr_matrix(arr)
In [34]: M.data
Out[34]: 
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
In [35]: M.indices
Out[35]: 
array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
       113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
       126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
       139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151,
       152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
       165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177,
       178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190,
       191, 192, 193, 194, 195, 196, 197, 198, 199], dtype=int32)
In [36]: M.indptr
Out[36]: array([  0, 100], dtype=int32)

It has replaced the 2*n elements of arr with two arrays of n elements each (data and indices), plus a 2-element indptr. Even if I cast the data to uint8, the M.indices array will still be int32.

The fact that your pandas version has half the memory usage suggests that it is just storing the indices, and somehow noting that the data part is all 1s. But that's just a guess.

How much greater sparsification do you expect?
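
For a rough accounting you can add up the pieces of the csr matrix yourself. A sketch, assuming the same n=100 setup as the session above:

import numpy as np
from scipy import sparse

n = 100
arr = np.array([[0] * n + [1] * n], int)
M = sparse.csr_matrix(arr)

arr.nbytes                 # 2*n ints: 800 or 1600 bytes, depending on the platform's default int
M.data.nbytes + M.indices.nbytes + M.indptr.nbytes   # n values + n int32 indices + a 2-element indptr

M8 = M.astype(np.uint8)    # shrinking the data dtype doesn't shrink the indices
M8.data.nbytes, M8.indices.nbytes                    # 100 vs 400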

====================

http://pandas.pydata.org/pandas-docs/stable/sparse.html

This example from the docs suggests that pandas is implementing some sort of run-length compression:

In [4]: sts
Out[4]: 
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: float64
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)

It has identified 2 blocks, of length 2 each. It still has to store the 4 nonfill values in some array.
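
To reproduce something like that sts example yourself, a sketch using the same pre-1.0 to_sparse API as the docs:

import numpy as np
import pandas as pd

ts = pd.Series(np.random.randn(10))
ts[2:-2] = np.nan
sts = ts.to_sparse()       # default fill_value for floats is NaN
sts.sp_index               # BlockIndex: 2 blocks of length 2
sts.sp_values              # the 4 nonfill values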

A csr sparse equivalent (for a row array):

In [1052]: arr=np.random.rand(10)
In [1053]: arr[2:-2]=0
In [1055]: M=sparse.csr_matrix(arr)
In [1056]: M
Out[1056]: 
<1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in Compressed Sparse Row format>
In [1057]: M.data
Out[1057]: array([ 0.37875012,  0.73703368,  0.7935645 ,  0.22948213])
In [1058]: M.indices
Out[1058]: array([0, 1, 8, 9], dtype=int32)
In [1059]: M.indptr
Out[1059]: array([0, 4], dtype=int32)

The pandas version might be more compact when the nonfill values occur in runs. But I suspect

0         1.0
1         1.0
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8         1.0
9         1.0

would produce the same blocks. I don't see any evidence that it tries to identify the identical 1.0 values and store them as a single value plus a count.
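
A quick way to test that guess, with the same pre-1.0 API (a sketch):

import numpy as np
import pandas as pd

s = pd.Series([1.0, 1.0] + [np.nan] * 6 + [1.0, 1.0])
sp = s.to_sparse()
sp.sp_index        # the same 2 blocks of length 2 as the random-valued example
sp.sp_values       # still stores all four values: array([1., 1., 1., 1.])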

================

Based on @MaxU's answer below, your sparse series stores 1000 1s, plus two single-element arrays that tell it where those values are stored.

In [56]: sp.memory_usage()
Out[56]: 1080

In [57]: sp.sp_index
Out[57]:
BlockIndex
Block locations: array([1000])
Block lengths: array([1000])

As long as the nonfill values occur in big runs, the block arrays stay small. But if you scattered those 1000 values throughout the series, you'd multiply the number of blocks substantially (see the sketch after these arrays):

 block locations: array([1,3,6,10,...])
 block lengths: array([1,1,1,2,1,...])
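
A sketch of that contrast, assuming the pre-1.0 API:

import pandas as pd

n = 1000
runs = pd.Series([0] * n + [1] * n, dtype=int).to_sparse(fill_value=0)
scattered = pd.Series([0, 1] * n, dtype=int).to_sparse(fill_value=0)

runs.sp_index          # BlockIndex: 1 block of length 1000
scattered.sp_index     # BlockIndex: 1000 blocks of length 1 each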

I can imagine a mapping between the csr layout and the pandas blocks, but I haven't worked out the details. The csr layout is meant for 2d arrays, with a clear concept of rows and columns. It looks like a sparse DataFrame just contains sparse Series objects.

===================

https://stackoverflow.com/a/38157234/901925 shows how to map sparse DataFrame values onto a scipy sparse matrix. For each column (a sparse Series) it uses sp_values, fill_value, and sp_index.

pandas/pandas/sparse/scipy_sparse.py has the code for the interaction between scipy sparse matrices and pandas sparse series.
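
A minimal sketch of that per-column mapping, assuming the pre-1.0 SparseDataFrame API and fill_value=0 (to_int_index converts a BlockIndex into explicit positions):

import numpy as np
import pandas as pd
from scipy import sparse

sdf = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1]}).to_sparse(fill_value=0)

cols = []
for name in sdf.columns:
    ss = sdf[name]
    data = ss.sp_values
    rows = ss.sp_index.to_int_index().indices      # positions of the nonfill values
    cols.append(sparse.coo_matrix((data, (rows, np.zeros_like(rows))),
                                  shape=(len(ss), 1)))

M = sparse.hstack(cols).tocsr()                    # 4x2 csr matrix with 4 stored elements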

===================

kind='integer' produces a sparse structure more like scipy.sparse:

In [62]: n=5; s=pd.Series([0]*5+[1]*5, dtype=int)
In [63]: ss=s.to_sparse(fill_value=0, kind='integer')
In [64]: ss
Out[64]: 
0    0
1    0
2    0
3    0
4    0
5    1
6    1
7    1
8    1
9    1
dtype: int32
IntIndex
Indices: array([5, 6, 7, 8, 9])

Contrast that with the default kind='block':

dtype: int32
BlockIndex
Block locations: array([5])
Block lengths: array([5])

An equivalent one-column sparse matrix can be built with:

In [90]: data=ss.sp_values
In [91]: rows=ss.sp_index.indices
In [92]: cols=np.zeros_like(rows)
In [93]: sparse.csr_matrix((data,(rows,cols)))
Out[93]: 
<10x1 sparse matrix of type '<class 'numpy.int32'>'
    with 5 stored elements in Compressed Sparse Row format>

There is a to_coo method, but it only works with the more complex pd.MultiIndex object (why?).
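
For reference, a sketch of the to_coo route, adapted from the pandas docs; it needs a MultiIndex so that some levels can serve as rows and others as columns:

import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])
s.index = pd.MultiIndex.from_tuples(
    [(1, 2, 'a', 0), (1, 2, 'a', 1), (1, 1, 'b', 0),
     (1, 1, 'b', 1), (2, 1, 'b', 0), (2, 1, 'b', 1)],
    names=['A', 'B', 'C', 'D'])
ss = s.to_sparse()

A, rows, columns = ss.to_coo(row_levels=['A', 'B'],
                             column_levels=['C', 'D'],
                             sort_labels=False)
A           # 3x4 scipy.sparse.coo_matrix holding the 3 nonfill values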



Answer 2:

Pandas documentation says:

Currently, float64, int64 and bool dtypes are supported.

so let's try converting your series to bool (one byte per value instead of the int's four or eight):

In [53]: s.memory_usage()
Out[53]: 8080

In [54]: s.to_sparse().memory_usage()
Out[54]: 4080

In [55]: sp = s.astype(bool).to_sparse()

In [56]: sp.memory_usage()
Out[56]: 1080

In [57]: sp.sp_index
Out[57]:
BlockIndex
Block locations: array([1000])
Block lengths: array([1000])
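
A rough accounting of where that 1080 goes, reusing the construction above (a sketch; the ~80-byte index overhead may vary by version):

import pandas as pd

n = 1000
sp = pd.Series([0] * n + [1] * n, dtype=int).astype(bool).to_sparse()

sp.sp_values.nbytes    # 1000 nonfill values at 1 byte each (bool) -> 1000
sp.memory_usage()      # plus ~80 bytes for the RangeIndex         -> 1080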