Question:
Is there a way to convert from a pandas.SparseDataFrame to a scipy.sparse.csr_matrix without generating a dense matrix in memory? scipy.sparse.csr_matrix(df.values) doesn't work, as it generates a dense matrix that is then cast to the csr_matrix.
Thanks in advance!
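For reference, a minimal setup that reproduces the situation might look like the sketch below (it assumes a pre-1.0 pandas, where SparseDataFrame still exists, e.g. as returned by get_dummies(..., sparse=True)); df.values materialises exactly the dense array the question wants to avoid:

import pandas as pd
import scipy.sparse

df = pd.get_dummies(pd.Series(list('abcab')), sparse=True)  # SparseDataFrame in older pandas
csr = scipy.sparse.csr_matrix(df.values)  # df.values builds a dense ndarray first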
Answer 1:
The pandas docs describe an experimental conversion to scipy sparse, SparseSeries.to_coo:
http://pandas-docs.github.io/pandas-docs-travis/sparse.html#interaction-with-scipy-sparse
Edit: that method converts from a MultiIndex series, not a data frame; see the other answers for the DataFrame case, and note the difference in dates between the answers.
As of 0.20.0, there are both sdf.to_coo() and a MultiIndex ss.to_coo(). Since a sparse matrix is inherently 2d, it makes sense to require a MultiIndex for the (effectively) 1d Series, while the DataFrame can already represent a table or 2d array on its own.
When I first responded to this question (June 2015), this sparse DataFrame/Series feature was experimental.
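A minimal sketch of the two calls mentioned above, assuming a pre-1.0 pandas recent enough to have these methods (SparseDataFrame and SparseSeries were removed in pandas 1.0); the row_levels/column_levels arguments follow the docs linked above:

import numpy as np
import pandas as pd

# 2d case: SparseDataFrame.to_coo() returns a scipy.sparse.coo_matrix
sdf = pd.DataFrame(np.eye(3)).to_sparse(fill_value=0)
csr = sdf.to_coo().tocsr()

# 1d case: a SparseSeries needs a MultiIndex so to_coo() knows which
# index levels become rows and which become columns
s = pd.Series([3.0, np.nan, 1.0, np.nan],
              index=pd.MultiIndex.from_tuples([(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b')]))
mat, row_labels, col_labels = s.to_sparse().to_coo(row_levels=[0], column_levels=[1])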
Answer 2:
Pandas 0.20.0+:
As of pandas version 0.20.0, released May 5, 2017, there is a one-liner for this:
from scipy import sparse

def sparse_df_to_csr(df):
    return sparse.csr_matrix(df.to_coo())

This uses the new to_coo() method.
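For example (a hedged sketch, assuming a pre-1.0 pandas that has SparseDataFrame.to_coo()):

import pandas as pd

sdf = pd.get_dummies(pd.Series(list('abcab')), sparse=True)  # SparseDataFrame in older pandas
csr = sparse_df_to_csr(sdf)
print(csr.shape, csr.nnz)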
Earlier Versions:
Building on Victor May's answer, here's a slightly faster implementation, but it only works if the entire SparseDataFrame is sparse, with all columns backed by a BlockIndex (note: if it was created with get_dummies, this will be the case).
Edit: I modified this so it will work with a non-zero fill value. CSR has no native non-zero fill value, so you will have to record it externally.
import numpy as np
import pandas as pd
from scipy import sparse

def sparse_BlockIndex_df_to_csr(df):
    columns = df.columns
    # Per column: stored values with the fill value subtracted, and their integer row indices.
    zipped_data = zip(*[(df[col].sp_values - df[col].fill_value,
                         df[col].sp_index.to_int_index().indices)
                        for col in columns])
    data, rows = map(list, zipped_data)
    # Column index for every stored value.
    cols = [np.ones_like(a) * i for (i, a) in enumerate(data)]
    data_f = np.concatenate(data)
    rows_f = np.concatenate(rows)
    cols_f = np.concatenate(cols)
    arr = sparse.coo_matrix((data_f, (rows_f, cols_f)),
                            df.shape, dtype=np.float64)
    return arr.tocsr()
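Example usage, again assuming an older pandas where SparseDataFrame exists; as noted above, the per-column fill values have to be recorded externally:

import pandas as pd

sdf = pd.get_dummies(pd.Series(list('abcab')), sparse=True)  # BlockIndex-backed columns
fill_values = {col: sdf[col].fill_value for col in sdf.columns}  # keep these alongside the matrix
csr = sparse_BlockIndex_df_to_csr(sdf)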
Answer 3:
The answer by @Marigold does the trick, but it is slow due to accessing all elements in each column, including the zeros. Building on it, I wrote the following quick n' dirty code, which runs about 50x faster on a 1000x1000 matrix with a density of about 1%. My code also handles dense columns appropriately.
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
# BlockIndex lives in pandas._libs.sparse in pandas 0.20+; the import path may differ in older versions.
from pandas._libs.sparse import BlockIndex

def sparse_df_to_array(df):
    num_rows = df.shape[0]
    data = []
    row = []
    col = []
    for i, col_name in enumerate(df.columns):
        if isinstance(df[col_name], pd.SparseSeries):
            # Sparse column: take only the stored values and their row indices.
            column_index = df[col_name].sp_index
            if isinstance(column_index, BlockIndex):
                column_index = column_index.to_int_index()
            ix = column_index.indices
            data.append(df[col_name].sp_values)
            row.append(ix)
            col.append(len(df[col_name].sp_values) * [i])
        else:
            # Dense column: keep every value.
            data.append(df[col_name].values)
            row.append(np.array(range(0, num_rows)))
            col.append(np.array(num_rows * [i]))
    data_f = np.concatenate(data)
    row_f = np.concatenate(row)
    col_f = np.concatenate(col)
    arr = coo_matrix((data_f, (row_f, col_f)), df.shape, dtype=np.float64)
    return arr.tocsr()
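One possible way to exercise both branches (all-sparse input and plain dense columns), assuming an older pandas with SparseDataFrame/SparseSeries:

import pandas as pd

sdf = pd.get_dummies(pd.Series(list('abcab')), sparse=True)
csr_sparse = sparse_df_to_array(sdf)            # every column hits the SparseSeries branch
csr_dense = sparse_df_to_array(sdf.to_dense())  # plain columns hit the else branch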
Answer 4:
Here's a solution that fills the sparse matrix column by column (it assumes you can fit at least one column in memory).
import pandas as pd
import numpy as np
from scipy.sparse import lil_matrix

def sparse_df_to_array(df):
    """Convert a sparse dataframe to the sparse csr_matrix used by scikit-learn."""
    arr = lil_matrix(df.shape, dtype=np.float32)
    for i, col in enumerate(df.columns):
        ix = df[col] != 0
        arr[np.where(ix)[0], i] = df.loc[ix, col]
    return arr.tocsr()
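Example usage; since the function only relies on a comparison against zero and .loc indexing, it also works on a plain DataFrame. lil_matrix is used here because it supports cheap incremental assignment, with the single tocsr() conversion done at the end:

import pandas as pd

df = pd.DataFrame({'a': [0.0, 0.0, 3.0], 'b': [1.0, 0.0, 0.0]})
csr = sparse_df_to_array(df)
print(csr.toarray())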
Answer 5:
EDIT: This method actually builds a dense representation at some stage, so it doesn't solve the question.
You should be able to use the experimental .to_coo() method in pandas [1] in the following way:
df, idx_rows, idx_cols = df.stack().to_sparse().to_coo()
df = df.tocsr()
This method, instead of taking a DataFrame (rows / columns), takes a Series with rows and columns in a MultiIndex (this is why you need the .stack() method). This Series with the MultiIndex needs to be a SparseSeries, and even if your input is a SparseDataFrame, .stack() returns a regular Series. So, you need to use the .to_sparse() method before calling .to_coo().
The Series returned by .stack(), even if it's not a SparseSeries, only contains the elements that are not null, so it shouldn't take more memory than the sparse version (at least with np.nan when the type is np.float).
[1] http://pandas.pydata.org/pandas-docs/stable/sparse.html#interaction-with-scipy-sparse
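A slightly fuller sketch of this approach (assuming a pre-1.0 pandas, where Series.to_sparse() still exists), which also keeps the row and column labels that .to_coo() returns:

import pandas as pd

df = pd.DataFrame({'a': [0.0, 0.0, 3.0], 'b': [1.0, 0.0, 0.0]})
stacked = df.stack()  # regular Series with a (row, column) MultiIndex
coo, idx_rows, idx_cols = stacked.to_sparse().to_coo(row_levels=[0], column_levels=[1])
csr = coo.tocsr()
# idx_rows / idx_cols map the matrix axes back to the original row and column labels.
# Note: explicit zeros in df are kept as stored entries; only NaN is dropped by stack().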