Huge sparse dataframe to scipy sparse matrix without dense transform

Posted 2020-07-18 08:35

I have data with more than 1 million rows and 30 columns; one of the columns is user_id (more than 1500 distinct users). I want to one-hot-encode this column and use the data in ML algorithms (xgboost, FFM, scikit-learn). Because of the huge number of rows and unique user values, the matrix will be ~1 million x 1500, so it has to be built in a sparse format (otherwise the data kills all my RAM).

For me, the convenient way to work with the data is through a pandas DataFrame, which now also supports a sparse format:

df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)

This works pretty fast and takes little RAM. But to use scikit-learn algorithms and xgboost, the DataFrame has to be transformed into a sparse matrix.

Is there any way to do this other than iterating through the columns and hstack-ing them into one scipy sparse matrix? I tried df.as_matrix() and df.values, but both first convert the data to dense, which raises a MemoryError :(
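For what it's worth, newer pandas versions (0.25 and later) expose a sparse accessor that does exactly this conversion without a per-column loop. A minimal sketch, assuming every column of the frame is sparse (dense columns would need converting to SparseDtype first):

import pandas as pd

one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)

# DataFrame.sparse.to_coo() requires all columns to have a SparseDtype.
coo = one_hot_df.sparse.to_coo()
X = coo.tocsr()  # CSR is usually what scikit-learn estimators want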

P.S. The same question applies to getting a DMatrix for xgboost.
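For the DMatrix part: as far as I can tell, xgboost's DMatrix constructor accepts a scipy sparse matrix directly, so nothing has to be densified once the sparse matrix exists. A small sketch, where X is the sparse matrix from above and y is an illustrative label vector:

import xgboost as xgb

# DMatrix takes scipy sparse input as-is; no dense copy is made here.
dtrain = xgb.DMatrix(X, label=y)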

UPDATE:

So I came up with the following solution (I'd be thankful for optimization suggestions):

def sparse_df_to_sparse_matrix(sparse_df):
    index_list = sparse_df.index.values.tolist()
    matrix_columns = []
    sparse_matrix = None

    for column in sparse_df.columns:
        sps_series = sparse_df[column]
        # to_coo() needs a (row label, column label) MultiIndex on the series
        sps_series.index = pd.MultiIndex.from_product([index_list, [column]])
        curr_sps_column, rows, cols = sps_series.to_coo()
        if sparse_matrix is not None:
            sparse_matrix = sparse.hstack([sparse_matrix, curr_sps_column])
        else:
            sparse_matrix = curr_sps_column
        matrix_columns.extend(cols)

    return sparse_matrix, index_list, matrix_columns

And the following code produces the sparse DataFrame:

one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
full_sparse_df = one_hot_df.to_sparse(fill_value=0)

I have created a sparse matrix of 1.1 million rows x 1150 columns. But creating it still uses a significant amount of RAM (~10 GB, right at the edge of my 12 GB).

I don't know why, because the resulting sparse matrix takes only 300 MB (after loading it back from HDD). Any ideas?
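One likely explanation (my own guess): sparse.hstack inside the loop copies the entire accumulated matrix on every iteration, so peak memory grows roughly quadratically with the number of columns. A sketch of the usual fix, collecting the per-column blocks and stacking once at the end (same inputs and outputs as the function above):

def sparse_df_to_sparse_matrix(sparse_df):
    index_list = sparse_df.index.values.tolist()
    matrix_columns = []
    column_blocks = []

    for column in sparse_df.columns:
        sps_series = sparse_df[column]
        sps_series.index = pd.MultiIndex.from_product([index_list, [column]])
        curr_sps_column, rows, cols = sps_series.to_coo()
        column_blocks.append(curr_sps_column)
        matrix_columns.extend(cols)

    # A single hstack avoids re-copying the growing matrix on each pass.
    return sparse.hstack(column_blocks), index_list, matrix_columns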

2 Answers
狗以群分 · 2020-07-18 09:09

You should be able to use the experimental .to_coo() method in pandas [1] in the following way:

one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
one_hot_df, idx_rows, idx_cols = one_hot_df.stack().to_sparse().to_coo()

This method, instead of taking a DataFrame (rows / columns), takes a Series with the rows and columns in a MultiIndex (this is why you need the .stack() method). The Series with the MultiIndex needs to be a SparseSeries, and even if your input is a SparseDataFrame, .stack() returns a regular Series. So you need to use the .to_sparse() method before calling .to_coo().

The Series returned by .stack(), even though it's not a SparseSeries, only contains the elements that are not null, so it shouldn't take more memory than the sparse version (at least when the type is np.float and missing values are np.nan).
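One small follow-up: .to_coo() returns a scipy.sparse.coo_matrix, and for fitting scikit-learn estimators it usually pays to convert it to CSR:

# one_hot_df is a coo_matrix after the conversion above
X = one_hot_df.tocsr()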

[1] http://pandas.pydata.org/pandas-docs/stable/sparse.html#interaction-with-scipy-sparse
2020-07-18 09:20

Does my answer from a few months back help?

Pandas sparse dataFrame to sparse matrix, without generating a dense matrix in memory

It was accepted but I didn't get any further feedback.

I'm familiar with the scipy sparse formats and their inputs, but don't know much about pandas sparse.
