Huge sparse dataframe to scipy sparse matrix witho

I have data with more than 1 million rows and 30 columns; one of the columns is user_id (more than 1500 different users). I want to one-hot encode this column and use the data in ML algorithms (xgboost, FFM, scikit). But because of the huge number of rows and unique user values, the matrix will be ~1 million x 1500, so I need to do this in a sparse format (otherwise the data eats all the RAM).

For me, the most convenient way to work with the data is through a pandas DataFrame, which now also supports a sparse format:

df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)

This works pretty fast and takes little RAM. But to work with scikit algorithms and xgboost, the dataframe has to be transformed into a sparse matrix.

Is there any way to do this other than iterating through the columns and hstacking them into one scipy sparse matrix? I tried df.as_matrix() and df.values, but both of them first convert the data to dense, which raises a MemoryError :(

P.S. The same applies to getting a DMatrix for xgboost.

UPDATE:

So I implemented the following solution (I would be thankful for optimisation suggestions):

import pandas as pd
from scipy import sparse


def sparse_df_to_sparse_matrix(sparse_df):
    index_list = sparse_df.index.values.tolist()
    matrix_columns = []
    sparse_matrix = None

    for column in sparse_df.columns:
        sps_series = sparse_df[column]
        # to_coo() needs a (row, column) MultiIndex on the sparse series
        sps_series.index = pd.MultiIndex.from_product([index_list, [column]])
        curr_sps_column, rows, cols = sps_series.to_coo()
        # identity check instead of "!= None"
        if sparse_matrix is not None:
            sparse_matrix = sparse.hstack([sparse_matrix, curr_sps_column])
        else:
            sparse_matrix = curr_sps_column
        matrix_columns.extend(cols)

    return sparse_matrix, index_list, matrix_columns

And the following code produces the sparse dataframe:

one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
full_sparse_df = one_hot_df.to_sparse(fill_value=0)

I have created a sparse matrix of 1.1 million rows x 1150 columns. But while it is being created, it still uses a significant amount of RAM (~10 GB, right at the edge of my 12 GB).

I don't know why, because the resulting sparse matrix uses only 300 MB (after loading from HDD). Any ideas?

Tags: pandas machine-learning scipy scikit-learn data-analysis

2 Answers

狗以群分

#2 · 2020-07-18 09:09

You should be able to use the experimental .to_coo() method in pandas [1] in the following way:

one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
one_hot_df, idx_rows, idx_cols = one_hot_df.stack().to_sparse().to_coo()

Instead of taking a DataFrame (rows / columns), this method takes a Series with the rows and columns in a MultiIndex (this is why you need the .stack() method). This Series with the MultiIndex needs to be a SparseSeries, and even if your input is a SparseDataFrame, .stack() returns a regular Series. So you need to call the .to_sparse() method before calling .to_coo().
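To illustrate the MultiIndex requirement, here is a minimal sketch. It assumes a newer pandas release in which SparseSeries and .to_sparse() no longer exist and the same conversion is reached through the .sparse accessor; the toy data and the level names "row" and "col" are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy long-format data: one entry per (row label, column label) pair.
# The level names "row" and "col" are hypothetical.
index = pd.MultiIndex.from_tuples(
    [(0, "user_a"), (1, "user_b"), (2, "user_a"), (3, "user_c")],
    names=["row", "col"],
)
stacked = pd.Series([1.0, 1.0, 1.0, 1.0], index=index)

# to_coo() is only available on sparse data, so convert the Series first.
sparse_stacked = stacked.astype(pd.SparseDtype("float64", np.nan))

# One index level becomes the matrix rows, the other the matrix columns.
matrix, row_labels, col_labels = sparse_stacked.sparse.to_coo(
    row_levels=["row"], column_levels=["col"], sort_labels=True
)
```

Here matrix is a scipy.sparse COO matrix, and row_labels / col_labels record which index values each row and column correspond to.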

The Series returned by .stack(), even though it is not a SparseSeries, only contains the elements that are not null, so it shouldn't take more memory than the sparse version (at least with np.nan when the type is np.float).
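On newer pandas versions there may be a shorter route that skips .stack() entirely. The sketch below assumes a pandas release that provides the DataFrame.sparse accessor, and a frame in which every column is sparse (here both columns are one-hot encoded, so nothing dense is left over); the toy values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the real data; only the two encoded columns are present.
df = pd.DataFrame({
    "user_id": [10, 20, 10, 30],
    "type": ["a", "b", "a", "a"],
})

# Every resulting column is a sparse dummy column.
one_hot_df = pd.get_dummies(df, columns=["user_id", "type"], sparse=True)

# Convert straight to a scipy COO matrix without going through a dense array.
coo = one_hot_df.sparse.to_coo()

# CSR is usually the more convenient format for row slicing and estimators.
csr = coo.tocsr()
```

If the frame also holds dense columns, as in the question's 30-column data, I would expect the accessor to refuse them, so those columns would need to be converted to a sparse dtype or handled separately first.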

[1] http://pandas.pydata.org/pandas-docs/stable/sparse.html#interaction-with-scipy-sparse
