Can pandas SparseSeries store values in the float16 dtype?

Published 2019-09-08 17:59

Question:

The reason why I want to use a smaller data type in the sparse pandas containers is to reduce memory usage. This is relevant when working with data that originally uses bool (e.g. from pd.get_dummies) or small numeric dtypes (e.g. int8), which are all converted to float64 in sparse containers.

DataFrame creation

The provided example uses a modest 20k x 145 dataframe. In practice I'm working with dataframes on the order of 1e6 x 5e3.
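
The bool_df below was built from real data, but a stand-in with the same shape and a similar density can be generated as follows (the column names and the ~5% density are made up for illustration, chosen to roughly match the figures reported further down):

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real bool_df: 19849 x 145 with ~5% True values.
rng = np.random.RandomState(0)
bool_df = pd.DataFrame(
    rng.rand(19849, 145) < 0.05,
    columns=['topic.%d' % i for i in range(145)],
)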

In []: bool_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19849 entries, 0 to 19848
Columns: 145 entries, topic.party_nl.p.pvda to topic.sub_cat_Reizen
dtypes: bool(145)
memory usage: 2.7 MB

In []: bool_df.memory_usage(index=False).sum()
Out[]: 2878105

In []: bool_df.values.itemsize
Out[]: 1

A sparse version of this dataframe needs less memory, but is still much larger than needed, given the original dtype.

In []: sparse_df = bool_df.to_sparse(fill_value=False)

In []: sparse_df.info()
<class 'pandas.sparse.frame.SparseDataFrame'>
RangeIndex: 19849 entries, 0 to 19848
Columns: 145 entries, topic.party_nl.p.pvda to topic.sub_cat_Reizen
dtypes: float64(145)
memory usage: 1.1 MB

In []: sparse_df.memory_usage(index=False).sum()
Out[]: 1143456

In []: sparse_df.values.itemsize
Out[]: 8

Even though this data is fairly sparse, the dtype conversion from bool to float64 causes each non-fill value to take up 8x as much space.

In []: sparse_df.memory_usage(index=False).describe()
Out[]:
count      145.000000
mean      7885.903448
std      17343.762402
min          8.000000
25%        640.000000
50%       1888.000000
75%       4440.000000
max      84688.000000

Given the sparsity of the data, one would hope for a more drastic reduction in memory size:

In []: sparse_df.density
Out[]: 0.04966184346992205

Memory footprint of underlying storage

The columns of SparseDataFrame are SparseSeries, which use SparseArray as a wrapper for the underlying numpy.ndarray storage. The number of bytes that are used by the sparse dataframe can (also) be computed directly from these ndarrays:

In []: col64_nbytes = [
.....:     sparse_df[col].values.sp_values.nbytes
.....:     for col in sparse_df
.....: ]

In []: sum(col64_nbytes)
Out[]: 1143456
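
To make the wrapping described above explicit, the layering can also be inspected directly. A sketch (the exact module paths of the sparse classes vary across pandas 0.x versions):

col = sparse_df.columns[0]

type(sparse_df[col])                   # SparseSeries
type(sparse_df[col].values)            # SparseArray
type(sparse_df[col].values.sp_values)  # numpy.ndarray of the non-fill values
sparse_df[col].values.sp_values.dtype  # dtype('float64')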

The ndarrays can be converted to use smaller floats, which allows one to calculate how much memory the dataframe would need when using e.g. float16s. This would result in a 4x smaller dataframe, as one might expect.

In []: col16_nbytes = [
.....:     sparse_df[col].values.sp_values.astype('float16').nbytes
.....:     for col in sparse_df
.....: ]

In []: sum(col16_nbytes)
Out[]: 285864

By using the more appropriate dtype, the memory usage can be reduced to 10% of the dense version, whereas the float64 sparse dataframe reduces to 40%. For my data, this could make the difference between needing 20 GB and 5 GB of available memory.

In []: sum(col64_nbytes) / bool_df.memory_usage(index=False).sum()
Out[]: 0.3972947477593764

In []: sum(col16_nbytes) / bool_df.memory_usage(index=False).sum()
Out[]: 0.0993236869398441

Issue

Unfortunately, dtype conversion of sparse containers has not been implemented in pandas:

In []: sparse_df.astype('float16')
---------------------------------------------------
[...]/pandas/sparse/frame.py in astype(self, dtype)
    245
    246     def astype(self, dtype):
--> 247         raise NotImplementedError
    248
    249     def copy(self, deep=True):

NotImplementedError:

How can the SparseSeries in a SparseDataFrame be converted to use the numpy.float16 data type, or another dtype that uses fewer than 64 bits per item, instead of the default numpy.float64?

Answer 1:

The SparseArray constructor can be used to convert the dtype of the underlying ndarray. To convert all sparse series in a dataframe, iterate over the dataframe's columns, convert each series' array, and replace the series with the converted version.

import pandas as pd
import numpy as np


def convert_sparse_series_dtype(sparse_series, dtype):
    """Return a copy of sparse_series with its non-fill values cast to dtype."""
    dtype = np.dtype(dtype)
    if 'float' not in str(dtype):
        raise TypeError('Sparse containers only support float dtypes')

    # SparseSeries.values is a SparseArray; passing it back through the
    # SparseArray constructor with a dtype casts the underlying sp_values.
    sparse_array = sparse_series.values
    converted_sp_array = pd.SparseArray(sparse_array, dtype=dtype)

    # Rebuild the series, preserving the original index and name.
    return pd.SparseSeries(converted_sp_array,
                           index=sparse_series.index,
                           name=sparse_series.name)


def convert_sparse_columns_dtype(sparse_dataframe, dtype):
    """Convert every SparseSeries column of sparse_dataframe in place."""
    for col_name in sparse_dataframe:
        if isinstance(sparse_dataframe[col_name], pd.SparseSeries):
            sparse_dataframe.loc[:, col_name] = convert_sparse_series_dtype(
                sparse_dataframe[col_name], dtype
            )

This achieves the stated purpose of reducing the sparse dataframe's memory footprint:

In []: sparse_df.info()
<class 'pandas.sparse.frame.SparseDataFrame'>
RangeIndex: 19849 entries, 0 to 19848
Columns: 145 entries, topic.party_nl.p.pvda to topic.sub_cat_Reizen
dtypes: float64(145)
memory usage: 1.1 MB

In []: convert_sparse_columns_dtype(sparse_df, 'float16')

In []: sparse_df.info()
<class 'pandas.sparse.frame.SparseDataFrame'>
RangeIndex: 19849 entries, 0 to 19848
Columns: 145 entries, topic.party_nl.p.pvda to topic.sub_cat_Reizen
dtypes: float16(145)
memory usage: 279.2 KB

In []: bool_df.equals(sparse_df.to_dense().astype('bool'))
Out[]: True

It is, however, a somewhat lousy solution, because the converted dataframe behaves unpredictably when it interacts with other dataframes. For instance, when a converted sparse dataframe is concatenated with other dataframes, all of its contained series become dense. This is not the case for unconverted sparse dataframes, whose series remain sparse in the resulting dataframe.
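
A minimal sketch of that caveat, assuming the same pre-1.0 sparse API as above (the exact behaviour may vary across 0.x versions):

import pandas as pd

other_df = pd.DataFrame({'extra': range(len(sparse_df))})

# Concatenating the *unconverted* sparse dataframe keeps its columns sparse.
still_sparse = pd.concat([bool_df.to_sparse(fill_value=False), other_df], axis=1)
type(still_sparse['topic.sub_cat_Reizen'])  # SparseSeries

# After convert_sparse_columns_dtype(sparse_df, 'float16'), the same concat
# densifies the converted columns.
densified = pd.concat([sparse_df, other_df], axis=1)
type(densified['topic.sub_cat_Reizen'])     # plain (dense) Series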