The reason I want to use a smaller data type in the sparse pandas containers is to reduce memory usage. This matters when working with data that is originally boolean (e.g. from get_dummies) or of a small numeric dtype (e.g. int8), all of which are converted to float64 in sparse containers.
DataFrame creation
The provided example uses a modest 20k x 145 dataframe. In practice I'm working with dataframes in the order of 1e6 x 5e3.
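For reference, a synthetic stand-in with the same shape and a similar density can be built as follows (purely illustrative; the real data comes from elsewhere and has meaningful column names):

import numpy as np
import pandas as pd

# Hypothetical stand-in: a 19849 x 145 bool frame with ~5% True values
bool_df = pd.DataFrame(np.random.rand(19849, 145) < 0.05)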
In []: bool_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19849 entries, 0 to 19848
Columns: 145 entries, topic.party_nl.p.pvda to topic.sub_cat_Reizen
dtypes: bool(145)
memory usage: 2.7 MB
In []: bool_df.memory_usage(index=False).sum()
Out[]: 2878105
In []: bool_df.values.itemsize
Out[]: 1
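(This matches the expected dense footprint: 19849 rows × 145 columns × 1 byte per bool = 2,878,105 bytes.)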
A sparse version of this dataframe needs less memory, but is still much larger than needed, given the original dtype.
In []: sparse_df = bool_df.to_sparse(fill_value=False)
In []: sparse_df.info()
<class 'pandas.sparse.frame.SparseDataFrame'>
RangeIndex: 19849 entries, 0 to 19848
Columns: 145 entries, topic.party_nl.p.pvda to topic.sub_cat_Reizen
dtypes: float64(145)
memory usage: 1.1 MB
In []: sparse_df.memory_usage(index=False).sum()
Out[]: 1143456
In []: sparse_df.values.itemsize
Out[]: 8
Even though this data is fairly sparse, the dtype conversion from bool to float64 makes each non-fill value take up eight times as much space.
In []: sparse_df.memory_usage(index=False).describe()
Out[]:
count 145.000000
mean 7885.903448
std 17343.762402
min 8.000000
25% 640.000000
50% 1888.000000
75% 4440.000000
max 84688.000000
Given the sparsity of the data, one would hope for a more drastic reduction in memory size:
In []: sparse_df.density
Out[]: 0.04966184346992205
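In other words, the sparse frame stores 1,143,456 / 8 = 142,932 non-fill values out of 19849 × 145 = 2,878,105 cells (142,932 / 2,878,105 ≈ 0.0497, matching the density above). If those non-fill values were kept as 1-byte bools, the payload would be roughly 0.14 MB, ignoring sparse index overhead, instead of 1.1 MB.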
Memory footprint of underlying storage
The columns of a SparseDataFrame are SparseSeries, which use SparseArray as a wrapper for the underlying numpy.ndarray storage. The number of bytes used by the sparse dataframe can (also) be computed directly from these ndarrays:
In []: col64_nbytes = [
.....: sparse_df[col].values.sp_values.nbytes
.....: for col in sparse_df
.....: ]
In []: sum(col64_nbytes)
Out[]: 1143456
The ndarrays can be converted to use smaller floats, which allows one to calculate how much memory the dataframe would need when using e.g. float16s. This would result in a 4x smaller dataframe, as one might expect.
In []: col16_nbytes = [
.....: sparse_df[col].values.sp_values.astype('float16').nbytes
.....: for col in sparse_df
.....: ]
In []: sum(col16_nbytes)
Out[]: 285864
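(285,864 bytes is exactly a quarter of the 1,143,456 bytes above: the same 142,932 non-fill values at 2 bytes each instead of 8.)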
By using the more appropriate dtype, the memory usage can be reduced to 10% of the dense version, whereas the float64 sparse dataframe reduces to 40%. For my data, this could make the difference between needing 20 GB and 5 GB of available memory.
In []: sum(col64_nbytes) / bool_df.memory_usage(index=False).sum()
Out[]: 0.3972947477593764
In []: sum(col16_nbytes) / bool_df.memory_usage(index=False).sum()
Out[]: 0.0993236869398441
Issue
Unfortunately, dtype conversion of sparse containers has not been implemented in pandas:
In []: sparse_df.astype('float16')
---------------------------------------------------
[...]/pandas/sparse/frame.py in astype(self, dtype)
245
246 def astype(self, dtype):
--> 247 raise NotImplementedError
248
249 def copy(self, deep=True):
NotImplementedError:
How can the SparseSeries in a SparseDataFrame be converted to use the numpy.float16 data type, or another dtype that uses fewer than 64 bits per item, instead of the default numpy.float64?
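For illustration, the kind of conversion I'm after would look roughly like the sketch below, which rebuilds each column from downcast values. This is only a sketch, under the assumption that a fill value of 0.0 is acceptable for the float16 version; I suspect re-sparsifying simply converts back to float64, which is exactly what this question is about.

import numpy as np
import pandas as pd

# Hypothetical per-column rebuild; whether float16 survives to_sparse()
# and the SparseDataFrame constructor is the open question here.
small_df = pd.SparseDataFrame(
    {col: sparse_df[col].to_dense().astype(np.float16).to_sparse(fill_value=0.0)
     for col in sparse_df},
    default_fill_value=0.0,
)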