Pandas DataFrame - Aggregate on column whose dtype is 'category'

Posted 2019-06-26 06:12

Question:

I work with large DataFrames with high memory usage, and I read that changing the dtype of columns with many repeated values to 'category' can save a large amount of memory.

I tried it, and memory usage indeed dropped by 25%, but then I ran into a performance slowdown that I could not understand.

I run a group-by aggregation on the 'category' dtype columns; before I changed the dtype it took about 1 second, and after the change it took about 1 minute.

This code demonstrates the performance degradation, here by a factor of 2:

import pandas as pd
import random

animals = ['Dog', 'Cat']
days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

columns_dict = {'animals': [],
                'days': []}

# Build one million rows of random (animal, day) pairs
for _ in range(1000000):
    columns_dict['animals'].append(random.choice(animals))
    columns_dict['days'].append(random.choice(days))

# df without 'category' dtype
df = pd.DataFrame(columns_dict)

df.info(memory_usage='deep') # will result in memory usage of 95.5 MB

%timeit -n100 df.groupby('days').agg({'animals': 'first'})
# will result in: 100 loops, best of 3: 54.2 ms per loop

# df with 'category' dtype
df2 = df.copy()
df2['animals'] = df2['animals'].astype('category')

df2.info(memory_usage='deep') # will result in memory usage of 50.7 MB

%timeit -n100 df2.groupby('days').agg({'animals': 'first'})
# will result in: 100 loops, best of 3: 111 ms per loop

What I'm trying to understand is the cause of this slowness, and whether there is a way to overcome it.

Thanks!

Answer 1:

I'm not certain where this slowdown is coming from, but one workaround is to store the category codes directly:

df3 = df.copy()
# Encode once, keeping the Categorical around as external metadata
animals = pd.Categorical(df['animals'])
# Store only the compact integer codes in the DataFrame
df3['animals'] = animals.codes
# Aggregate on the plain integer column, then map codes back to labels
df3.groupby('days').agg({'animals': 'first'}).apply(lambda code: animals.categories[code])

It's not the cleanest solution, because it requires external metadata, but it achieves both the memory efficiency and the computational speed you're looking for. It would be interesting to dig into what Pandas is doing internally that causes this slowdown for categoricals.
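
As a quick sanity check (a sketch assuming df, df3 and the animals Categorical from above are still in scope), decoding the aggregated codes element by element should reproduce the result computed on the plain string column:

# Result computed on the original string column
expected = df.groupby('days')['animals'].first()
# Result computed on the integer codes, then decoded back to labels
codes = df3.groupby('days')['animals'].first()
decoded = codes.map(lambda c: animals.categories[c])
assert decoded.equals(expected)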


Edit: I tracked down why this happens... as part of the first() aggregation, pandas calls np.asarray() on the column. For a categorical column, this ends up converting the column back to a plain non-categorical array, leading to unnecessary overhead. Fixing this would be a useful contribution to the pandas package!
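
A minimal sketch of that conversion, using a synthetic categorical Series (the data and size here are illustrative):

import numpy as np
import pandas as pd

s = pd.Series(['Dog', 'Cat'] * 500000, dtype='category')

# np.asarray() materializes the categorical back into a plain object
# array of labels, copying every element; this densification is the
# hidden cost paid inside the aggregation.
arr = np.asarray(s)
print(arr.dtype)  # object -- the compact integer-code encoding is gone

The codes-based workaround above sidesteps this: the aggregation only ever touches the small integer array, so no such conversion is triggered.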