A Faster Way of Removing Unused Categories in Pand

2020-04-12 08:16发布

I'm running some models in Python, with data subset on categories.

For memory usage, and preprocessing, all the categorical variables are stored as category data type.

For each level of a categorical variable in my 'group by' column, I am running a regression, where I need to reset all my categorical variables to those that are present in that subset.

I am currently doing this using .cat.remove_unused_categories(), which is taking nearly 50% of my total runtime. At the moment, the worst offender is my grouping column, others are not taking as much time (as I guess there are not as many levels to drop).

Here is a simplified example:

import itertools
import pandas as pd
#generate some fake data
alphabets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
keywords = [''.join(i) for i in itertools.product(alphabets, repeat = 2)]
z = pd.DataFrame({'x':keywords})

#convert to category datatype
z.x = z.x.astype('category')

#groupby
z = z.groupby('x')

#loop over groups
for i in z.groups:
    x = z.get_group(i)
    x.x = x.x.cat.remove_unused_categories()
    #run my fancy model here

On my laptop, this takes about 20 seconds. for this small example, we could convert to str, then back to category for a speed up, but my real data has at least 300 lines per group.

Is it possible to speed up this loop? I have tried using x.x = x.x.cat.set_categories(i) which takes a similar time, and x.x.cat.categories = i, which asks for the same number of categories as I started with.

1条回答
聊天终结者
2楼-- · 2020-04-12 09:18

Your problem is in that you are assigning z.get_group(i) to x. x is now a copy of a portion of z. Your code will work fine with this change

for i in z.groups:
    x = z.get_group(i).copy() # will no longer be tied to z
    x.x = x.x.cat.remove_unused_categories()
查看更多
登录 后发表回答