A Faster Way of Removing Unused Categories in Pandas

Posted 2020-04-12 08:42

Question:

I'm running some models in Python, with data subset on categories.

For memory usage, and preprocessing, all the categorical variables are stored as category data type.

For each level of a categorical variable in my 'group by' column, I am running a regression, where I need to reset all my categorical variables to those that are present in that subset.

I am currently doing this using .cat.remove_unused_categories(), which is taking nearly 50% of my total runtime. At the moment, the worst offender is my grouping column; the others are not taking as much time (I guess because they do not have as many levels to drop).

Here is a simplified example:

import itertools
import pandas as pd
#generate some fake data
alphabets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
keywords = [''.join(i) for i in itertools.product(alphabets, repeat = 2)]
z = pd.DataFrame({'x':keywords})

#convert to category datatype
z.x = z.x.astype('category')

#groupby
z = z.groupby('x')

#loop over groups
for i in z.groups:
    x = z.get_group(i)
    x.x = x.x.cat.remove_unused_categories()
    #run my fancy model here

On my laptop, this takes about 20 seconds. For this small example, we could convert to str and then back to category for a speed-up, but my real data has at least 300 lines per group.

Is it possible to speed up this loop? I have tried using x.x = x.x.cat.set_categories(i) which takes a similar time, and x.x.cat.categories = i, which asks for the same number of categories as I started with.
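For reference, the string round trip mentioned above can be sketched as follows. The series `s` and the subset `sub` are hypothetical stand-ins for the real column and one of its groups; this only illustrates the idea and says nothing about how it performs on larger groups.

```python
import pandas as pd

# Hypothetical categorical column with three levels
s = pd.Series(['aa', 'ab', 'ac', 'ab'], dtype='category')

# Subsetting keeps every original category, used or not
sub = s[s == 'ab']

# Converting to str and back rebuilds the categories from the values present
rebuilt = sub.astype(str).astype('category')
```

Here `sub.cat.categories` still lists all three levels, while `rebuilt.cat.categories` contains only the level that actually occurs in the subset.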

Answer 1:

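Your problem is that you are assigning z.get_group(i) to x directly. x is then a copy of a portion of z that pandas still treats as tied to z, and assigning to a column of such an object is what appears to make the loop slow. Taking an explicit copy cuts that tie. Your code will work fine with this change: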

for i in z.groups:
    x = z.get_group(i).copy() # will no longer be tied to z
    x.x = x.x.cat.remove_unused_categories()
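As a quick check of the change above, the following sketch rebuilds a smaller version of the question's data and confirms that each group keeps only its own category after the copy. The five-letter alphabet and the dictionary `levels` are assumptions made here to keep the run short; iterating over the groupby object instead of calling get_group is also just a convenience, not part of the original answer.

```python
import itertools
import string

import pandas as pd

# Smaller fake data than in the question (assumption: 5 letters, 25 keywords)
letters = list(string.ascii_lowercase[:5])
keywords = [''.join(p) for p in itertools.product(letters, repeat=2)]
z = pd.DataFrame({'x': keywords})
z['x'] = z['x'].astype('category')

# Hypothetical bookkeeping: group key -> categories left after the cleanup
levels = {}
for key, group in z.groupby('x', observed=True):
    x = group.copy()  # independent frame, no longer tied to z
    x['x'] = x['x'].cat.remove_unused_categories()
    levels[key] = list(x['x'].cat.categories)
    # run the model on x here
```

After the loop, every entry in `levels` holds a single category equal to its group key, which is the state the regression for that group needs.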