A Faster Way of Removing Unused Categories in Pand

I'm running some models in Python, with data subset on categories.

For memory usage, and preprocessing, all the categorical variables are stored as category data type.

For each level of a categorical variable in my 'group by' column, I am running a regression, where I need to reset all my categorical variables to those that are present in that subset.

I am currently doing this using .cat.remove_unused_categories(), which is taking nearly 50% of my total runtime. At the moment, the worst offender is my grouping column, others are not taking as much time (as I guess there are not as many levels to drop).

Here is a simplified example:

import itertools
import pandas as pd
#generate some fake data
alphabets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
keywords = [''.join(i) for i in itertools.product(alphabets, repeat = 2)]
z = pd.DataFrame({'x':keywords})

#convert to category datatype
z.x = z.x.astype('category')

#groupby
z = z.groupby('x')

#loop over groups
for i in z.groups:
    x = z.get_group(i)
    x.x = x.x.cat.remove_unused_categories()
    #run my fancy model here

On my laptop, this takes about 20 seconds. for this small example, we could convert to str, then back to category for a speed up, but my real data has at least 300 lines per group.

Is it possible to speed up this loop? I have tried using x.x = x.x.cat.set_categories(i) which takes a similar time, and x.x.cat.categories = i, which asks for the same number of categories as I started with.

Your problem is in that you are assigning z.get_group(i) to x. x is now a copy of a portion of z. Your code will work fine with this change

for i in z.groups:
    x = z.get_group(i).copy() # will no longer be tied to z
    x.x = x.x.cat.remove_unused_categories()

A Faster Way of Removing Unused Categories in Pand

问题:

回答1:

收藏的人(0)

A Faster Way of Removing Unused Categories in Pand

问题:

回答1:

收藏的人(0)

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮