pandas: Combining Multiple Categories into One

Posted 2020-08-25 05:56

Question:

Let's say I have categories 1 to 10, and I want to assign red to values 3 to 5, green to 1, 6, and 7, and blue to 2, 8, 9, and 10.

How would I do this? If I try

df.cat.rename_categories(['red','green','blue'])

I get an error: ValueError: new categories need to have the same number of items than the old categories! But if I put this in

df.cat.rename_categories(['green','blue','red', 'red', 'red',
                        'green', 'green', 'blue', 'blue', 'blue'])

I'll get an error saying that there are duplicate values.

The only other method I can think of is to write a for loop that goes through a dictionary of the values and replaces them. Is there a more elegant way of resolving this?

Answer 1:

Not sure about elegance, but if you make a dict of the old to new categories, something like (note the added 'purple'):

>>> m = {"red": [3,4,5], "green": [1,6,7], "blue": [2,8,9,10], "purple": [11]}
>>> m2 = {v: k for k,vv in m.items() for v in vv}
>>> m2
{1: 'green', 2: 'blue', 3: 'red', 4: 'red', 5: 'red', 6: 'green', 
 7: 'green', 8: 'blue', 9: 'blue', 10: 'blue', 11: 'purple'}

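You can use this to build a new categorical Series. (Recent pandas versions no longer accept a categories keyword in astype, so the categories are passed through pd.CategoricalDtype here.)

>>> df.cat.map(m2).astype(pd.CategoricalDtype(categories=list(set(m2.values()))))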
0    green
1     blue
2      red
3      red
4      red
5    green
6    green
7     blue
8     blue
9     blue
Name: cat, dtype: category
Categories (4, object): [green, purple, red, blue]

You don't need to pass the categories explicitly (as set(m2.values()), or an ordered equivalent if you care about the categorical ordering) if you're sure that every category will appear in the column. Here, though, leaving them out would drop purple from the resulting Categorical, because the categories would be inferred only from the values actually present.
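
As a self-contained sketch of the point above, assuming pandas 0.25 or later and a hypothetical column named cat holding the values 1 to 10 (the category order used here is only an illustrative choice):

```python
import pandas as pd

# Hypothetical input: a categorical column "cat" with the values 1..10.
df = pd.DataFrame({"cat": list(range(1, 11))})
df["cat"] = df["cat"].astype("category")

# Color -> old values, inverted into old value -> color.
m = {"red": [3, 4, 5], "green": [1, 6, 7], "blue": [2, 8, 9, 10], "purple": [11]}
m2 = {v: k for k, vv in m.items() for v in vv}

# Listing the categories explicitly keeps "purple" although no row uses it.
color_dtype = pd.CategoricalDtype(categories=list(m))
colors = df["cat"].map(m2).astype(color_dtype)

# Without an explicit dtype, only the categories actually seen are kept.
inferred = df["cat"].map(m2).astype("category")

print(list(colors.cat.categories))
print(list(inferred.cat.categories))
```

The first print should include purple and the second should not, which is the difference the explicit categories make.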

Of course if you already have your list ['green','blue','red', etc.] built it's equally easy just to use it to make a new categorical column directly and bypass this mapping entirely.
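
A minimal sketch of that shortcut, assuming the per-row list of labels already exists (the labels variable and the category order are hypothetical):

```python
import pandas as pd

# Hypothetical pre-built list of colors, one per row (for values 1..10 in order).
labels = ["green", "blue", "red", "red", "red",
          "green", "green", "blue", "blue", "blue"]

# Duplicates are fine here: these are row values, not category names.
colors = pd.Series(pd.Categorical(labels, categories=["red", "green", "blue"]))

print(colors.cat.categories.tolist())
print(colors.cat.codes.tolist())
```

Unlike rename_categories, this builds a new categorical from the values, so repeated labels do not trigger the uniqueness error.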

Answer 2:

Series.explode, released with pandas 0.25.0 (July 18, 2019), fits well here and avoids any explicit looping:

# Mapping dict
In [150]: m = {"red": [3,4,5], "green": [1,6,7], "blue": [2,8,9,10]}

In [151]: pd.Series(m).explode().sort_values()
Out[151]: 
green     1
blue      2
red       3
red       4
red       5
green     6
green     7
blue      8
blue      9
blue     10
dtype: object

The result is a pandas Series holding every required mapping, with the colors in the index and the original values as the values. Depending on what is needed, we can use it directly or swap index and values to get the mapping as a dict or as a Series. Let's look at both.

# Mapping obtained
In [152]: s = pd.Series(m).explode().sort_values()

1) Output as dict :

In [153]: dict(zip(s.values, s.index))
Out[153]: 
{1: 'green',
 2: 'blue',
 3: 'red',
 4: 'red',
 5: 'red',
 6: 'green',
 7: 'green',
 8: 'blue',
 9: 'blue',
 10: 'blue'}

2) Output as series :

In [154]: pd.Series(s.index, s.values)
Out[154]: 
1     green
2      blue
3       red
4       red
5       red
6     green
7     green
8      blue
9      blue
10     blue
dtype: object
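
To connect this back to the question, here is a hedged sketch of applying the swapped Series to a column. The numbers and colors column names are assumptions, and the astype(int) cast is an extra step added so the lookup index has an integer dtype:

```python
import pandas as pd

m = {"red": [3, 4, 5], "green": [1, 6, 7], "blue": [2, 8, 9, 10]}

# One row per (color, value) pair; cast the exploded object values to int.
s = pd.Series(m).explode().astype(int)

# Swap index and values to get a value -> color lookup Series.
lookup = pd.Series(s.index, index=s.values)

# Hypothetical column to recode.
df = pd.DataFrame({"numbers": list(range(1, 11))})
df["colors"] = df["numbers"].map(lookup).astype("category")

print(df)
```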

Answer 3:

I certainly don't see an issue with @DSM's original answer here, but that dictionary comprehension might not be the easiest thing to read for some (although it is a fairly standard approach in Python).

If you don't want to use a dictionary comprehension but are willing to use numpy, then I would suggest np.select, which is roughly as concise as @DSM's answer but perhaps a little more straightforward to read, like @vector07's answer.

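import numpy as np

# One boolean mask per color, in the same order as the color list below
number = [ df.numbers.isin([3,4,5]), 
           df.numbers.isin([1,6,7]), 
           df.numbers.isin([2,8,9,10]),
           df.numbers.isin([11]) ]

color  = [ "red", "green", "blue", "purple" ]

# default="" keeps the result a string array for rows that match no mask
df.numbers = np.select( number, color, default="" )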

Output (note this is a string or object column, but of course you can easily convert it to a category with astype('category')):

0    green
1     blue
2      red
3      red
4      red
5    green
6    green
7     blue
8     blue
9     blue

It's basically the same thing, but you could also do this with np.where:

df['numbers2'] = ''
df.numbers2 = np.where( df.numbers.isin([3,4,5]),    "red",    df.numbers2 ) 
df.numbers2 = np.where( df.numbers.isin([1,6,7]),    "green",  df.numbers2 )
df.numbers2 = np.where( df.numbers.isin([2,8,9,10]), "blue",   df.numbers2 )
df.numbers2 = np.where( df.numbers.isin([11]),       "purple", df.numbers2 )

That's probably not as efficient as np.select, which is likely the most efficient way to do this (although I didn't time it), but it is arguably more readable in that each key/value pair sits on its own line.
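
As a rough, self-contained check that the two variants agree, here is a sketch on a hypothetical ten-row frame (the via_select and via_where names are illustrative, and purple is left out for brevity):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"numbers": list(range(1, 11))})

groups = {"red": [3, 4, 5], "green": [1, 6, 7], "blue": [2, 8, 9, 10]}

# np.select: all conditions and choices evaluated in a single call.
conditions = [df["numbers"].isin(vals) for vals in groups.values()]
via_select = np.select(conditions, list(groups.keys()), default="")

# np.where: one pass per color, carrying the previous result forward.
via_where = np.full(len(df), "", dtype=object)
for color, vals in groups.items():
    via_where = np.where(df["numbers"].isin(vals), color, via_where)

# Store the result as a categorical column.
df["colors"] = pd.Series(via_select, index=df.index).astype("category")
print(df)
```

Both arrays should hold the same labels. The practical difference is that np.select makes one call while the np.where chain makes one pass per color.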

Answer 4:

OK, this is slightly simpler, and will hopefully stimulate further conversation.

OP's example input:

>>> import pandas as pd
>>> my_data = {'numbers': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
>>> df = pd.DataFrame(data=my_data)
>>> df.numbers = df.numbers.astype('category')
>>> df.numbers.cat.rename_categories(['green','blue','red', 'red', 'red',
...                         'green', 'green', 'blue', 'blue', 'blue'])

This yields ValueError: Categorical categories must be unique, as OP states.

My solution:

>>> # write out a dict with the mapping of old to new
>>> remap_cat_dict = {
...     1: 'green',
...     2: 'blue',
...     3: 'red',
...     4: 'red',
...     5: 'red',
...     6: 'green',
...     7: 'green',
...     8: 'blue',
...     9: 'blue',
...     10: 'blue' }

>>> df.numbers = df.numbers.map(remap_cat_dict).astype('category')
>>> df.numbers
0    green
1     blue
2      red
3      red
4      red
5    green
6    green
7     blue
8     blue
9     blue
Name: numbers, dtype: category
Categories (3, object): [blue, green, red]

This forces you to write out a complete dict with a 1:1 mapping of old categories to new, but it is very readable. The conversion is then straightforward: Series.map takes each value and substitutes the corresponding entry from remap_cat_dict. The result is converted to a category and overwrites the column.

I encountered almost this exact problem, where I wanted to create a new column with fewer categories converted from an old column. The same approach works just as easily (starting from the original df, before numbers was overwritten) and has the benefit of not overwriting an existing column:

>>> df['colors'] = df.numbers.map(remap_cat_dict).astype('category')
>>> print(df)
  numbers colors
0       1  green
1       2   blue
2       3    red
3       4    red
4       5    red
5       6  green
6       7  green
7       8   blue
8       9   blue
9      10   blue

>>> df.colors

0    green
1     blue
2      red
3      red
4      red
5    green
6    green
7     blue
8     blue
9     blue
Name: colors, dtype: category
Categories (3, object): [blue, green, red]

EDIT 5/2/20: Simplified further by replacing df.numbers.apply(lambda x: remap_cat_dict[x]) with df.numbers.map(remap_cat_dict) (thanks @JohnE)
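
One behavioral difference between the two spellings is worth keeping in mind. In this sketch the value 11 is deliberately left out of a shortened, hypothetical dict: map yields NaN for it, while indexing the dict inside apply raises KeyError.

```python
import pandas as pd

remap_cat_dict = {1: "green", 2: "blue", 3: "red"}  # shortened for illustration
numbers = pd.Series([1, 2, 3, 11])  # 11 is deliberately absent from the dict

# map: a key missing from the dict becomes NaN.
mapped = numbers.map(remap_cat_dict)

# apply with direct dict indexing: a missing key raises KeyError.
try:
    numbers.apply(lambda x: remap_cat_dict[x])
    apply_raised = False
except KeyError:
    apply_raised = True

print(mapped.isna().tolist(), apply_raised)
```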

Answer 5:

It can also be done this way:

import pandas as pd
df = pd.DataFrame(range(1, 11), columns=['colors'])
color2cod = {"red": [3,4,5], "green": [1,6,7], "blue": [2,8,9,10]}
cod2color = {cod: k for k, cods in color2cod.items() for cod in cods }

df['m'] = df.colors.map(cod2color.get)
df.m = df.m.astype('category')
print('---')
print(df.m.cat.categories)
print('---')
df.info()  # info() prints its report itself and returns None