Question:
Let's say I have categories, 1 to 10, and I want to assign red to values 3 to 5, green to 1, 6, and 7, and blue to 2, 8, 9, and 10.
How would I do this? If I try
df.cat.rename_categories(['red','green','blue'])
I get an error: ValueError: new categories need to have the same number of items than the old categories!
but if I put this in
df.cat.rename_categories(['green', 'blue', 'red', 'red', 'red',
                          'green', 'green', 'blue', 'blue', 'blue'])
I'll get an error saying that there are duplicate values.
The only other method I can think of is to write a for loop that goes through a dictionary of the values and replaces them. Is there a more elegant way of resolving this?
Answer 1:
Not sure about elegance, but if you make a dict of the old to new categories, something like (note the added 'purple'):
>>> m = {"red": [3,4,5], "green": [1,6,7], "blue": [2,8,9,10], "purple": [11]}
>>> m2 = {v: k for k,vv in m.items() for v in vv}
>>> m2
{1: 'green', 2: 'blue', 3: 'red', 4: 'red', 5: 'red', 6: 'green',
7: 'green', 8: 'blue', 9: 'blue', 10: 'blue', 11: 'purple'}
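You can use this to build a new categorical Series. The original answer wrote astype("category", categories=...), a form that was deprecated in pandas 0.21 and later removed; current pandas takes the categories through pd.CategoricalDtype:
>>> df.cat.map(m2).astype(pd.CategoricalDtype(categories=list(set(m2.values()))))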
0 green
1 blue
2 red
3 red
4 red
5 green
6 green
7 blue
8 blue
9 blue
Name: cat, dtype: category
Categories (4, object): [green, purple, red, blue]
You don't need the categories=set(m2.values()) argument (or an ordered equivalent if you care about the categorical ordering) if you're sure that all categorical values will be seen in the column. But here, if we didn't do that, we wouldn't have seen purple in the resulting Categorical, because it was building it from the categories it actually saw.
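The point about unseen categories can be sketched as follows. This assumes a recent pandas, where the full category list is passed through pd.CategoricalDtype. The numbers Series and the sorted category order are choices made for this illustration and are not part of the original answer.

```python
import pandas as pd

# Same mapping as above: new label -> old values ('purple' never occurs in the data).
m = {"red": [3, 4, 5], "green": [1, 6, 7], "blue": [2, 8, 9, 10], "purple": [11]}
m2 = {v: k for k, vv in m.items() for v in vv}

# Hypothetical example column holding the old values 1..10.
numbers = pd.Series(range(1, 11), name="numbers")

# Sorting gives a deterministic category order (a set has no defined order).
all_labels = sorted(set(m2.values()))
dtype = pd.CategoricalDtype(categories=all_labels)

with_all = numbers.map(m2).astype(dtype)        # keeps 'purple' as an unused category
seen_only = numbers.map(m2).astype("category")  # only the labels actually present

print(list(with_all.cat.categories))   # ['blue', 'green', 'purple', 'red']
print(list(seen_only.cat.categories))  # ['blue', 'green', 'red']
```

Both results hold the same values. They differ only in whether the unused purple category is registered.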
Of course, if you already have your list ['green','blue','red', etc.] built, it's equally easy just to use it to make a new categorical column directly and bypass this mapping entirely.
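A small sketch of that direct route, assuming the list is already aligned with the rows (as it is in the ten-row example from the question). The names labels and colors are made up for illustration.

```python
import pandas as pd

# One label per row, in row order (assumption: the list matches the data row by row).
labels = ["green", "blue", "red", "red", "red",
          "green", "green", "blue", "blue", "blue"]

# Build the categorical column straight from the labels; no mapping dict needed.
colors = pd.Series(pd.Categorical(labels), name="colors")

print(colors.cat.categories.tolist())  # ['blue', 'green', 'red']
```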
Answer 2:
It seems Series.explode, released with pandas 0.25.0 (July 18, 2019), would fit right in here and avoid any explicit looping:
# Mapping dict
In [150]: m = {"red": [3,4,5], "green": [1,6,7], "blue": [2,8,9,10]}
In [151]: pd.Series(m).explode().sort_values()
Out[151]:
green 1
blue 2
red 3
red 4
red 5
green 6
green 7
blue 8
blue 9
blue 10
dtype: object
So, the result is a pandas Series that holds all the required mappings, with the new labels as the index and the old values as the values. Depending on requirements, we might use it directly or, if it is needed in a different format such as a dict or a Series, swap the index and values. Let's explore those too.
# Mapping obtained
In [152]: s = pd.Series(m).explode().sort_values()
1) Output as dict :
In [153]: dict(zip(s.values, s.index))
Out[153]:
{1: 'green',
2: 'blue',
3: 'red',
4: 'red',
5: 'red',
6: 'green',
7: 'green',
8: 'blue',
9: 'blue',
10: 'blue'}
2) Output as series :
In [154]: pd.Series(s.index, s.values)
Out[154]:
1 green
2 blue
3 red
4 red
5 red
6 green
7 green
8 blue
9 blue
10 blue
dtype: object
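To connect this back to the question, here is a sketch of applying the exploded mapping to a column. The df frame and the numbers and colors column names are assumptions for illustration. The cast to int is there because explode() returns object dtype.

```python
import pandas as pd

m = {"red": [3, 4, 5], "green": [1, 6, 7], "blue": [2, 8, 9, 10]}

# One row per (colour, number) pair; cast to int because explode() yields object dtype.
s = pd.Series(m).explode().astype(int)

# Swap index and values to get a number -> colour lookup Series.
lookup = pd.Series(s.index.to_numpy(), index=s.to_numpy())

# Hypothetical data frame holding the old categories 1..10.
df = pd.DataFrame({"numbers": range(1, 11)})
df["colors"] = df["numbers"].map(lookup).astype("category")

print(df["colors"].tolist())
```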
Answer 3:
I certainly don't see an issue with @DSM's original answer here, but that dictionary comprehension might not be the easiest thing for some to read (although it is a fairly standard approach in Python).
If you don't want to use a dictionary comprehension but are willing to use numpy, then I would suggest np.select, which is roughly as concise as @DSM's answer but perhaps a little more straightforward to read, like @vector07's answer.
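import numpy as np
# one boolean mask per target colour, in the same order as the labels below
number = [ df.numbers.isin([3,4,5]),
           df.numbers.isin([1,6,7]),
           df.numbers.isin([2,8,9,10]),
           df.numbers.isin([11]) ]
color = [ "red", "green", "blue", "purple" ]
# a string default avoids mixing the string labels with np.select's default of 0,
# which recent NumPy versions can reject with a TypeError
df.numbers = np.select( number, color, default="" )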
Output (note this is a string or object column, but of course you can easily convert it to a category with astype('category')):
0 green
1 blue
2 red
3 red
4 red
5 green
6 green
7 blue
8 blue
9 blue
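For a version that runs on its own, the following sketch builds a small frame first and converts the result to a category at the end. The frame, the colors column name, and the "unknown" default label are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the old categories 1..10.
df = pd.DataFrame({"numbers": range(1, 11)})

# One boolean mask per target label, in matching order.
conditions = [
    df["numbers"].isin([3, 4, 5]),
    df["numbers"].isin([1, 6, 7]),
    df["numbers"].isin([2, 8, 9, 10]),
]
choices = ["red", "green", "blue"]

# Rows matching no condition get the default; a string default keeps the dtypes consistent.
labels = np.select(conditions, choices, default="unknown")
df["colors"] = pd.Series(labels, index=df.index).astype("category")

print(df["colors"].cat.categories.tolist())  # ['blue', 'green', 'red']
```

Writing to a new column rather than overwriting numbers keeps the original values available for checking the result.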
It's basically the same thing, but you could also do this with np.where:
df['numbers2'] = ''
df.numbers2 = np.where( df.numbers.isin([3,4,5]), "red", df.numbers2 )
df.numbers2 = np.where( df.numbers.isin([1,6,7]), "green", df.numbers2 )
df.numbers2 = np.where( df.numbers.isin([2,8,9,10]), "blue", df.numbers2 )
df.numbers2 = np.where( df.numbers.isin([11]), "purple", df.numbers2 )
That's not going to be as efficient as np.select, which is probably the most efficient way to do this (although I didn't time it), but it is arguably more readable in that you can put each key/value pair on the same line.
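Since the efficiency claim was not timed, here is one way it could be checked. This is only a sketch: the random test column, the groups dict, and the helper names with_select and with_where are all made up for illustration, and the timings will vary by machine and library version, so no numbers are quoted.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger column (values 1..11) so timing differences are visible.
rng = np.random.default_rng(0)
numbers = pd.Series(rng.integers(1, 12, size=100_000))

groups = {"red": [3, 4, 5], "green": [1, 6, 7], "blue": [2, 8, 9, 10], "purple": [11]}


def with_select(values):
    # Build all masks first, then pick the labels in a single np.select call.
    conditions = [values.isin(members) for members in groups.values()]
    return np.select(conditions, list(groups), default="")


def with_where(values):
    # Apply one np.where per label, carrying the previous result forward.
    result = np.full(len(values), "", dtype=object)
    for color, members in groups.items():
        result = np.where(values.isin(members), color, result)
    return result


print("np.select:", timeit.timeit(lambda: with_select(numbers), number=5))
print("np.where: ", timeit.timeit(lambda: with_where(numbers), number=5))
```

Both helpers produce the same labels here because the groups do not overlap; with overlapping groups, np.select keeps the first match while the chained np.where keeps the last.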
Answer 4:
OK, this is slightly simpler and will hopefully stimulate further conversation.
OP's example input:
>>> my_data = {'numbers': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
>>> df = pd.DataFrame(data=my_data)
>>> df.numbers = df.numbers.astype('category')
>>> df.numbers.cat.rename_categories(['green', 'blue', 'red', 'red', 'red',
...                                   'green', 'green', 'blue', 'blue', 'blue'])
This yields ValueError: Categorical categories must be unique, as OP states.
My solution:
# write out a dict with the mapping of old to new
>>> remap_cat_dict = {
1: 'green',
2: 'blue',
3: 'red',
4: 'red',
5: 'red',
6: 'green',
7: 'green',
8: 'blue',
9: 'blue',
10: 'blue' }
>>> df.numbers = df.numbers.map(remap_cat_dict).astype('category')
>>> df.numbers
0 green
1 blue
2 red
3 red
4 red
5 green
6 green
7 blue
8 blue
9 blue
Name: numbers, dtype: category
Categories (3, object): [blue, green, red]
This forces you to write out a complete dict with a 1:1 mapping of old categories to new, but it is very readable. The conversion is then pretty straightforward: use Series.map to take each value and substitute it with the appropriate result from remap_cat_dict, then convert the result to category and overwrite the column.
I encountered almost this exact problem, where I wanted to create a new column with fewer categories converted over from an old column. The same approach works just as easily there (and has the benefit of not overwriting an existing column):
>>> df['colors'] = df.numbers.map(remap_cat_dict).astype('category')
>>> print(df)
   numbers colors
0        1  green
1        2   blue
2        3    red
3        4    red
4        5    red
5        6  green
6        7  green
7        8   blue
8        9   blue
9       10   blue
>>> df.colors
0 green
1 blue
2 red
3 red
4 red
5 green
6 green
7 blue
8 blue
9 blue
Name: colors, dtype: category
Categories (3, object): [blue, green, red]
EDIT 5/2/20: Further simplified df.numbers.apply(lambda x: remap_cat_dict[x]) to df.numbers.map(remap_cat_dict) (thanks @JohnE)
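One behavioural difference between the two spellings is worth knowing: with a plain dict, map leaves unmapped values as NaN, while the lambda lookup raises KeyError. A small sketch, using a deliberately incomplete dict and a made-up value 99:

```python
import pandas as pd

remap_cat_dict = {1: "green", 2: "blue", 3: "red"}
numbers = pd.Series([1, 2, 3, 99])  # 99 is deliberately missing from the dict

# map: a missing key silently becomes NaN.
mapped = numbers.map(remap_cat_dict)
print(mapped.isna().tolist())  # [False, False, False, True]

# apply with an explicit dict lookup: a missing key raises KeyError.
raised = False
try:
    numbers.apply(lambda x: remap_cat_dict[x])
except KeyError:
    raised = True
print(raised)  # True
```

If silent NaN values would hide a mistake in the mapping, checking mapped.isna().any() after the map is a cheap safeguard.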
Answer 5:
It can be done this way:
import pandas as pd
df = pd.DataFrame(range(1, 11), columns=['colors'])
color2cod = {"red": [3,4,5], "green": [1,6,7], "blue": [2,8,9,10]}
cod2color = {cod: k for k, cods in color2cod.items() for cod in cods }
df['m'] = df.colors.map(cod2color.get)
df.m = df.m.astype('category')
print('---')
print(df.m.cat.categories)
print('---')
df.info()  # info() prints its report itself and returns None, so no print() is needed