Let's say I have categories 1 to 10, and I want to assign red to values 3 to 5, green to 1, 6, and 7, and blue to 2, 8, 9, and 10.
How would I do this? If I try
df.cat.rename_categories(['red','green','blue'])
I get an error: ValueError: new categories need to have the same number of items than the old categories!
but if I put this in

df.cat.rename_categories(['green', 'blue', 'red', 'red', 'red', 'green', 'green', 'blue', 'blue', 'blue'])

I'll get an error saying that there are duplicate values.
The only other method I can think of is to write a for loop that goes through a dictionary of the values and replaces them. Is there a more elegant way of resolving this?
It seems pandas.explode, released with pandas 0.25.0 (July 18, 2019), would fit right in here and hence avoid any looping. The result is a pandas Series that has all the required mappings as values:index (old category value as the data, colour as the index).
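A minimal sketch of that step, assuming a colour-to-values layout taken from the question:

```python
import pandas as pd

# one entry per colour, listing the old category values it should cover
colour_to_values = pd.Series({'red': [3, 4, 5],
                              'green': [1, 6, 7],
                              'blue': [2, 8, 9, 10]})

# explode() (pandas >= 0.25.0) gives every list element its own row,
# repeating the colour in the index, i.e. a values:index mapping
s = colour_to_values.explode()
print(s)
# red       3
# red       4
# red       5
# green     1
# green     6
# green     7
# blue      2
# blue      8
# blue      9
# blue     10
# dtype: object
```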
Now, depending on requirements, we might use it directly or, if it is needed in a different format such as a dict or a Series, swap the index and values. Let's explore those too.

1) Output as dict:
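For instance, continuing the sketch above:

```python
# swap values and index to get {old category value: colour}
mapping = dict(zip(s, s.index))
# {3: 'red', 4: 'red', 5: 'red', 1: 'green', 6: 'green', 7: 'green',
#  2: 'blue', 8: 'blue', 9: 'blue', 10: 'blue'}
```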
2) Output as series:
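Again a sketch, building on the same s:

```python
# the same mapping, but as a Series indexed by the old category value
mapping_series = pd.Series(s.index, index=s.values)
# 3       red
# 4       red
# 5       red
# 1     green
# 6     green
# 7     green
# 2      blue
# 8      blue
# 9      blue
# 10     blue
# dtype: object
```

Either the dict or the Series can then be passed straight to Series.map to build the colour column.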
It can be done this way:
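One straightforward possibility is a plain lookup table plus map (a sketch; names are illustrative and the original answer's code may have differed):

```python
import pandas as pd

# plain lookup table: old category value -> new category
lookup = {1: 'green', 2: 'blue', 3: 'red', 4: 'red', 5: 'red',
          6: 'green', 7: 'green', 8: 'blue', 9: 'blue', 10: 'blue'}

df = pd.DataFrame({'cat': pd.Categorical(range(1, 11))})
df['colour'] = df['cat'].map(lookup).astype('category')
```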
Not sure about elegance, but if you make a dict of the old to new categories, something like this (note the added 'purple'):
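A sketch of such a dict (the names m and m2, and the unused value 11 for 'purple', are illustrative assumptions):

```python
# colour -> old category values; 'purple' only covers 11, which never occurs
# in the data, so it won't show up unless the categories are listed explicitly
m = {'red': [3, 4, 5], 'green': [1, 6, 7], 'blue': [2, 8, 9, 10], 'purple': [11]}

# invert it with a dict comprehension: old category value -> new category
m2 = {v: k for k, vv in m.items() for v in vv}
```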
You can use this to build a new categorical Series:
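Roughly, assuming the categorical column is df['cat'] as above:

```python
# map each old value through m2, then declare the full category set explicitly
df['colour'] = pd.Categorical(df['cat'].map(m2), categories=set(m2.values()))
```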
You don't need the categories=set(m2.values()) (or an ordered equivalent, if you care about the categorical ordering) if you're sure that all categorical values will be seen in the column. But here, if we didn't do that, we wouldn't have seen purple in the resulting Categorical, because it would have been built only from the categories it actually saw. Of course, if you already have your list ['green','blue','red', ...] built, it's equally easy to use it to make a new categorical column directly and bypass this mapping entirely.

OK, this is slightly simpler; hopefully it will stimulate further conversation.
OP's example input:
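A sketch of that setup (the column name numbers matches the references below; the exact original code may differ):

```python
import pandas as pd

df = pd.DataFrame({'numbers': pd.Categorical(range(1, 11))})

# the rename attempt from the question, with one colour name per old category
df['numbers'].cat.rename_categories(
    ['green', 'blue', 'red', 'red', 'red', 'green', 'green', 'blue', 'blue', 'blue'])
```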
This yields ValueError: Categorical categories must be unique, as OP states.

My solution:
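A sketch of it (remap_cat_dict and df.numbers are the names referenced in the edit below):

```python
# explicit 1:1 mapping of every old category to its new category
remap_cat_dict = {1: 'green', 2: 'blue', 3: 'red', 4: 'red', 5: 'red',
                  6: 'green', 7: 'green', 8: 'blue', 9: 'blue', 10: 'blue'}

# substitute each value, convert back to category, overwrite the column
df['numbers'] = df['numbers'].apply(lambda x: remap_cat_dict[x]).astype('category')
```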
This forces you to write out a complete dict with a 1:1 mapping of old categories to new, but it is very readable. The conversion is then pretty straightforward: use .apply on the Series (which operates element-wise) to substitute each value with the corresponding entry from remap_cat_dict, then convert the result to category and overwrite the column.
I encountered almost this exact problem where I wanted to create a new column with fewer categories, converted over from an old column, which works just as easily here (and beneficially doesn't involve overwriting the current column):
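For example (a sketch; the new column name colors is illustrative):

```python
# write the remapped values to a brand-new column instead of overwriting
# (assuming df['numbers'] still holds the original 1-10 values)
df['colors'] = df['numbers'].apply(lambda x: remap_cat_dict[x]).astype('category')
```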
EDIT 5/2/20: Further simplified df.numbers.apply(lambda x: remap_cat_dict[x]) to df.numbers.map(remap_cat_dict) (thanks @JohnE).

I certainly don't see an issue with @DSM's original answer here, but that dictionary comprehension might not be the easiest thing to read for some (although it is a fairly standard approach in Python).
If you don't want to use a dictionary comprehension but are willing to use numpy, then I would suggest np.select, which is roughly as concise as @DSM's answer but perhaps a little more straightforward to read, like @vector07's answer.
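A sketch of the np.select version (assuming a numbers column holding the original values 1-10, as in the example input above):

```python
import numpy as np

# one boolean condition per colour, paired with the colour to assign
conditions = [df['numbers'].isin([3, 4, 5]),
              df['numbers'].isin([1, 6, 7]),
              df['numbers'].isin([2, 8, 9, 10])]
choices = ['red', 'green', 'blue']

df['color'] = np.select(conditions, choices)
print(df['color'].tolist())
# ['green', 'blue', 'red', 'red', 'red', 'green', 'green', 'blue', 'blue', 'blue']
```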
The output is a string or object column, but of course you can easily convert it to a category with astype('category').

It's basically the same thing, but you could also do this with np.where:
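A sketch of the nested np.where version, one condition per line ('other' is just a fallback default):

```python
df['color'] = np.where(df['numbers'].isin([3, 4, 5]), 'red',
              np.where(df['numbers'].isin([1, 6, 7]), 'green',
              np.where(df['numbers'].isin([2, 8, 9, 10]), 'blue', 'other')))
```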
That's not going to be as efficient as np.select, which is probably the most efficient way to do this (although I didn't time it), but it is arguably more readable in that you can put each key/value pair on the same line.