When using groupby(), how can I create a DataFrame with a new column containing an index of the group number, similar to dplyr::group_indices in R? For example, if I have
>>> df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 1, 2, 1, 1, 2]})
>>> df
   a  b
0  1  1
1  1  1
2  1  2
3  2  1
4  2  1
5  2  2
How can I get a DataFrame like

   a  b  idx
0  1  1    1
1  1  1    1
2  1  2    2
3  2  1    3
4  2  1    3
5  2  2    4

(the order of the idx indexes doesn't matter)?
A simple way to do that would be to concatenate your grouping columns (so that each combination of their values represents a distinct element), then convert the result to a pandas Categorical and keep only its codes:
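A sketch of that approach on the example frame (the underscore separator and column names are illustrative; the separator keeps combinations like (1, 11) and (11, 1) from colliding when concatenated):

>>> df['idx'] = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes
>>> df
   a  b  idx
0  1  1    0
1  1  1    0
2  1  2    1
3  2  1    2
4  2  1    2
5  2  2    3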
Edit: changed the labels property to codes, as the former seems to be deprecated.
Edit 2: added a separator, as suggested by Authman Apatira.
Here's a concise way using drop_duplicates and merge to get a unique identifier. The identifier in this case goes 0, 2, 3, 5 (just a residual of the original index), but this could easily be changed to 0, 1, 2, 3 with an additional reset_index(drop=True).
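The answer's code block isn't preserved above; a sketch of the idea (the column name idx is illustrative):

>>> keys = df[['a', 'b']].drop_duplicates().reset_index().rename(columns={'index': 'idx'})
>>> df.merge(keys, on=['a', 'b'])
   a  b  idx
0  1  1    0
1  1  1    0
2  1  2    2
3  2  1    3
4  2  1    3
5  2  2    5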
Update: newer versions of pandas (0.20.2+) offer a simpler way to do this with the ngroup method, as noted in a comment on the question by @Constantino and a subsequent answer by @CalumYou. I'll leave this here as an alternate approach, but ngroup seems like the better way to do this in most cases.

A way that I believe is faster than the currently accepted answer by about an order of magnitude (timing results below):
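The original implementation isn't preserved in this copy; a sketch of one vectorized approach along these lines (the function name and details are assumptions), which sorts once and then counts group starts with duplicated() and cumsum():

def create_group_index(df, grouping_cols):
    df = df.sort_values(grouping_cols)
    # True at the first row of each group, False on repeated keys
    new_group = ~df.duplicated(subset=grouping_cols)
    # a running count of group starts yields a 0-based group number
    df['idx'] = new_group.cumsum() - 1
    return df.sort_index()  # back to the original row order

Calling create_group_index(df, ['a', 'b']) on the example yields idx values 0, 0, 1, 2, 2, 3.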
Timing results:
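The measured numbers aren't preserved in this copy; a harness along these lines (array size and repeat count are illustrative) would reproduce the comparison, using the create_group_index sketch above:

import numpy as np
import pandas as pd
from timeit import timeit

big = pd.DataFrame({'a': np.random.randint(0, 100, 10**6),
                    'b': np.random.randint(0, 100, 10**6)})

print(timeit(lambda: create_group_index(big, ['a', 'b']), number=10))
print(timeit(lambda: pd.Categorical(big['a'].astype(str) + '_'
                                    + big['b'].astype(str)).codes, number=10))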
Definitely not the most straightforward solution, but here is what I would do (comments in the code):
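The answer's code isn't preserved above; a sketch of the first step, building a combined key out of the row values (the underscore separator is illustrative):

# join each row's grouping values into a single string key
df['idx'] = df['a'].astype(str) + '_' + df['b'].astype(str)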
That would generate a unique idx for each combination of a and b. But this is still a rather silly index (think about some more complex values in columns a and b).
So let's clean up the index, which produces the desired output:
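Again as a sketch, numbering the distinct keys in order of appearance:

# replace each distinct key with a small integer, starting at 1
df['idx'] = df['idx'].map({key: i for i, key in enumerate(df['idx'].unique(), start=1)})

>>> df
   a  b  idx
0  1  1    1
1  1  1    1
2  1  2    2
3  2  1    3
4  2  1    3
5  2  2    4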
I'm not sure this is such a trivial problem. Here is a somewhat convoluted solution that first sorts by the grouping columns, then checks whether each row is different from the previous row and, if so, accumulates by 1. Check further below for an answer with string data.
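The original code block isn't preserved in this copy; a sketch that follows the steps described below:

df['idx'] = (df.sort_values(['a', 'b'])   # bring each group's rows together
               .diff()                    # difference from the previous row
               .fillna(0)                 # the first row has no predecessor
               .ne(0)                     # True wherever a value changed
               .any(axis=1)               # a new group starts if any column changed
               .cumsum())                 # running count of group starts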
Output (of the sketch above):

   a  b  idx
0  1  1    0
1  1  1    0
2  1  2    1
3  2  1    2
4  2  1    2
5  2  2    3
So breaking this up into steps, let's look at the output of df.sort_values(['a', 'b']).diff().fillna(0) (shown below), which checks whether each row is different from the previous row; any non-zero entry indicates a new group. A new group only needs a single differing column, which is what .ne(0).any(1) checks: not equal to 0 in any of the columns. A cumulative sum then keeps track of the groups.
Answer for columns as strings

With string columns, diff() no longer applies, so take a similar approach by checking whether the group has changed from one row to the next:
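A sketch of that idea; the original df1 isn't preserved here, so it's assumed to be a string-valued copy of the example frame. shift() lines each row up against its predecessor:

df1 = df[['a', 'b']].astype(str)  # assumed stand-in for the original df1
s = df1.sort_values(['a', 'b'])
# a new group starts when any column differs from the previous row;
# bfill() makes the first row compare equal to itself
df1['idx'] = s.ne(s.shift().bfill()).any(axis=1).cumsum()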
Here is the solution using ngroup from a comment above by @Constantino, for those still looking for this function (the equivalent of dplyr::group_indices in R, if you were trying to google with those keywords like me). This is also about 25% faster than the solution given by @maxliving, according to my own timing.
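The answer's code block isn't preserved above; the ngroup call itself is a one-liner, shown here on the question's frame:

>>> df['idx'] = df.groupby(['a', 'b']).ngroup()
>>> df
   a  b  idx
0  1  1    0
1  1  1    0
2  1  2    1
3  2  1    2
4  2  1    2
5  2  2    3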