I have a dataframe which contains duplicate values according to two columns (A and B):
A B C
1 2 1
1 2 4
2 7 1
3 4 0
3 4 8
I want to remove duplicates keeping the row with max value in column C. This would lead to:
A B C
1 2 4
2 7 1
3 4 8
I cannot figure out how to do that. Should I use drop_duplicates(), or something else?
You can do it with drop_duplicates, as you wanted.
If it's important to get the same order, sort the result by its index afterwards.
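A minimal sketch of the drop_duplicates route, using the data from the question. The code that originally went with this answer is not available, so the descending sort and the sort_index() step are assumptions about what was meant:

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({"A": [1, 1, 2, 3, 3],
                   "B": [2, 2, 7, 4, 4],
                   "C": [1, 4, 1, 0, 8]})

# Put the largest C values first, then keep the first row seen
# for each (A, B) pair.
result = df.sort_values("C", ascending=False).drop_duplicates(subset=["A", "B"])

# Assumed follow-up step: restore the original row order via the index.
result = result.sort_index()
print(result)
```

The surviving rows keep their original index labels, which is what makes the final sort_index() possible.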
I think groupby should work.
If you need a dataframe back, you can chain a reset_index() call.
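Something along these lines should do it. This is a sketch on the question's data, assuming C is the only column besides the two keys:

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({"A": [1, 1, 2, 3, 3],
                   "B": [2, 2, 7, 4, 4],
                   "C": [1, 4, 1, 0, 8]})

# Max of C per (A, B) group; this is a Series indexed by (A, B).
c_max = df.groupby(["A", "B"])["C"].max()

# reset_index() turns the group keys back into ordinary columns.
result = c_max.reset_index()
print(result)
```

Note that this aggregates rather than selects rows: if the frame had further columns, they would not be carried along, and the original index is replaced by a fresh one.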
You can do this simply by using the pandas drop_duplicates function.
You can do it using groupby and .transform, keeping the rows where C equals c_maxes.
c_maxes is a Series of the maximum values of C in each group, but with the same length and the same index as df. If you haven't used .transform before, printing c_maxes might be a good idea to see how it works.
Another approach using drop_duplicates would be to sort by C and keep the last row for each (A, B) pair. Not sure which is more efficient, but I guess the first approach, as it doesn't involve sorting.
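For the groupby approach, a sketch of what the transform step can look like on the question's data. The name c_maxes comes from the answer; the exact filtering line is an assumption, since the original snippet is missing:

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({"A": [1, 1, 2, 3, 3],
                   "B": [2, 2, 7, 4, 4],
                   "C": [1, 4, 1, 0, 8]})

# For every row, the maximum of C within that row's (A, B) group.
# Same length and index as df, so it can be compared row by row.
c_maxes = df.groupby(["A", "B"])["C"].transform("max")
print(c_maxes)

# Keep only the rows whose C equals their group's maximum.
result = df.loc[df["C"] == c_maxes]
print(result)
```

One difference from the drop_duplicates variants: if several rows in a group tie for the maximum, this keeps all of them, so a final drop_duplicates(subset=['A', 'B']) may still be needed in that case.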
EDIT: From pandas 0.18 up, the second solution would be df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last') or, alternatively, df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B']). In any case, the groupby solution seems to be significantly faster.
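The original timings are not reproduced here, so if you want to check the performance claim on your own setup, a rough timing sketch could look like this. The random test frame, its size, and the helper names with_groupby and with_sort are all made up for illustration, and results will vary with pandas version and data shape:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 100,000 rows, up to 10,000 distinct (A, B) pairs.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "A": rng.integers(0, 100, size=n),
    "B": rng.integers(0, 100, size=n),
    "C": rng.integers(0, 1_000_000, size=n),
})


def with_groupby(frame):
    # groupby/transform approach: keep rows equal to their group maximum.
    c_maxes = frame.groupby(["A", "B"])["C"].transform("max")
    return frame.loc[frame["C"] == c_maxes]


def with_sort(frame):
    # sort + drop_duplicates approach: the last row per pair has the max C.
    return frame.sort_values("C").drop_duplicates(subset=["A", "B"], keep="last")


repeats = 5
for name, func in [("groupby", with_groupby), ("sort", with_sort)]:
    seconds = timeit.timeit(lambda: func(df), number=repeats)
    print(f"{name}: {seconds / repeats:.4f} s per call")
```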