I have a pandas DataFrame. The values in its first column are not unique: the same value can appear in several rows.
Whenever several rows share the same value in the first column, I would like to keep only the row with the maximal value in the third column. I almost found a solution:
import pandas
ls = []
ls.append({'c1':'a', 'c2':'a', 'c3':1})
ls.append({'c1':'a', 'c2':'c', 'c3':3})
ls.append({'c1':'a', 'c2':'b', 'c3':2})
ls.append({'c1':'b', 'c2':'b', 'c3':10})
ls.append({'c1':'b', 'c2':'c', 'c3':12})
ls.append({'c1':'b', 'c2':'a', 'c3':7})
df = pandas.DataFrame(ls, columns=['c1','c2','c3'])
print df
print '--------------------'
print df.groupby('c1').apply(lambda df:df.irow(df['c3'].argmax()))
As a result I get:
  c1 c2  c3
0  a  a   1
1  a  c   3
2  a  b   2
3  b  b  10
4  b  c  12
5  b  a   7
--------------------
   c1 c2  c3
c1
a   a  c   3
b   b  c  12
My problem is that I do not want c1 as the index. What I want is the following:
  c1 c2  c3
1  a  c   3
4  b  c  12
try this:
When calling df.groupby(...).apply(foo), the type of object returned by foo affects the way the results are melded together.
If you return a Series, the index of the Series becomes the columns of the final result, and the groupby key becomes the index (a bit of a mind-twister).
If instead you return a DataFrame, the final result uses the index of the DataFrame as index values, and the columns of the DataFrame as columns (very sensible).
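To make the difference concrete, here is a small sketch using the sample data from the question. It is written against a recent pandas, so it uses .loc and idxmax() instead of irow and argmax, and it selects the columns explicitly after groupby so that c1 stays available inside each group (an assumption about newer versions, which may otherwise warn about or drop the grouping column). The variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"c1": list("aaabbb"), "c2": list("acbbca"), "c3": [1, 3, 2, 10, 12, 7]}
)
# Explicit column selection keeps c1 inside each group (assumption: recent pandas).
grouped = df.groupby("c1")[["c1", "c2", "c3"]]

# foo returns a Series: the Series index (c1, c2, c3) becomes the columns,
# and the group key becomes the index of the result.
as_series = grouped.apply(lambda g: g.loc[g["c3"].idxmax()])

# foo returns a one-row DataFrame: its own index (the original row label)
# is kept; by default the group key is added as an outer index level.
as_frame = grouped.apply(lambda g: g.loc[[g["c3"].idxmax()]])

print(as_series)
print(as_frame)
```

Note the double brackets in the second lambda: g.loc[[label]] selects a one-row DataFrame, while g.loc[label] selects a Series.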
So, you can arrange for the type of output you desire by converting your Series into a DataFrame.
With Pandas 0.13 you can use the to_frame().T method:
yields
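A minimal sketch of the to_frame().T approach on the question's data could look like the following. The helper name maxrow is hypothetical, .loc and idxmax() are used in place of the older irow and argmax, and the explicit column selection after groupby is an assumption made so that c1 is kept on recent pandas versions.

```python
import pandas as pd

df = pd.DataFrame(
    {"c1": list("aaabbb"), "c2": list("acbbca"), "c3": [1, 3, 2, 10, 12, 7]}
)

def maxrow(group, col):
    # Hypothetical helper: pick the row with the largest value in `col`.
    # .loc[label] gives a Series; to_frame().T turns it into a one-row
    # DataFrame whose index is the original row label.
    return group.loc[group[col].idxmax()].to_frame().T

result = df.groupby("c1")[["c1", "c2", "c3"]].apply(maxrow, "c3")
# The group key is added as an outer index level; drop it to keep
# only the original row labels.
result = result.reset_index(level=0, drop=True)
print(result)
```

On the sample data this should leave the rows labelled 1 and 4, with c1 as an ordinary column instead of the index.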
In Pandas 0.12 or older, the equivalent would be:
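Where Series.to_frame is not available, one common substitute is to pass the Series to the DataFrame constructor and transpose it. The sketch below assumes that is the equivalent meant here. The helper name maxrow is hypothetical, and the code itself is written for a recent pandas (explicit column selection, .loc, idxmax()).

```python
import pandas as pd

df = pd.DataFrame(
    {"c1": list("aaabbb"), "c2": list("acbbca"), "c3": [1, 3, 2, 10, 12, 7]}
)

def maxrow(group, col):
    # Hypothetical helper: DataFrame(series) builds a one-column frame
    # named after the row label; .T flips it into a one-row frame.
    return pd.DataFrame(group.loc[group[col].idxmax()]).T

result = df.groupby("c1")[["c1", "c2", "c3"]].apply(maxrow, "c3")
result = result.reset_index(level=0, drop=True)
print(result)
```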
By the way, behzad.nouri's clever and elegant solution is quicker than mine for small DataFrames. The sort lifts the time complexity from O(n) to O(n log n), however, so it becomes slower than the to_frame solution shown above when applied to larger DataFrames.
Here is how I benchmarked it:
yields
(reset_df_first was another possibility I tried.)
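A comparison along these lines can be sketched with timeit. Everything here is an assumption rather than the original benchmark: the function names, the random test data, and in particular using_sort, which is only a stand-in for a sort-based solution (sort_values followed by drop_duplicates) and not necessarily the exact code from the other answer.

```python
import timeit

import numpy as np
import pandas as pd


def make_df(n, n_groups=100, seed=0):
    # Random test data (assumption): integer group keys, float values.
    rng = np.random.RandomState(seed)
    return pd.DataFrame(
        {
            "c1": rng.randint(n_groups, size=n),
            "c2": rng.randint(10, size=n),
            "c3": rng.rand(n),
        }
    )


def using_to_frame(df):
    def maxrow(group, col):
        return group.loc[group[col].idxmax()].to_frame().T

    result = df.groupby("c1")[list(df.columns)].apply(maxrow, "c3")
    return result.reset_index(level=0, drop=True)


def using_sort(df):
    # Stand-in for a sort-based solution: sort by c3, then keep the
    # last (largest) row of each c1 value.
    return df.sort_values("c3").drop_duplicates("c1", keep="last")


if __name__ == "__main__":
    for n in (1000, 10000):
        df = make_df(n)
        for func in (using_to_frame, using_sort):
            seconds = timeit.timeit(lambda: func(df), number=3) / 3
            print(n, func.__name__, seconds)
```

Timings depend heavily on the number of groups as well as the number of rows, so it is worth varying both before drawing conclusions.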