I have a pandas data frame. In the first column it can have the same value several times (in other words, the values in the first column are not unique).
Whenever I have several rows that contain the same value in the first column, I would like to leave only those that have maximal value in the third column. I almost found a solution:
import pandas
ls = []
ls.append({'c1':'a', 'c2':'a', 'c3':1})
ls.append({'c1':'a', 'c2':'c', 'c3':3})
ls.append({'c1':'a', 'c2':'b', 'c3':2})
ls.append({'c1':'b', 'c2':'b', 'c3':10})
ls.append({'c1':'b', 'c2':'c', 'c3':12})
ls.append({'c1':'b', 'c2':'a', 'c3':7})
df = pandas.DataFrame(ls, columns=['c1','c2','c3'])
print df
print '--------------------'
print df.groupby('c1').apply(lambda df:df.irow(df['c3'].argmax()))
As a result I get:
c1 c2 c3
0 a a 1
1 a c 3
2 a b 2
3 b b 10
4 b c 12
5 b a 7
--------------------
c1 c2 c3
c1
a a c 3
b b c 12
My problem is that, I do not want to have c1
as index. What I want to have is following:
c1 c2 c3
1 a c 3
4 b c 12
When calling df.groupby(...).apply(foo)
, the type of object returned by foo
affects the way the results are melded together.
If you return a Series, the index of the Series become columns of the final result, and the groupby key becomes the index (a bit of a mind-twister).
If instead you return a DataFrame, the final result uses the index of the DataFrame as index values, and the columns of the DataFrame as columns (very sensible).
So, you can arrange for the type of output you desire by converting your Series into a DataFrame.
With Pandas 0.13 you can use the to_frame().T
method:
def maxrow(x, col):
return x.loc[x[col].argmax()].to_frame().T
result = df.groupby('c1').apply(maxrow, 'c3')
result = result.reset_index(level=0, drop=True)
print(result)
yields
c1 c2 c3
1 a c 3
4 b c 12
In Pandas 0.12 or older, the equivalent would be:
def maxrow(x, col):
ser = x.loc[x[col].idxmax()]
df = pd.DataFrame({ser.name: ser}).T
return df
By the way, behzad.nouri's clever and elegant solution is quicker than mine for small DataFrames.
The sort
lifts the time complexity from O(n)
to O(n log n)
however, so it becomes slower than the to_frame
solution shown above when applied to larger DataFrames.
Here is how I benchmarked it:
import pandas as pd
import numpy as np
import timeit
def reset_df_first(df):
df2 = df.reset_index()
result = df2.groupby('c1').apply(lambda x: x.loc[x['c3'].idxmax()])
result.set_index(['index'], inplace=True)
return result
def maxrow(x, col):
result = x.loc[x[col].argmax()].to_frame().T
return result
def using_to_frame(df):
result = df.groupby('c1').apply(maxrow, 'c3')
result.reset_index(level=0, drop=True, inplace=True)
return result
def using_sort(df):
return df.sort('c3').groupby('c1', as_index=False).tail(1)
for N in (100, 1000, 2000):
df = pd.DataFrame({'c1': {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'},
'c2': {0: 'a', 1: 'c', 2: 'b', 3: 'b', 4: 'c', 5: 'a'},
'c3': {0: 1, 1: 3, 2: 2, 3: 10, 4: 12, 5: 7}})
df = pd.concat([df]*N)
df.reset_index(inplace=True, drop=True)
timing = dict()
for func in (reset_df_first, using_to_frame, using_sort):
timing[func] = timeit.timeit('m.{}(m.df)'.format(func.__name__),
'import __main__ as m ',
number=10)
print('For N = {}'.format(N))
for func in sorted(timing, key=timing.get):
print('{:<20}: {:<0.3g}'.format(func.__name__, timing[func]))
print
yields
For N = 100
using_sort : 0.018
using_to_frame : 0.0265
reset_df_first : 0.0303
For N = 1000
using_to_frame : 0.0358 \
using_sort : 0.036 / this is roughly where the two methods cross over in terms of performance
reset_df_first : 0.0432
For N = 2000
using_to_frame : 0.0457
reset_df_first : 0.0523
using_sort : 0.0569
(reset_df_first
was another possibility I tried.)
try this:
df.sort('c3').groupby('c1', as_index=False).tail(1)