How to get rows in pandas data frame, with maximal

I have a pandas data frame. In the first column it can have the same value several times (in other words, the values in the first column are not unique).

Whenever I have several rows that contain the same value in the first column, I would like to leave only those that have maximal value in the third column. I almost found a solution:

import pandas

ls = []
ls.append({'c1':'a', 'c2':'a', 'c3':1})
ls.append({'c1':'a', 'c2':'c', 'c3':3})
ls.append({'c1':'a', 'c2':'b', 'c3':2})
ls.append({'c1':'b', 'c2':'b', 'c3':10})
ls.append({'c1':'b', 'c2':'c', 'c3':12})
ls.append({'c1':'b', 'c2':'a', 'c3':7})

df = pandas.DataFrame(ls, columns=['c1','c2','c3'])
print df
print '--------------------'
print df.groupby('c1').apply(lambda df:df.irow(df['c3'].argmax()))

As a result I get:

  c1 c2  c3
0  a  a   1
1  a  c   3
2  a  b   2
3  b  b  10
4  b  c  12
5  b  a   7
--------------------
   c1 c2  c3
c1          
a   a  c   3
b   b  c  12

My problem is that, I do not want to have c1 as index. What I want to have is following:

  c1 c2  c3
1  a  c   3
4  b  c  12

回答1:

When calling df.groupby(...).apply(foo), the type of object returned by foo affects the way the results are melded together.

If you return a Series, the index of the Series become columns of the final result, and the groupby key becomes the index (a bit of a mind-twister).

If instead you return a DataFrame, the final result uses the index of the DataFrame as index values, and the columns of the DataFrame as columns (very sensible).

So, you can arrange for the type of output you desire by converting your Series into a DataFrame.

With Pandas 0.13 you can use the to_frame().T method:

def maxrow(x, col):
    return x.loc[x[col].argmax()].to_frame().T

result = df.groupby('c1').apply(maxrow, 'c3')
result = result.reset_index(level=0, drop=True)
print(result)

yields

  c1 c2  c3
1  a  c   3
4  b  c  12

In Pandas 0.12 or older, the equivalent would be:

def maxrow(x, col):
    ser = x.loc[x[col].idxmax()]
    df = pd.DataFrame({ser.name: ser}).T
    return df

By the way, behzad.nouri's clever and elegant solution is quicker than mine for small DataFrames. The sort lifts the time complexity from O(n) to O(n log n) however, so it becomes slower than the to_frame solution shown above when applied to larger DataFrames.

Here is how I benchmarked it:

import pandas as pd
import numpy as np
import timeit


def reset_df_first(df):
    df2 = df.reset_index()
    result = df2.groupby('c1').apply(lambda x: x.loc[x['c3'].idxmax()])
    result.set_index(['index'], inplace=True)
    return result

def maxrow(x, col):
    result = x.loc[x[col].argmax()].to_frame().T
    return result

def using_to_frame(df):
    result = df.groupby('c1').apply(maxrow, 'c3')
    result.reset_index(level=0, drop=True, inplace=True)
    return result

def using_sort(df):
    return df.sort('c3').groupby('c1', as_index=False).tail(1)


for N in (100, 1000, 2000):
    df = pd.DataFrame({'c1': {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'},
                       'c2': {0: 'a', 1: 'c', 2: 'b', 3: 'b', 4: 'c', 5: 'a'},
                       'c3': {0: 1, 1: 3, 2: 2, 3: 10, 4: 12, 5: 7}})

    df = pd.concat([df]*N)
    df.reset_index(inplace=True, drop=True)

    timing = dict()
    for func in (reset_df_first, using_to_frame, using_sort):
        timing[func] = timeit.timeit('m.{}(m.df)'.format(func.__name__),
                              'import __main__ as m ',
                              number=10)

    print('For N = {}'.format(N))
    for func in sorted(timing, key=timing.get):
        print('{:<20}: {:<0.3g}'.format(func.__name__, timing[func]))
    print

yields

For N = 100
using_sort          : 0.018
using_to_frame      : 0.0265
reset_df_first      : 0.0303

For N = 1000
using_to_frame      : 0.0358    \
using_sort          : 0.036     / this is roughly where the two methods cross over in terms of performance
reset_df_first      : 0.0432

For N = 2000
using_to_frame      : 0.0457
reset_df_first      : 0.0523
using_sort          : 0.0569

(reset_df_first was another possibility I tried.)

回答2:

try this:

df.sort('c3').groupby('c1', as_index=False).tail(1)