Building on this answer, and given that
>>> df
columnA columnB columnC
0 cat1 3 400
1 cat1 2 20
2 cat1 5 3029
3 cat2 1 492
4 cat2 4 30
5 cat3 2 203
6 cat3 6 402
7 cat3 4 391
>>> df.groupby(['columnA']).agg({'columnA':'size','columnB':'min'}).rename(columns={'columnA':'size'})
size min
columnA
cat1 3 2
cat2 2 1
cat3 3 2
I want to obtain a DataFrame containing also the value of columnC corresponding to (on the same row of) the displayed minimum value of columnB, that is:
size min columnC
columnA
cat1 3 2 20
cat2 2 1 492
cat3 3 2 203
Of course this is possible only for those aggregating functions (like min or max) which 'pick' a value from the group rather than 'aggregate' (like sum or average).
Any clue?
Thanks in advance.
You can use
idxmin
to pull out the row indices of those rows:You can append this as a column to
res
:Since the result you are looking for is essentially a join on
['columnA', 'columnB']
, you can obtain the desired DataFrame usingprovided we setup
result
with the right column names:yields
On reason why you might want to use
pd.merge
instead ofgroupby/apply
is becausegroupby/apply
calls a function for each group. If there are a lot of groups, this can be slow.For example, if you had a 10000-row DataFrame with 1000 groups,
then
using_merge
(below) is ~ 250x faster thanusing_apply
: