Apply a function to a groupby object

Posted 2019-08-15 03:33

I want to count how many times the value increases from one row to the next, and compute the difference between the last element and the first element, within each group of a groupby. But I can't apply my functions to the groupby. Is the object I get after groupby a list? Also, what is the difference between "apply" and "agg"? Sorry, I have only been using Python for a few days.

def promotion(ls):
    pro =0
    if len(ls)>1:
        for j in range(1,len(ls)):
            if ls[j]>ls[j-1]:
                pro += 1
    return pro
def growth(ls):
    head= ls[0]
    tail= ls[len(ls)-1]
    gro= tail-head
    return gro
titlePromotion= JobData.groupby("candidate_id")["TitleLevel"].apply(promotion)
titleGrowth= JobData.groupby("candidate_id")["TitleLevel"].apply(growth)

The data is:

candidate_id    TitleLevel     othercols
1                 2              foo
2                 1              bar
2                 2              goo
2                 1              gar
The result should be
titlePromotion
candidate_id 
1                  0
2                  1
titleGrowth
candidate_id
1               0
2               0

2 Answers
乱世女痞
Reply #2 · 2019-08-15 04:10
VAR0    VAR1
1       1 
1       2
1       3
1       4
2       5
2       6
2       7
2       8

You could just use a lambda in apply. The code below subtracts each value in a group from that group's first value:

grp = df.groupby('VAR0')['VAR1'].apply(lambda x: x.iloc[0] - x)

If you try that with agg:

grp = df.groupby('VAR0')['VAR1'].agg(lambda x: x.iloc[0] - x)

It won't work, because agg expects the function to return a single value for each group.

If you subtract the values of particular cells, there is no difference between agg and apply: both produce one value for each group.

grp = df.groupby('VAR0')['VAR1'].apply(lambda x: x.iloc[0] - x.iloc[-1])
grp = df.groupby('VAR0')['VAR1'].agg(lambda x: x.iloc[0] - x.iloc[-1])

print(grp)

VAR0
1      -3
2      -3
Name: VAR1, dtype: int64
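
The contrast between the two methods can be sketched as follows. This is a minimal illustration that rebuilds the sample data as `df`; the exact exception raised by agg differs between pandas versions, so the sketch catches a broad `Exception` rather than assuming a specific type.

```python
import pandas as pd

df = pd.DataFrame({"VAR0": [1, 1, 1, 1, 2, 2, 2, 2],
                   "VAR1": [1, 2, 3, 4, 5, 6, 7, 8]})
grouped = df.groupby("VAR0")["VAR1"]

# The function returns one scalar per group: apply and agg both accept it.
by_apply = grouped.apply(lambda x: x.iloc[0] - x.iloc[-1])
by_agg = grouped.agg(lambda x: x.iloc[0] - x.iloc[-1])

# The function returns one value per row: apply accepts it,
# while agg is expected to reject it.
per_row = grouped.apply(lambda x: x.iloc[0] - x)
try:
    grouped.agg(lambda x: x.iloc[0] - x)
    agg_failed = False
except Exception:
    agg_failed = True

print(by_apply.tolist(), by_agg.tolist(), len(per_row), agg_failed)
```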

If you would like, for example, to subtract the previous row's value from each row's value (to get the increment for each row), you could use transform like this:

grp = df.groupby('VAR0')

def subtr(x):
    y=x.copy()
    for i in range(1,len(x.index)):
        x.iloc[i]=y.iloc[i]-y.iloc[i-1]
    return x

new_var = grp['VAR1'].transform(subtr)
print(new_var)

0    1
1    1
2    1
3    1
4    5
5    1
6    1
7    1
Name: VAR1, dtype: int64

Or, more simply for this particular problem (here the first row of each group becomes NaN instead of keeping its original value):

grp = df.groupby('VAR0')['VAR1'].apply(lambda x: x - x.shift())
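
The same per-row increments can probably be obtained without a lambda, using the `diff` method that pandas provides on a grouped Series. The sketch below rebuilds the sample data as `df` and is meant as an alternative to try, not as the only way to do it.

```python
import pandas as pd

df = pd.DataFrame({"VAR0": [1, 1, 1, 1, 2, 2, 2, 2],
                   "VAR1": [1, 2, 3, 4, 5, 6, 7, 8]})

# Difference from the previous row, restarting at each group boundary.
# The first row of every group has no previous row, so it becomes NaN.
increments = df.groupby("VAR0")["VAR1"].diff()
print(increments)
```

The result keeps the original row index, so it can be assigned back to the frame as a new column.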
在下西门庆
Reply #3 · 2019-08-15 04:23
import pandas as pd

def promotion(ls):
    return (ls.diff() > 0).sum()

def growth(ls):
    return ls.iloc[-1] - ls.iloc[0]

jobData = pd.DataFrame(
    {'candidate_id': [1, 2, 2, 2],
     'TitleLevel': [2, 1, 2, 1]})

grouped = jobData.groupby("candidate_id")
titlePromotion = grouped["TitleLevel"].agg(promotion)
print(titlePromotion)
# candidate_id
# 1               0
# 2               1
# dtype: int64

titleGrowth = grouped["TitleLevel"].agg(growth)
print(titleGrowth)
# candidate_id
# 1               0
# 2               0
# dtype: int64
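
If both results are wanted side by side, one option is to pass the two functions to a single agg call as keyword arguments. This is a sketch that assumes pandas 0.25 or later, where named aggregation is available; the output column names `promotion` and `growth` are simply the keyword names chosen here.

```python
import pandas as pd

def promotion(ls):
    # Count rows whose value is higher than the previous row's value.
    return (ls.diff() > 0).sum()

def growth(ls):
    # Last value minus first value, selected by position.
    return ls.iloc[-1] - ls.iloc[0]

jobData = pd.DataFrame({"candidate_id": [1, 2, 2, 2],
                        "TitleLevel": [2, 1, 2, 1]})

# Each keyword becomes a column of the resulting DataFrame
# (assumes named aggregation, i.e. pandas 0.25 or later).
summary = jobData.groupby("candidate_id")["TitleLevel"].agg(
    promotion=promotion, growth=growth)
print(summary)
```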

Some tips:

If you define the generic function

def foo(ls):
    print(type(ls))

and call

jobData.groupby("candidate_id")["TitleLevel"].apply(foo)

Python will print

<class 'pandas.core.series.Series'>

This is a low-brow but effective way to discover that calling jobData.groupby(...)[...].apply(foo) passes a Series to foo.
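
Another way to look inside the groupby, sketched here on the same sample data, is to iterate over it directly. Each step yields the group key together with that group's Series, which answers the "is it a list?" question: it is a collection of Series, one per key.

```python
import pandas as pd

jobData = pd.DataFrame({"candidate_id": [1, 2, 2, 2],
                        "TitleLevel": [2, 1, 2, 1]})

# Iterating over the grouped column yields (group key, Series) pairs.
seen = {}
for key, chunk in jobData.groupby("candidate_id")["TitleLevel"]:
    # Record the type and the index labels of each chunk.
    seen[key] = (type(chunk).__name__, list(chunk.index))
print(seen)
```

Note that each chunk keeps the index labels of the original rows rather than being renumbered from 0.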


The apply method calls foo once for every group. It can return a Series or a DataFrame with the resulting chunks glued together. It is possible to use apply when foo returns an object such as a numerical value or string, but in such cases I think using agg is preferred. A typical use case for using apply is when you want to, say, square every value in a group and thus need to return a new group of the same shape.

The transform method is also useful in this situation, when you want to transform every value in the group and thus need to return something of the same shape. But the result can differ from that of apply, since a different object may be passed to foo: for example, each column of a grouped DataFrame would be passed to foo when using transform, while the entire group would be passed to foo when using apply. The easiest way to understand this is to experiment with a simple DataFrame and the generic foo.

The agg method calls foo once for every group, but unlike apply it should return a single number per group. The group is aggregated into a value. A typical use case for using agg is when you want to count the number of items in the group.
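
The difference in result shape can be sketched on the sample data. This is a small illustration only: agg collapses each group to one value, while transform returns one value for every original row.

```python
import pandas as pd

jobData = pd.DataFrame({"candidate_id": [1, 2, 2, 2],
                        "TitleLevel": [2, 1, 2, 1]})
levels = jobData.groupby("candidate_id")["TitleLevel"]

# agg: one value per group, indexed by candidate_id.
counts = levels.agg("count")

# transform: one value per original row, aligned with jobData's index.
squared = levels.transform(lambda s: s ** 2)

print(counts)
print(squared)
```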


You can debug and understand what went wrong with your original code by using a version of the generic foo function that prints each Series it receives, followed by a separator line:

In [30]: grouped['TitleLevel'].apply(foo)
0    2
Name: 1, dtype: int64
--------------------------------------------------------------------------------
1    1
2    2
3    1
Name: 2, dtype: int64
--------------------------------------------------------------------------------
Out[30]: 
candidate_id
1               None
2               None
dtype: object

This shows you the Series that are being passed to foo. Notice that in the second Series, the index values are 1, 2, and 3. So ls[0] raises a KeyError, since there is no label with value 0 in the second Series.

What you really want is the first item in the Series. That is what iloc is for.

So to summarize, use ls[label] to select the row of a Series whose index value is label. Use ls.iloc[n] to select the row at position n of the Series, counting from 0.
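
The distinction can be reproduced on a standalone Series. The sketch below builds a Series shaped like the second group (the name `ls` mirrors the function argument) and uses `.loc` for the label lookup to make the intent explicit.

```python
import pandas as pd

# A Series shaped like the second group: labels start at 1, not 0.
ls = pd.Series([1, 2, 1], index=[1, 2, 3])

first_by_position = ls.iloc[0]   # first row, whatever its label is
last_by_position = ls.iloc[-1]   # last row, whatever its label is
first_by_label = ls.loc[1]       # row whose index label is 1

try:
    ls[0]                        # looks for the label 0, which does not exist
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(first_by_position, last_by_position, first_by_label, lookup_failed)
```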

Thus, to fix your code with the least amount of change, you could use

def promotion(ls):
    pro =0
    if len(ls)>1:
        for j in range(1,len(ls)):
            if ls.iloc[j]>ls.iloc[j-1]:
                pro += 1
    return pro
def growth(ls):
    head= ls.iloc[0]
    tail= ls.iloc[len(ls)-1]
    gro= tail-head
    return gro