I want to count how many times the value increases from one row to the next, and compute the difference between the first and the last element, within each group of a groupby. But I can't apply my functions to the groupby. After a groupby, is each group a list? And what is the difference between "apply" and "agg"? Sorry, I have only been using Python for a few days.
def promotion(ls):
    pro = 0
    if len(ls) > 1:
        for j in range(1, len(ls)):
            if ls[j] > ls[j-1]:
                pro += 1
    return pro
def growth(ls):
    head = ls[0]
    tail = ls[len(ls)-1]
    gro = tail - head
    return gro
titlePromotion= JobData.groupby("candidate_id")["TitleLevel"].apply(promotion)
titleGrowth= JobData.groupby("candidate_id")["TitleLevel"].apply(growth)
The data is:
candidate_id TitleLevel othercols
1 2 foo
2 1 bar
2 2 goo
2 1 gar
The result should be
titlePromotion
candidate_id
1 0
2 1
titleGrowth
candidate_id
1 0
2 0
import pandas as pd
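def promotion(ls):
    # count the rows whose value is larger than the previous row's
    return (ls.diff() > 0).sum()
def growth(ls):
    # last value minus first value, selected by position
    return ls.iloc[-1] - ls.iloc[0]
jobData = pd.DataFrame(
    {'candidate_id': [1, 2, 2, 2],
     'TitleLevel': [2, 1, 2, 1]})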
grouped = jobData.groupby("candidate_id")
titlePromotion = grouped["TitleLevel"].agg(promotion)
print(titlePromotion)
# candidate_id
# 1 0
# 2 1
# dtype: int64
titleGrowth = grouped["TitleLevel"].agg(growth)
print(titleGrowth)
# candidate_id
# 1 0
# 2 0
# dtype: int64
Some tips:
If you define the generic function
def foo(ls):
    print(type(ls))
and call
jobData.groupby("candidate_id")["TitleLevel"].apply(foo)
Python will print
<class 'pandas.core.series.Series'>
This is a low-brow but effective way to discover that calling jobData.groupby(...)[...].apply(foo) passes a Series to foo.
The apply method calls foo once for every group. It can return a Series or a DataFrame, with the chunks returned by foo glued together. It is possible to use apply when foo returns an object such as a numerical value or string, but in such cases I think using agg is preferred. A typical use case for apply is when you want to, say, square every value in a group and thus need to return a new group of the same shape.
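As a minimal sketch of that squaring use case, assuming the same sample data as above (the names job_data, grouped and squared are only illustrative):

```python
import pandas as pd

job_data = pd.DataFrame({"candidate_id": [1, 2, 2, 2],
                         "TitleLevel": [2, 1, 2, 1]})
grouped = job_data.groupby("candidate_id")["TitleLevel"]

# The lambda returns a Series as long as the group it received,
# so apply glues the per-group chunks back into one long Series.
squared = grouped.apply(lambda s: s ** 2)
print(squared)
```

The values come back as 4, 1, 4, 1. How the index of the result is labelled (original row labels only, or group key plus row label) depends on the pandas version and the group_keys setting, so it is safer not to rely on it.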
The transform method is also useful in this situation (when you want to transform every value in the group and thus need to return something of the same shape), but the result can be different from the one you get with apply, since a different object may be passed to foo. For example, each column of a grouped dataframe would be passed to foo when using transform, while the entire group would be passed to foo when using apply. The easiest way to understand this is to experiment with a simple dataframe and the generic foo.
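Here is a small sketch of the "same shape" property of transform on a single grouped column, again assuming the sample data from above (job_data, grouped, since_first and group_total are illustrative names):

```python
import pandas as pd

job_data = pd.DataFrame({"candidate_id": [1, 2, 2, 2],
                         "TitleLevel": [2, 1, 2, 1]})
grouped = job_data.groupby("candidate_id")["TitleLevel"]

# Each group is mapped to a Series of the same length; the result is
# aligned with the rows of the original frame.
since_first = grouped.transform(lambda s: s - s.iloc[0])

# A per-group scalar is broadcast back to every row of its group.
group_total = grouped.transform("sum")

print(since_first.tolist())
print(group_total.tolist())
```

Because the result lines up row by row with job_data, it can be assigned straight back as a new column.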
The agg method calls foo once for every group, but unlike with apply, foo should return a single value per group. The group is aggregated into a value. A typical use case for agg is when you want to count the number of items in the group.
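For instance, a quick sketch of counting with agg, plus one custom reduction, on the same sample data (counts and spread are illustrative names):

```python
import pandas as pd

job_data = pd.DataFrame({"candidate_id": [1, 2, 2, 2],
                         "TitleLevel": [2, 1, 2, 1]})
grouped = job_data.groupby("candidate_id")["TitleLevel"]

# Built-in aggregation: number of non-null items per group.
counts = grouped.agg("count")

# Custom aggregation: the function must reduce each group to one value.
spread = grouped.agg(lambda s: s.max() - s.min())

print(counts)
print(spread)
```

Both results have one row per candidate_id, which is what distinguishes them from the same-shape results above.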
You can debug and understand what went wrong with your original code by using the generic foo function (here printing ls itself and a separator line instead of type(ls)):
In [30]: grouped['TitleLevel'].apply(foo)
0 2
Name: 1, dtype: int64
--------------------------------------------------------------------------------
1 1
2 2
3 1
Name: 2, dtype: int64
--------------------------------------------------------------------------------
Out[30]:
candidate_id
1 None
2 None
dtype: object
This shows you the Series that are being passed to foo. Notice that in the second Series, the index values are 1, 2 and 3. So ls[0] raises a KeyError, since there is no label with value 0 in the second Series.
What you really want is the first item in the Series. That is what iloc is for.
So to summarize, use ls[label] to select the row of a Series with index value of label. Use ls.iloc[n] to select the nth row of the Series (counting from 0).
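To make the label-versus-position distinction concrete, here is a small sketch with a Series whose index starts at 1, like the second group above (the values 10, 20, 30 are made up for illustration):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[1, 2, 3])

by_label = s.loc[1]       # row whose index label is 1 -> 10
by_position = s.iloc[1]   # second row, whatever its label -> 20
first = s.iloc[0]         # first row -> 10
last = s.iloc[-1]         # last row -> 30

# With an integer index, s[0] is a label lookup, and there is no label 0.
try:
    s[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, first, last, label_zero_found)
```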
Thus, to fix your code with the least amount of change, you could use
def promotion(ls):
    pro = 0
    if len(ls) > 1:
        for j in range(1, len(ls)):
            if ls.iloc[j] > ls.iloc[j-1]:
                pro += 1
    return pro
def growth(ls):
    head = ls.iloc[0]
    tail = ls.iloc[len(ls)-1]
    gro = tail - head
    return gro
VAR0 VAR1
1 1
1 2
1 3
1 4
2 5
2 6
2 7
2 8
You could just use a lambda in apply, like this.
The code below subtracts every value in a group from the group's first value:
grp = df.groupby('VAR0')['VAR1'].apply(lambda x: x.iloc[0] - x)
If you try that with agg:
grp = df.groupby('VAR0')['VAR1'].agg(lambda x: x.iloc[0] - x)
It won't work, because agg needs to get one value for each group.
If you subtract the values of particular cells, there is no difference between agg and apply; both produce one value for each group:
grp = df.groupby('VAR0')['VAR1'].apply(lambda x: x.iloc[0] - x.iloc[-1])
grp = df.groupby('VAR0')['VAR1'].agg(lambda x: x.iloc[0] - x.iloc[-1])
print(grp)
VAR0
1 -3
2 -3
Name: VAR1, dtype: int64
If you would like, for example, to subtract the previous row's value from each row (to get the increment for each row), you could use transform like this:
grp = df.groupby('VAR0')
def subtr(x):
    # work on a copy so the group passed in by transform is not mutated
    y = x.copy()
    for i in range(1, len(x.index)):
        y.iloc[i] = x.iloc[i] - x.iloc[i-1]
    return y
new_var = grp['VAR1'].transform(subtr)
print(new_var)
0 1
1 1
2 1
3 1
4 5
5 1
6 1
7 1
Name: VAR1, dtype: int64
Or more easily, for this particular problem (here the first row of each group becomes NaN instead of keeping its value):
grp = df.groupby('VAR0')['VAR1'].apply(lambda x: x - x.shift())
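As a possible alternative, the grouped object also has a built-in diff method that computes the same per-group increments without a lambda. A minimal sketch, assuming df is the VAR0/VAR1 frame shown above (increments is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({"VAR0": [1, 1, 1, 1, 2, 2, 2, 2],
                   "VAR1": [1, 2, 3, 4, 5, 6, 7, 8]})

# Difference to the previous row within each group; the first row of
# every group has no predecessor, so it comes back as NaN.
increments = df.groupby("VAR0")["VAR1"].diff()
print(increments)
```

The result keeps the original row index, so it can be stored as a new column of df.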