I need to reproduce with pandas what SQL does so easily:
select
del_month
, sum(case when off0_on1 = 1 then 1 else 0 end) as on1
, sum(case when off0_on1 = 0 then 1 else 0 end) as off0
from a1
group by del_month
order by del_month
Here is a sample, illustrative pandas dataframe to work on:
a1 = pd.DataFrame({'del_month':[1,1,1,1,2,2,2,2], 'off0_on1':[0,0,1,1,0,1,1,1]})
Here are my attempts to reproduce the above SQL with pandas. The first line works. The second line gives an error:
a1['on1'] = a1.groupby('del_month')['off0_on1'].transform(sum)
a1['off0'] = a1.groupby('del_month')['off0_on1'].transform(sum(lambda x: 1 if x == 0 else 0))
Here's the second line's error:
TypeError: 'function' object is not iterable
This previous question of mine had a problem with the lambda function, which was solved. The bigger problem is how to reproduce SQL's "sum(case when)" logic on grouped data. I'm looking for a general solution, since I need to do this sort of thing often. The answers in my previous question suggested using map() inside the lambda function, but the following results for the "off0" column are not what I need. The "on1" column is what I want. The answer should be the same for the whole group (i.e. "del_month").
![](https://www.manongdao.com/static/images/pcload.jpg)
Simply sum the Trues in your conditional logic expressions:
import pandas as pd
a1 = pd.DataFrame({'del_month':[1,1,1,1,2,2,2,2],
'off0_on1':[0,0,1,1,0,1,1,1]})
a1['on1'] = a1.groupby('del_month')['off0_on1'].transform(lambda x: sum(x==1))
a1['off0'] = a1.groupby('del_month')['off0_on1'].transform(lambda x: sum(x==0))
print(a1)
# del_month off0_on1 on1 off0
# 0 1 0 2 2
# 1 1 0 2 2
# 2 1 1 2 2
# 3 1 1 2 2
# 4 2 0 3 1
# 5 2 1 3 1
# 6 2 1 3 1
# 7 2 1 3 1
Similarly, you can do the same in SQL if dialect supports it which most should:
select
del_month
, sum(off0_on1 = 1) as on1
, sum(off0_on1 = 0) as off0
from a1
group by del_month
order by del_month
And to replicate above SQL in pandas, don't use transform
but send multiple aggregates in a groupby().apply()
call:
def aggfunc(x):
data = {'on1': sum(x['off0_on1'] == 1),
'off0': sum(x['off0_on1'] == 0)}
return pd.Series(data)
g = a1.groupby('del_month').apply(aggfunc)
print(g)
# on1 off0
# del_month
# 1 2 2
# 2 3 1
Using get_dummies
would only need a single groupby
call, which is simpler.
v = pd.get_dummies(df.pop('off0_on1')).groupby(df.del_month).transform(sum)
df = pd.concat([df, v.rename({0: 'off0', 1: 'on1'}, axis=1)], axis=1)
df
del_month off0 on1
0 1 2 2
1 1 2 2
2 1 2 2
3 1 2 2
4 2 1 3
5 2 1 3
6 2 1 3
7 2 1 3
Additionally, for the case of aggregation, call sum
directly instead of using apply
:
(pd.get_dummies(df.pop('off0_on1'))
.groupby(df.del_month)
.sum()
.rename({0: 'off0', 1: 'on1'}, axis=1))
off0 on1
del_month
1 2 2
2 1 3