I have a data frame like this:
import pandas as pd

lvl1 = ['l1A', 'l1A', 'l1B', 'l1C', 'l1D']
lvl2 = ['l2A', 'l2A', 'l2A', 'l26', 'l27']
wgt = [.2, .3, .15, .05, .3]
lvls = [lvl1, lvl2]
df = pd.DataFrame(wgt, lvls).reset_index()
df.columns = ['lvl' + str(i) for i in range(1, 3)] + ['wgt']
df
lvl1 lvl2 wgt
0 l1A l2A 0.20
1 l1A l2A 0.30
2 l1B l2A 0.15
3 l1C l26 0.05
4 l1D l27 0.30
I want to get the average weight at each level and add them as a separate column to this data frame.
pd.concat([df,
           df.groupby('lvl1')[['wgt']].transform('mean').add_suffix('_l1avg'),
           df.groupby('lvl2')[['wgt']].transform('mean').add_suffix('_l2avg')],
          axis=1)
lvl1 lvl2 wgt wgt_l1avg wgt_l2avg
0 l1A l2A 0.20 0.25 0.216667
1 l1A l2A 0.30 0.25 0.216667
2 l1B l2A 0.15 0.15 0.216667
3 l1C l26 0.05 0.05 0.050000
4 l1D l27 0.30 0.30 0.300000
There can be more than two levels, so I would like to do this for a variable number of levels instead of hard-coding each one. What is the most efficient way to do this, as the dataset will grow very large? I don't necessarily need these to be in the same data frame; it could just be a separate n x m matrix of average weights (2 x 5 in this case).
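A generic version for any number of level columns could loop over them with `transform`. This is a minimal sketch, not a definitive answer; it assumes the level columns are named `lvl1`, `lvl2`, ..., and the output column names (`wgt_lvl1avg`, ...) are my own choice:

```python
import pandas as pd

df = pd.DataFrame({'lvl1': ['l1A', 'l1A', 'l1B', 'l1C', 'l1D'],
                   'lvl2': ['l2A', 'l2A', 'l2A', 'l26', 'l27'],
                   'wgt': [.2, .3, .15, .05, .3]})

# One group-mean column per level column, however many there are.
level_cols = [c for c in df.columns if c.startswith('lvl')]
avg_cols = [df.groupby(c)['wgt'].transform('mean').rename(f'wgt_{c}avg')
            for c in level_cols]
out = pd.concat([df] + avg_cols, axis=1)
```

`transform('mean')` broadcasts each group's mean back to the original row positions, so the result aligns with `df` and `pd.concat` can attach all the columns at once.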
Here is a non-pandas solution. From the resulting dictionary, it is possible to efficiently map the averages back to columns.
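The dictionary-of-averages idea could be sketched like this (an illustration of the approach, not the answerer's original code; the `defaultdict` accumulation is my assumption):

```python
from collections import defaultdict

lvl1 = ['l1A', 'l1A', 'l1B', 'l1C', 'l1D']
lvl2 = ['l2A', 'l2A', 'l2A', 'l26', 'l27']
wgt = [.2, .3, .15, .05, .3]

# Accumulate per-key sums and counts for each level, then divide.
avgs = {}
for name, keys in (('lvl1', lvl1), ('lvl2', lvl2)):
    sums = defaultdict(float)
    counts = defaultdict(int)
    for k, w in zip(keys, wgt):
        sums[k] += w
        counts[k] += 1
    avgs[name] = {k: sums[k] / counts[k] for k in sums}
```

The result is a nested dict of the form `{'lvl1': {'l1A': 0.25, ...}, 'lvl2': {...}}`, one mean per distinct key at each level.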
For this miniature dataset, I see the following performance for the 3 responses:
Use a list comprehension:
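A sketch of mapping the dictionary back onto columns with list comprehensions (the `avgs` values shown are the means computed earlier; the column naming is my own):

```python
import pandas as pd

df = pd.DataFrame({'lvl1': ['l1A', 'l1A', 'l1B', 'l1C', 'l1D'],
                   'lvl2': ['l2A', 'l2A', 'l2A', 'l26', 'l27'],
                   'wgt': [.2, .3, .15, .05, .3]})

avgs = {'lvl1': {'l1A': 0.25, 'l1B': 0.15, 'l1C': 0.05, 'l1D': 0.30},
        'lvl2': {'l2A': 0.65 / 3, 'l26': 0.05, 'l27': 0.30}}

# Look up each row's key in the per-level dict to build the new column.
for name, mapping in avgs.items():
    df[f'wgt_{name}avg'] = [mapping[k] for k in df[name]]
```

In practice `df[name].map(mapping)` does the same lookup without an explicit Python-level loop.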