I have a large dataset of the form:
      period_id  gic_subindustry_id  operating_mgn_fym5  operating_mgn_fym4
317      201509            25101010           13.348150           11.745965
682      201509            20101010           10.228725           10.473917
903      201509            20101010                 NaN           17.700966
1057     201509            50101010           27.858305           28.378040
1222     201509            25502020           15.598956           11.658813
2195     201508            25502020           27.688324           22.969760
2439     201508            45202020                 NaN           27.145216
2946     201508            45102020           17.956425           18.327724
In practice, I have thousands of values for each year going back 25 years, and multiple (10+) columns.
I am trying to replace the NaN values with the gic_subindustry_id median/mean value for that time period.
I tried something along the lines of
df.fillna(df.groupby(['period_id', 'gic_subindustry_id']).transform('mean')), but this seemed painfully slow (I stopped it after several minutes).
It occurred to me that it might be slow because the mean is recalculated for every NaN encountered. To get around this, I thought that calculating the means once per period_id/gic_subindustry_id group, and then replacing/mapping each NaN from those, might be substantially faster.
means = df.groupby(['period_id', 'gic_subindustry_id']).apply(lambda x: x.mean())
Output:
operating_mgn_fym5 operating_mgn_fym4 operating_mgn_fym3 operating_mgn_fym2
period_id gic_subindustry_id
201509 45202030 1.622685 0.754661 0.755324 321.295665
45203010 1.447686 0.226571 0.334280 12.564398
45203015 0.733524 0.257581 0.345450 27.659407
45203020 1.322349 0.655481 0.468740 19.823722
45203030 1.461916 1.181407 1.487330 16.598534
45301010 2.074954 0.981030 0.841125 29.423161
45301020 2.621158 1.235087 1.550252 82.717147
And indeed, this is much faster (30-60 seconds).
However, I am struggling to figure out how to map the NaNs back to these means. And is this even the 'correct' way of performing the mapping? Speed isn't of paramount importance, but under 60 seconds would be nice.