I have a dataset will some missing data that looks like this:
id category value
1 A NaN
2 B NaN
3 A 10.5
4 C NaN
5 A 2.0
6 B 1.0
I need to fill in the nulls to use the data in a model. Every time a category occurs for the first time it is NULL. The way I want to do is for cases like category A
and B
that have more than one value replace the nulls with the average of that category. And for category C
with only single occurrence just fill in the average of the rest of the data.
I know that I can simply do this for cases like C
to get the average of all the rows but I'm stuck trying to do the categorywise means for A and B and replacing the nulls.
df['value'] = df['value'].fillna(df['value'].mean())
I need the final df to be like this
id category value
1 A 6.25
2 B 1.0
3 A 10.5
4 C 4.15
5 A 2.0
6 B 1.0
I think you can use
groupby
andapply
fillna
withmean
. Then getNaN
if some category has onlyNaN
values, so usemean
of all values of column for fillingNaN
:You can also use
GroupBy
+transform
to fillNaN
values with groupwise means. This method avoids inefficientapply
+lambda
. For example: