This should be straightforward, but the closest thing I've found is this post: pandas: Filling missing values within a group, and I still can't solve my problem...
Suppose I have the following dataframe
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})
  name  value
0    A      1
1    A    NaN
2    B    NaN
3    B      2
4    B      3
5    B      1
6    C      3
7    C    NaN
8    C      3
and I'd like to fill in the NaN values with the mean value within each "name" group, i.e.
  name  value
0    A      1
1    A      1
2    B      2
3    B      2
4    B      3
5    B      1
6    C      3
7    C      3
8    C      3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.
I just did this
All missing values within your DataFrame will be filled with the mean, if that is what you're looking for. This worked for me: it's simple, and gets the job done.
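The code block for this answer didn't survive; based on the description ("all missing values ... filled by mean"), it was presumably a whole-frame fill along these lines, using the question's df (note `numeric_only=True` is added here so recent pandas doesn't try to average the string `name` column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C']})

# Fill every NaN with its column's overall mean (not the per-group mean).
df = df.fillna(df.mean(numeric_only=True))
```

Be aware this uses the overall column mean (13/6 ≈ 2.17 here), not the per-group mean the question asks for.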
The featured high-ranked answer only works for a pandas DataFrame with only two columns. If you have more columns, use instead:
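The code itself is missing here; presumably it selects the single target column before transforming, something like this (using the question's df):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C']})

# Selecting ['value'] first means any extra columns are left untouched.
df['value'] = df.groupby('name')['value'].transform(lambda x: x.fillna(x.mean()))
```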
@DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: multiple columns to group by, and multiple value columns:
... gives ...
In this generalized case we would like to group by `category` and `name`, and impute only on `value`. This can be solved as follows:
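The code block is missing from this answer; a sketch of the described two-key group-by, using hypothetical `category`/`name`/`value`/`other` columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two grouping keys plus an extra measure column.
df = pd.DataFrame({'category': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
                   'name':     ['A', 'A', 'A', 'A', 'A', 'A'],
                   'value':    [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
                   'other':    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# Group by both keys, but impute only the 'value' column.
df['value'] = (df.groupby(['category', 'name'])['value']
                 .transform(lambda x: x.fillna(x.mean())))
```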
Notice the column list in the group-by clause, and that we select the `value` column right after the group-by. This makes the transformation run only on that particular column. You could add the selection at the end, but then you would run the transformation for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to. A performance test, increasing the dataset by doing ...
... confirms that this speeds things up proportionally to the number of columns you don't have to impute:
On a final note, you can generalize even further if you want to impute more than one column, but not all:
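The closing code block is also missing; one way to impute a subset of columns, under the same hypothetical `category`/`name`/`value`/`other` layout as above:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two measure columns to impute.
df = pd.DataFrame({'category': ['X', 'X', 'Y', 'Y'],
                   'name':     ['A', 'A', 'A', 'A'],
                   'value':    [1.0, np.nan, 3.0, np.nan],
                   'other':    [np.nan, 2.0, np.nan, 4.0]})

# Pass a list of measure columns to impute several (but not all) of them;
# transform applies the lambda per column within each group.
cols = ['value', 'other']
df[cols] = (df.groupby(['category', 'name'])[cols]
              .transform(lambda x: x.fillna(x.mean())))
```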
I'd do it this way:

`fillna` + `groupby` + `transform` + `mean`

This seems intuitive:
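The code block for this approach is missing; based on the description below, it presumably reads (using the question's df):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C']})

# transform('mean') broadcasts each group's mean back onto the original
# index, so the result can be passed straight to fillna.
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
```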
The `groupby` + `transform` syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous `lambda` function.