This should be straightforward, but the closest thing I've found is this post: pandas: Filling missing values within a group, and I still can't solve my problem...
Suppose I have the following dataframe
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})
  name  value
0    A      1
1    A    NaN
2    B    NaN
3    B      2
4    B      3
5    B      1
6    C      3
7    C    NaN
8    C      3
and I'd like to fill in the NaN values with the mean value within each "name" group, i.e.
  name  value
0    A      1
1    A      1
2    B      2
3    B      2
4    B      3
5    B      1
6    C      3
7    C      3
8    C      3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.
I just did this
All missing values within your DataFrame will be filled with the mean, if that is what you're looking for. This worked for me: it's simple, and gets the job done.
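The code block for this answer didn't survive; based on the description ("all missing values ... filled by mean"), it was presumably a whole-frame fill along these lines, using the question's df (note `numeric_only=True` is added here so recent pandas doesn't try to average the string `name` column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C']})

# Fill every NaN with its column's overall mean (not the per-group mean).
df = df.fillna(df.mean(numeric_only=True))
```

Be aware this uses the overall column mean (13/6 ≈ 2.17 here), not the per-group mean the question asks for.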
The featured high-ranked answer only works for a pandas DataFrame with only two columns. If you have more columns, use instead:
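The code itself is missing here; presumably it selects the single target column before transforming, something like this (using the question's df):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C']})

# Selecting ['value'] first means any extra columns are left untouched.
df['value'] = df.groupby('name')['value'].transform(lambda x: x.fillna(x.mean()))
```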
@DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: multiple columns to group by, and multiple value columns:
... gives ...
In this generalized case we would like to group by `category` and `name`, and impute only on `value`. This can be solved as follows:
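The code block is missing from this answer; a sketch of the described two-key group-by, using hypothetical `category`/`name`/`value`/`other` columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two grouping keys plus an extra measure column.
df = pd.DataFrame({'category': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
                   'name':     ['A', 'A', 'A', 'A', 'A', 'A'],
                   'value':    [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
                   'other':    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# Group by both keys, but impute only the 'value' column.
df['value'] = (df.groupby(['category', 'name'])['value']
                 .transform(lambda x: x.fillna(x.mean())))
```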
Notice the column list in the group-by clause, and that we select the `value` column right after the group-by. This makes the transformation run only on that particular column. You could add the selection at the end, but then you would run the transformation for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to. A performance test, increasing the dataset by doing ...
... confirms that this speeds things up proportionally to the number of columns you don't have to impute:
On a final note, you can generalize even further if you want to impute more than one column, but not all:
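The closing code block is also missing; one way to impute a subset of columns, under the same hypothetical `category`/`name`/`value`/`other` layout as above:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two measure columns to impute.
df = pd.DataFrame({'category': ['X', 'X', 'Y', 'Y'],
                   'name':     ['A', 'A', 'A', 'A'],
                   'value':    [1.0, np.nan, 3.0, np.nan],
                   'other':    [np.nan, 2.0, np.nan, 4.0]})

# Pass a list of measure columns to impute several (but not all) of them;
# transform applies the lambda per column within each group.
cols = ['value', 'other']
df[cols] = (df.groupby(['category', 'name'])[cols]
              .transform(lambda x: x.fillna(x.mean())))
```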
I'd do it this way:

`fillna` + `groupby` + `transform` + `mean`

This seems intuitive:
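The code block for this approach is missing; based on the description below, it presumably reads (using the question's df):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C']})

# transform('mean') broadcasts each group's mean back onto the original
# index, so the result can be passed straight to fillna.
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
```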
The `groupby` + `transform` syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous `lambda` function.