Pandas - Replace outliers with groupby mean

I have a pandas DataFrame that I would like to split into groups, calculate each group's mean and standard deviation, and then replace all outliers with the mean of their group. A value counts as an outlier if it is more than 3 standard deviations away from its group mean.

df = pandas.DataFrame({'a': ['A','A','A','B','B','B','B'], 'b': [1.1,1.2,1.1,3.3,3.4,3.3,100.0]})

I thought that the following would work:

df.groupby('a')['b'].transform(lambda x: x[i] if np.abs(x[i]-x.mean())<=(3*x.std()) else x.mean() for i in range(0,len(x)))

but get the following error:

NameError: name 'x' is not defined

I have also tried defining a transform function separately:

def trans_func(x):
    mean = x.mean()
    std = x.std()
    length = len(x)
    for i in range(0,length):
        if abs(x[i]-mean)<=(3*std):
            return x
        else:
            return mean

and then calling it like so:

df.groupby('a')['b'].transform(lambda x: trans_func(x))

but I get a different error:

KeyError: 0

Finally, I resorted to creating a separate column altogether:

df['c'] = [df.groupby('a')['b'].transform(mean) if df.groupby('a')['b'].transform(lambda x: (x - x.mean()) / x.std()) > 3 else df['b']]

but this hasn't worked either:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Any advice much appreciated.

Tags: python pandas

3 Answers

爷、活的狠高调

Reply #2 · 2020-06-26 19:43

It would be more appropriate to first remove the outliers and then calculate the group means used for replacement. If the replacement mean is calculated with the outliers included, the mean itself is skewed by them.
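
A minimal sketch of that idea, assuming the question's column names `a` and `b`. The helper name `replace_with_clean_mean` and the `n_std` parameter are hypothetical. It flags outliers using the full group's mean and standard deviation, then fills them with the mean of the remaining values. The threshold is set to 1 here only because, in this tiny sample, no value lies more than 3 standard deviations from its group mean.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1.1, 1.2, 1.1, 3.3, 3.4, 3.3, 100.0]})

def replace_with_clean_mean(group, n_std=3):
    # Flag values more than n_std standard deviations from the group mean.
    outliers = (group - group.mean()).abs() > n_std * group.std()
    # Mean of the non-outlying values only, so the outliers do not skew it.
    clean_mean = group[~outliers].mean()
    # mask() returns a new Series and leaves the input untouched.
    return group.mask(outliers, clean_mean)

# n_std=1 is an illustrative choice for this small sample, not a recommendation.
result = df.groupby('a')['b'].transform(replace_with_clean_mean, n_std=1)
print(result)
```

With this data the 100.0 in group B is replaced by the mean of the other three B values. The 1.2 in group A is replaced as well, since a cutoff of one standard deviation is very strict.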

霸刀☆藐视天下

Reply #3 · 2020-06-26 19:58

Try this:

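def replace(group):
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > 3*std
    # mask() returns a new Series with the flagged values replaced,
    # so the group is not modified in place
    return group.mask(outliers, mean)        # or use "group[~outliers].mean()" instead of "mean"

df['b'] = df.groupby('a')['b'].transform(replace)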

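Note: With the 3*std threshold, the 100 in your last group is not treated as an outlier. That group has a mean of 27.5 and a standard deviation of 48.33, so 100 lies only about 1.5 standard deviations from the mean and stays in the result. If you want to eliminate it, you can replace 3*std with just 1*std, but be aware that such a strict cutoff also flags the 1.2 in group A.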
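
To see why the 3*std rule leaves 100 alone, it helps to print each value's distance from its group mean in units of the group standard deviation. This is a small sketch using the question's data; the column name `z` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1.1, 1.2, 1.1, 3.3, 3.4, 3.3, 100.0]})

grouped = df.groupby('a')['b']
# Absolute distance from the group mean, measured in group standard deviations.
df['z'] = (df['b'] - grouped.transform('mean')).abs() / grouped.transform('std')
print(df)
```

The value 100.0 comes out at roughly 1.5, well short of 3. With the sample standard deviation, no value in a group of n observations can lie more than (n - 1) / sqrt(n) standard deviations from the mean. That bound is 1.5 for n = 4, so a 3*std rule cannot flag anything in groups this small.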

狗以群分

Reply #4 · 2020-06-26 20:05

Hope this helps:

Step 1, remove outliers (reference from pandas group by remove outliers):

def is_outlier(s):
    lower_limit = s.mean() - (s.std() * 3)
    upper_limit = s.mean() + (s.std() * 3)
    return ~s.between(lower_limit, upper_limit)

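# 'b' is the value column in the question's DataFrame;
# transform() keeps the original index, so the mask aligns with df
df = df[~df.groupby('a')['b'].transform(is_outlier)]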

Step 2, replace outliers (reference from elyase):

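def replace(group):
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > 3*std
    # mask() returns a new Series with the flagged values replaced,
    # so the group is not modified in place
    return group.mask(outliers, mean)        # or use "group[~outliers].mean()" instead of "mean"

df['b'] = df.groupby('a')['b'].transform(replace)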
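
For comparison, the same replacement can be sketched without a custom function, using the built-in 'mean' and 'std' transforms together with `Series.where`. This assumes the question's columns `a` and `b`. The names `n_std`, `within`, `group_mean` and `group_std` are illustrative, and the result goes into a new column `c` as in the question's last attempt.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1.1, 1.2, 1.1, 3.3, 3.4, 3.3, 100.0]})

n_std = 3  # hypothetical threshold, matching the question's definition

grouped = df.groupby('a')['b']
group_mean = grouped.transform('mean')  # group mean broadcast to every row
group_std = grouped.transform('std')    # group std broadcast to every row

# True where the value lies within n_std standard deviations of its group mean.
within = (df['b'] - group_mean).abs() <= n_std * group_std
# Keep values that are within range; replace the rest with the group mean.
df['c'] = df['b'].where(within, group_mean)
print(df)
```

Because the comparison is element-wise and the selection is done by `where`, no `if` is applied to a whole Series, which is what raised the ValueError in the question. With the 3 standard deviation threshold and this sample data, column `c` ends up identical to column `b`.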
