Pandas - Replace outliers with groupby mean

2020-06-26 19:13发布

I have a pandas dataframe which I would like to split into groups, calculate the mean and standard deviation, and then replace all outliers with the mean of the group. Outliers are defined as such if they are more than 3 standard deviations away from the group mean.

df = pandas.DataFrame({'a': ['A','A','A','B','B','B','B'], 'b': [1.1,1.2,1.1,3.3,3.4,3.3,100.0]})

I thought that the following would work:

df.groupby('a')['b'].transform(lambda x: x[i] if np.abs(x[i]-x.mean())<=(3*x.std()) else x.mean() for i in range(0,len(x)))

but get the following error:

NameError: name 'x' is not defined

I have also tried defining a transform function separately:

def trans_func(x):
    mean = x.mean()
    std = x.std()
    length = len(x)
    for i in range(0,length):
        if abs(x[i]-mean)<=(3*std):
            return x
        else:
            return mean

and then calling it like so:

df.groupby('a')['b'].transform(lambda x: trans_func(x))

but I get a different error:

KeyError: 0

Finally, I resorted to creating a separate column altogether:

df['c'] = [df.groupby('a')['b'].transform(mean) if df.groupby('a')['b'].transform(lambda x: (x - x.mean()) / x.std()) > 3 else df['b']] 

but this hasn't worked either:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Any advice much appreciated.

标签: python pandas
3条回答
爷、活的狠高调
2楼-- · 2020-06-26 19:43

It would be more appropriate to first remove outliers and then calculate group means for replacement. If a mean for replacement is calculated with outliers the mean is affected by the outliers

查看更多
霸刀☆藐视天下
3楼-- · 2020-06-26 19:58

Try this:

def replace(group):
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > 3*std
    group[outliers] = mean        # or "group[~outliers].mean()"
    return group

df.groupby('a').transform(replace)

Note: If you want to eliminate the 100 in your last group you can replace 3*std by just 1*std. The standard deviation in this group is 48.33 so it would be included in the result.

查看更多
狗以群分
4楼-- · 2020-06-26 20:05

Hope this would be helpful:

Step 1, remove outliers (reference from pandas group by remove outliers):

def is_outlier(s):
    lower_limit = s.mean() - (s.std() * 3)
    upper_limit = s.mean() + (s.std() * 3)
    return ~s.between(lower_limit, upper_limit)

df = df[~df.groupby('a')['count'].apply(is_outlier)]

Step 2, replace outlier (reference from elyase):

def replace(group):
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > 3*std
    group[outliers] = mean        # or "group[~outliers].mean()"
    return group

df.groupby('a').transform(replace)
查看更多
登录 后发表回答