I have a pandas dataframe which I would like to split into groups, calculate the mean and standard deviation, and then replace all outliers with the mean of the group. Outliers are defined as such if they are more than 3 standard deviations away from the group mean.
df = pandas.DataFrame({'a': ['A','A','A','B','B','B','B'], 'b': [1.1,1.2,1.1,3.3,3.4,3.3,100.0]})
I thought that the following would work:
df.groupby('a')['b'].transform(lambda x: x[i] if np.abs(x[i]-x.mean())<=(3*x.std()) else x.mean() for i in range(0,len(x)))
but get the following error:
NameError: name 'x' is not defined
I have also tried defining a transform function separately:
def trans_func(x):
mean = x.mean()
std = x.std()
length = len(x)
for i in range(0,length):
if abs(x[i]-mean)<=(3*std):
return x
else:
return mean
and then calling it like so:
df.groupby('a')['b'].transform(lambda x: trans_func(x))
but I get a different error:
KeyError: 0
Finally, I resorted to creating a separate column altogether:
df['c'] = [df.groupby('a')['b'].transform(mean) if df.groupby('a')['b'].transform(lambda x: (x - x.mean()) / x.std()) > 3 else df['b']]
but this hasn't worked either:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Any advice much appreciated.
It would be more appropriate to first remove outliers and then calculate group means for replacement. If a mean for replacement is calculated with outliers the mean is affected by the outliers
Try this:
Note: If you want to eliminate the 100 in your last group you can replace
3*std
by just1*std
. The standard deviation in this group is 48.33 so it would be included in the result.Hope this would be helpful:
Step 1, remove outliers (reference from pandas group by remove outliers):
Step 2, replace outlier (reference from elyase):