Is there a built-in way to filter a column by IQR (i.e. keep values between Q1 - 1.5*IQR and Q3 + 1.5*IQR)? Any other suggestions for generalized filtering in pandas would also be appreciated.
Answer 1:
As far as I know, the query method offers the most compact notation.
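import numpy as np
import pandas as pd

# Some test data
np.random.seed(33454)
# DataFrame.append was removed in pandas 2.0, so the pieces are joined with pd.concat
df = pd.concat(
    [
        # Baseline values: random integers in [0, 100)
        pd.DataFrame({'nb': np.random.randint(0, 100, 20)}),
        # Adding some outliers
        pd.DataFrame({'nb': np.random.randint(100, 200, 2)}),
    ],
    # Resetting the index
    ignore_index=True,
)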
# Computing IQR
Q1 = df['nb'].quantile(0.25)
Q3 = df['nb'].quantile(0.75)
IQR = Q3 - Q1
# Filtering Values between Q1-1.5IQR and Q3+1.5IQR
filtered = df.query('(@Q1 - 1.5 * @IQR) <= nb <= (@Q3 + 1.5 * @IQR)')
Then we can plot the result to check the difference. We observe that the outlier in the left boxplot (the cross at 183) does not appear anymore in the filtered series.
# Plotting the result to check the difference
df.join(filtered, rsuffix='_filtered').boxplot()
Since writing this answer, I've written a post on this topic where you may find more information.
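If a plot is not convenient, the same check can be done by listing the rows the filter dropped. The following is only a sketch: it uses a small hand-checkable data set (illustrative values, not the random data above) and the names lower, upper and removed are introduced here for the example.

```python
import pandas as pd

# Small illustrative data set; 183 is the only obvious outlier
df = pd.DataFrame({'nb': [10, 12, 14, 15, 18, 20, 22, 25, 183]})

Q1 = df['nb'].quantile(0.25)
Q3 = df['nb'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Same filter as above, with the bounds precomputed
filtered = df.query('@lower <= nb <= @upper')

# Rows present in df but absent from filtered
removed = df.loc[~df.index.isin(filtered.index)]
print(removed)
```

Comparing indexes works here because query keeps the original index labels of the rows it selects.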
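Answer 2:
Another approach uses Series.between(), which here keeps the values lying between Q1 and Q3 (pandas 1.3+ takes the inclusive argument as a string; older versions used inclusive=True):
iqr = df['col'][df['col'].between(df['col'].quantile(.25), df['col'].quantile(.75), inclusive='both')]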
Drawn out:
q1 = df['col'].quantile(.25)
q3 = df['col'].quantile(.75)
# inclusive='both' keeps the endpoints (pandas 1.3+; older versions used inclusive=True)
mask = df['col'].between(q1, q3, inclusive='both')
iqr = df.loc[mask, 'col']
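Note that the snippet above keeps only the middle 50% of the values (those between Q1 and Q3), whereas the question asks for the wider Q1 - 1.5*IQR to Q3 + 1.5*IQR range. A minimal sketch of both variants side by side, using made-up values and the illustrative names within_iqr, inliers and outliers:

```python
import pandas as pd

# Illustrative values only
df = pd.DataFrame({'col': [10, 12, 14, 15, 18, 20, 22, 25, 183]})

q1 = df['col'].quantile(.25)
q3 = df['col'].quantile(.75)
iqr = q3 - q1

# Values inside the interquartile range itself (between Q1 and Q3)
within_iqr = df.loc[df['col'].between(q1, q3), 'col']

# Values inside the 1.5 * IQR fences; between() includes both
# endpoints by default, so the inclusive argument is left out here
mask = df['col'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
inliers = df.loc[mask, 'col']
outliers = df.loc[~mask, 'col']
```

Leaving out the inclusive argument sidesteps the difference between pandas versions, since including both endpoints is the default either way.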
Answer 3:
This will give you the subset of df which lies in the IQR of the given column:
def get_subset_by_IQR(df, column):
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    # Boolean mask: True for rows whose value lies between Q1 and Q3
    iqr = (df[column] >= q1) & (df[column] <= q3)
    return df.loc[iqr]
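The function above could be generalized along the lines the question asks for. The following is one possible sketch, not a pandas built-in: filter_by_iqr and its parameters k, lower_q and upper_q are hypothetical names chosen for this example. With k=0 it behaves like get_subset_by_IQR, and with k=1.5 it applies the fences from the question.

```python
import pandas as pd

def filter_by_iqr(df, column, k=1.5, lower_q=0.25, upper_q=0.75):
    """Hypothetical helper: keep rows whose value in `column` lies within
    [Q_low - k * spread, Q_high + k * spread], where spread = Q_high - Q_low."""
    q_low = df[column].quantile(lower_q)
    q_high = df[column].quantile(upper_q)
    spread = q_high - q_low
    mask = df[column].between(q_low - k * spread, q_high + k * spread)
    return df.loc[mask]

# Illustrative data: 100 is the only obvious outlier
data = pd.DataFrame({
    'value': [1, 2, 3, 4, 5, 6, 7, 8, 100],
    'label': list('abcdefghi'),
})

no_outliers = filter_by_iqr(data, 'value')        # fences at 1.5 * IQR
middle_half = filter_by_iqr(data, 'value', k=0)   # only values between Q1 and Q3
```

Because the function returns df.loc[mask], the other columns of each kept row come along unchanged.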