I have a pandas DataFrame with a few columns.
Now I know that certain rows are outliers based on a certain column value.
For instance, the column 'Vol' has all values around 12xx, and one value is 4000 (an outlier).
Now I would like to exclude the rows whose 'Vol' value is an outlier like this. Essentially, I need a filter on the DataFrame that selects all rows where the values of a certain column are within, say, 3 standard deviations of the mean.
What is an elegant way to achieve this?
A full example with data and 2 groups follows:
Imports:
Example data with 2 groups (G1: group 1, G2: group 2):
Read the text data into a pandas DataFrame:
Define the outliers using standard deviations:
Define filtered data values and the outliers:
Print the result:
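The original code for these steps is not shown here, so the following is only a minimal sketch of how the walkthrough above could look. The sample values, the column names `Group` and `Vol`, and the 3-standard-deviation cut-off are assumptions for illustration. Each group is scored against its own mean and standard deviation:

```python
import io

import pandas as pd

# Hypothetical sample data: two groups, each with one obvious outlier in 'Vol'.
TEXT = """Group Vol
G1 1200
G1 1220
G1 1210
G1 1215
G1 1205
G1 1225
G1 1212
G1 1208
G1 1218
G1 1203
G1 1222
G1 4000
G2 500
G2 510
G2 505
G2 495
G2 508
G2 502
G2 498
G2 507
G2 503
G2 499
G2 506
G2 2000
"""

# Read the whitespace-separated text into a DataFrame.
df = pd.read_csv(io.StringIO(TEXT), sep=r"\s+")

# z-score of each 'Vol' value relative to its own group's mean and std.
grouped = df.groupby("Group")["Vol"]
z_scores = (df["Vol"] - grouped.transform("mean")) / grouped.transform("std")

# Assumed threshold: anything more than 3 standard deviations from the group mean.
N_STD = 3
is_outlier = z_scores.abs() > N_STD

filtered = df[~is_outlier]
outliers = df[is_outlier]

print(filtered)
print(outliers)
```

Note that with only a handful of rows per group, a single extreme value inflates the standard deviation so much that it may never exceed 3 standard deviations, which is why the sketch uses 12 rows per group.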
Since I am in a very early stage of my data science journey, I am treating outliers with the code below.
Use boolean indexing as you would do in numpy.array
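As a rough sketch of that idea (the data and the column name `Vol` are made up for illustration, and the 3-standard-deviation cut-off comes from the question), a boolean mask can be built from the column and then used to index the DataFrame:

```python
import pandas as pd

# Hypothetical data: 30 values around 1200 plus one outlier of 4000.
df = pd.DataFrame({"Vol": [1200 + (i % 10) for i in range(30)] + [4000]})

# True for rows whose 'Vol' lies within 3 standard deviations of the mean.
mask = (df["Vol"] - df["Vol"].mean()).abs() <= 3 * df["Vol"].std()

# Boolean indexing keeps only the rows where the mask is True.
filtered = df[mask]
print(filtered)
```

Indexing with `df[~mask]` instead would return the rows that were rejected.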
For a series it is similar:
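A minimal sketch for the Series case, again with made-up values and the 3-standard-deviation cut-off from the question:

```python
import pandas as pd

# Hypothetical Series: 30 values around 1200 plus one outlier of 4000.
s = pd.Series([1200 + (i % 10) for i in range(30)] + [4000], name="Vol")

# Keep only the entries within 3 standard deviations of the mean.
filtered = s[(s - s.mean()).abs() <= 3 * s.std()]
print(filtered)
```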
My function for dropping outliers
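The function itself is not shown here, so the version below is only an illustrative sketch, not the poster's code. The name `drop_outliers` and its parameters `column` and `n_std` are hypothetical:

```python
import pandas as pd


def drop_outliers(df: pd.DataFrame, column: str, n_std: float = 3.0) -> pd.DataFrame:
    """Return the rows of df whose `column` value lies within n_std standard
    deviations of that column's mean. The input DataFrame is not modified."""
    mean = df[column].mean()
    std = df[column].std()
    # NaN values compare as False here, so rows with missing values are dropped too.
    keep = (df[column] - mean).abs() <= n_std * std
    return df[keep]


# Hypothetical usage: 30 values around 1200 plus one outlier of 4000.
data = pd.DataFrame({"Vol": [1200 + (i % 10) for i in range(30)] + [4000]})
cleaned = drop_outliers(data, "Vol")
print(len(data), len(cleaned))
```

Passing a smaller `n_std` (for example 2) makes the filter stricter.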
For each of your dataframe columns, you could get the quantile with:
and then filter with:
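The two steps could look roughly like this. The 1% and 99% cut-offs, the column name `Vol`, and the data are assumptions for illustration:

```python
import pandas as pd

# Hypothetical data: 30 values around 1200 plus one outlier of 4000.
df = pd.DataFrame({"Vol": [1200 + (i % 10) for i in range(30)] + [4000]})

# Step 1: compute the lower and upper quantiles of the column (assumed cut-offs).
q_low = df["Vol"].quantile(0.01)
q_hi = df["Vol"].quantile(0.99)

# Step 2: keep the rows whose value lies between the two quantiles (inclusive).
filtered = df[df["Vol"].between(q_low, q_hi)]
print(filtered)
```

Unlike the standard-deviation approach, a quantile filter trims the tails whether or not they contain real outliers, so the cut-offs should be chosen with the data in mind.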