I have this data which has outliers. How can I find the Mahalanobis distance and use it to remove the outliers?
Tags: machine-learning
I found @Nipun Wijerathne's answer incomplete and a bit messy, so I decided to provide an MCVE for future readers (the MCVE is at the end, actually :D), but first let me give some general guidelines:
As already mentioned, the Euclidean metric fails to find the correct distance because it measures the ordinary straight-line distance. In a multi-dimensional space of correlated variables, two points may appear to be the same distance from the mean while in reality one of them sits far away from the data cloud (i.e. it is an extreme value).
The solution is the Mahalanobis Distance, which does something similar to feature scaling by projecting the variables onto their eigenvectors instead of the original axes.
It applies the following formula:

D(x) = sqrt( (x - m)^T · S^(-1) · (x - m) )

in which:

- x is the observation whose distance we want to find,
- m is the mean of the observations,
- S is the covariance matrix.

Refresher:
The covariance represents the direction of the relationship between two variables (i.e. positive, negative or zero), so it shows how strongly one variable is related to changes in the other.
Implementation
Consider this 6x3 dataset example, in which each row represents an input / example and each column represents a feature of the example:
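The original values aren't reproduced here, so the numbers below are illustrative; any 6x3 array with one row far from the rest works:

```python
import numpy as np

# 6 examples (rows) x 3 features (columns); values are made up
data = np.array([[1.0, 2.0,  3.0],
                 [1.1, 2.6,  3.5],
                 [0.9, 1.8,  3.2],
                 [1.3, 2.1,  2.8],
                 [5.0, 9.5, 14.0],   # a row that sits far from the cloud
                 [1.2, 2.3,  3.1]])
```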
First we need to create the covariance matrix of the features, and that's why we set the parameter rowvar to False in the numpy.cov function, so that each column represents a variable.

Then we find the inverse of the covariance matrix:
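A sketch of these two steps, continuing from the data array above:

```python
# Covariance matrix of the features; rowvar=False tells numpy that
# each column is a variable and each row is an observation
covariance_matrix = np.cov(data, rowvar=False)

# S^(-1) in the formula above
inv_covariance_matrix = np.linalg.inv(covariance_matrix)
```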
But before proceeding, we should check, as mentioned above, that the matrix and its inverse are symmetric and positive definite. For this we use the Cholesky decomposition algorithm, which fortunately is already implemented in numpy.linalg.cholesky:
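One way to wrap that check (is_pos_def is a helper name I'm introducing here; np.linalg.cholesky raises LinAlgError when the decomposition fails):

```python
def is_pos_def(matrix):
    # Cholesky succeeds only for positive definite matrices
    try:
        np.linalg.cholesky(matrix)
        return True
    except np.linalg.LinAlgError:
        return False

assert is_pos_def(covariance_matrix)
assert is_pos_def(inv_covariance_matrix)
```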
Next, we find the mean m of the variables on each feature (shall I say dimension) and save them in an array like this (note that each row is repeated just to take advantage of matrix subtraction, as shown next):
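For example, using np.tile to do the row repetition in one call (vars_mean is the name used in the rest of this answer):

```python
# Mean of each feature (column), repeated once per example so that
# vars_mean has the same shape as data and subtracts element-wise
vars_mean = np.tile(data.mean(axis=0), (data.shape[0], 1))
```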
Next, we find x - m (i.e. the differential); since we have already vectorized vars_mean, all we need to do is subtract the two arrays. Finally, we apply the formula like this:
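A direct translation of the formula, looping over the rows of diff (md is just the list name I chose):

```python
diff = data - vars_mean   # x - m for every example at once

# Mahalanobis distance of each example: sqrt(diff_i . S^(-1) . diff_i)
md = []
for i in range(len(diff)):
    md.append(np.sqrt(diff[i].dot(inv_covariance_matrix).dot(diff[i])))
```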
Note the following:

- The inverse of the covariance matrix is number_of_features x number_of_features.
- The diff matrix has the same shape as the original data matrix: number_of_examples x number_of_features.
- Each diff[i] (i.e. row) is 1 x number_of_features.
- diff[i].dot(inv_covariance_matrix) will be 1 x number_of_features, and when we multiply again by diff[i], numpy automatically treats the latter as a column matrix, i.e. number_of_features x 1, so the final result becomes a single value! (i.e. no need for a transpose)

In order to detect the outliers, we should specify a threshold: we flag the observations whose Mahalanobis distance lies more than k standard deviations away from the mean of the distances, with k = 2.0 for extreme values and k = 3.0 for very extreme values, according to the 68-95-99.7 rule.

Putting All Together
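Below is my attempt at the promised MCVE as a single runnable script; it is a sketch of the steps above, and the helper names (mahalanobis_dist, detect_outliers) and sample numbers are mine:

```python
import numpy as np

def mahalanobis_dist(data):
    """Mahalanobis distance of every row of `data` to the data mean."""
    covariance_matrix = np.cov(data, rowvar=False)
    inv_covariance_matrix = np.linalg.inv(covariance_matrix)

    # Sanity check: both matrices must be positive definite,
    # otherwise np.linalg.cholesky raises LinAlgError
    np.linalg.cholesky(covariance_matrix)
    np.linalg.cholesky(inv_covariance_matrix)

    diff = data - data.mean(axis=0)  # x - m for every row
    # Row-wise quadratic form: diff_i . S^(-1) . diff_i
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, inv_covariance_matrix, diff))

def detect_outliers(data, k=2.0):
    """Indices of rows whose distance is more than k standard deviations
    above the mean distance (k = 2.0 for extreme values, k = 3.0 for
    very extreme values, per the 68-95-99.7 rule)."""
    md = mahalanobis_dist(data)
    threshold = md.mean() + k * md.std()
    return np.where(md > threshold)[0]

data = np.array([[1.0, 2.0,  3.0],
                 [1.1, 2.6,  3.5],
                 [0.9, 1.8,  3.2],
                 [1.3, 2.1,  2.8],
                 [5.0, 9.5, 14.0],
                 [1.2, 2.3,  3.1]])

print(mahalanobis_dist(data))
print(detect_outliers(data, k=2.0))
```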
Result
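Running the MCVE above prints one Mahalanobis distance per row, followed by the indices of the rows flagged as outliers; the exact numbers depend on your data and on the chosen k.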
In multivariate data, Euclidean distance fails if there exists covariance between the variables (i.e., in your case, between X, Y and Z).
Therefore, what the Mahalanobis Distance does is:

- It transforms the variables into an uncorrelated space.
- It makes each variable's variance equal to 1.
- It then calculates the simple Euclidean distance.
We can calculate the Mahalanobis Distance for each data sample as follows,
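Here is a minimal sketch of those three steps, assuming a numpy array data of shape (n_samples, n_features); the eigendecomposition route and the function name mahalanobis are my choices for illustration:

```python
import numpy as np

def mahalanobis(data):
    centered = data - data.mean(axis=0)   # move the mean to the origin
    cov = np.cov(data, rowvar=False)      # covariance of X, Y, Z

    # Eigenvectors give the uncorrelated directions,
    # eigenvalues give the variance along each of them
    eig_vals, eig_vecs = np.linalg.eigh(cov)

    # 1) project onto the eigenvectors (uncorrelated space)
    # 2) divide by the std deviation along each direction (variance = 1)
    whitened = centered.dot(eig_vecs) / np.sqrt(eig_vals)

    # 3) simple Euclidean distance in the transformed space
    return np.sqrt((whitened ** 2).sum(axis=1))

# e.g. keep only the rows whose distance is within 2 standard
# deviations of the mean distance:
# md = mahalanobis(data)
# cleaned = data[md <= md.mean() + 2 * md.std()]
```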
The code is commented so that you can follow each step.
Hope this helps.
References,
http://mccormickml.com/2014/07/21/mahalanobis-distance/
http://kldavenport.com/mahalanobis-distance-and-outliers/