Multivariate Outlier Removal With Mahalanobis Dist

I have this data which have outlier . How can i find Mahalanobis disantance and use it to remove outlier.

回答1:

I found @Nipun Wijerathne answer incomplete and a bit messy, so I decided to provide a MCVE for future readers (MCVE at the end actually :D), but first let me put some general guidelines:

Practically speaking, if you have a lot of features and lesser examples (i.e inputs), Mahalanobis algorithm tends to give misleading results (you can try it yourself), so the more features you have, the more examples you should provide.
The covariance matrix must be Symmetric and Positive Definite to make the algorithm works, so you should check before proceeding!

As it's already mentioned, Euclidean Metric fails to find the correct distance because it tries to get ordinary straight-line distance. So if we have multi-dimension space of variables, two points may look to have same distance from the Mean but in reality one of them is far away from the data cloud (i.e. it's extreme value).

The solution is Mahalanobis Distance which makes something similar to the feature scaling via taking the Eigenvectors of the variables instead of the original axis.

It applies the following formula:

in which:

x is the observation to find its distance
m is the mean of the observations
S is the Covariance Matrix

Refresher:

The Covariance represents the direction of the relationship between two variables (i.e. positive, negative or zero), so it shows strength of how one variable is related to changes of the other.

Implementation

Consider this 6x3 dataset example in which each row represents an input / example and each column represents a feature for the example:

First we need to create a Covariance Matrix of the features of each input, and that's why we set the parameter rowvar to False in the numpy.cov function, so each column represents a variable:

covariance_matrix = np.cov(data, rowvar=False)  
# data here looks similar to the above 2D / Matrix example in the pictures

Then we find the Inverse of the Covariance Matrix:

inv_covariance_matrix = np.linalg.inv(covariance_matrix)

But before proceeding, we should check -as mentioned above- if the matrix and its inverse are Symmetric and Positive Definite, we use for this Cholesky Decomposition Algorithm, fortunately it's already implemented in numpy.linalg.cholesky:

def is_pos_def(A):
    if np.allclose(A, A.T):
        try:
            np.linalg.cholesky(A)
            return True
        except np.linalg.LinAlgError:
            return False
    else:
        return False

Next, we find the mean m of the variables on each feature (shall I say dimension) and save them in an array like this:

vars_mean = []
for i in range(data.shape[0]):
    vars_mean.append(list(data.mean(axis=0)))  # axis=0 means each column in the 2D array

Note that I repeated each row just to avail of matrix subtraction as will be shown next.

Next, we find x - m (i.e. the differential) but we have already vectorized vars_mean so all we need to do is:

diff = data - vars_mean
# here we subtract the mean of feature from each feature of each example

Finally, apply the formula like this:

md = []
for i in range(len(diff)):
    md.append(np.sqrt(diff[i].dot(inv_covariance_matrix).dot(diff[i])))

Note the followings:

The dimension of the covariance matrix inverse is: number_of_features x number_of_features
The dimension of the diff matrix is similar to the original data matrix: number_of_examples x number_of_features
Thus, each diff[i] (i.e. row) is 1 x number_of_features.
So according to the Matrix Multiplication rule, the resulted matrix from diff[i].dot(inv_covariance_matrix) will be 1 x number_of_features and when we multiply again by diff[i] numpy automatically considers the later as a column matrix i.e. number_of_features x 1 so the final result will become a single value! (i.e. no need for transpose)

In order to detect the outliers, we should specify the threshold; we do so by multiplying the mean of the Mahalanobis Distance Results by the extremeness degree k in which k = 2.0 * std for extreme values and 3.0 * std for the very extreme values and that's according to the 68–95–99.7 rule (image for illustration from same link):

Putting All Together

import numpy as np


def create_data(examples=50, features=5, upper_bound=10, outliers_fraction=0.1, extreme=False):
    '''
    This method for testing (i.e. to generate a 2D array of data)
    '''
    data = []
    magnitude = 4 if extreme else 3
    for i in range(examples):
        if (examples - i) <= round((float(examples) * outliers_fraction)):
            data.append(np.random.poisson(upper_bound ** magnitude, features).tolist())
        else:
            data.append(np.random.poisson(upper_bound, features).tolist())
    return np.array(data)


def MahalanobisDist(data, verbose=False):
    covariance_matrix = np.cov(data, rowvar=False)
    if is_pos_def(covariance_matrix):
        inv_covariance_matrix = np.linalg.inv(covariance_matrix)
        if is_pos_def(inv_covariance_matrix):
            vars_mean = []
            for i in range(data.shape[0]):
                vars_mean.append(list(data.mean(axis=0)))
            diff = data - vars_mean
            md = []
            for i in range(len(diff)):
                md.append(np.sqrt(diff[i].dot(inv_covariance_matrix).dot(diff[i])))

            if verbose:
                print("Covariance Matrix:\n {}\n".format(covariance_matrix))
                print("Inverse of Covariance Matrix:\n {}\n".format(inv_covariance_matrix))
                print("Variables Mean Vector:\n {}\n".format(vars_mean))
                print("Variables - Variables Mean Vector:\n {}\n".format(diff))
                print("Mahalanobis Distance:\n {}\n".format(md))
            return md
        else:
            print("Error: Inverse of Covariance Matrix is not positive definite!")
    else:
        print("Error: Covariance Matrix is not positive definite!")


def MD_detectOutliers(data, extreme=False, verbose=False):
    MD = MahalanobisDist(data, verbose)
    # one popular way to specify the threshold
    #m = np.mean(MD)
    #t = 3. * m if extreme else 2. * m
    #outliers = []
    #for i in range(len(MD)):
    #    if MD[i] > t:
    #        outliers.append(i)  # index of the outlier
    #return np.array(outliers)

    # or according to the 68–95–99.7 rule
    std = np.std(MD)
    k = 3. * std if extreme else 2. * std
    m = np.mean(MD)
    up_t = m + k
    low_t = m - k
    outliers = []
    for i in range(len(MD)):
        if (MD[i] >= up_t) or (MD[i] <= low_t):
            outliers.append(i)  # index of the outlier
    return np.array(outliers)


def is_pos_def(A):
    if np.allclose(A, A.T):
        try:
            np.linalg.cholesky(A)
            return True
        except np.linalg.LinAlgError:
            return False
    else:
        return False


data = create_data(15, 3, 10, 0.1)
print("data:\n {}\n".format(data))

outliers_indices = MD_detectOutliers(data, verbose=True)

print("Outliers Indices: {}\n".format(outliers_indices))
print("Outliers:")
for ii in outliers_indices:
    print(data[ii])

Result

data:
 [[ 12   7   9]
 [  9  16   7]
 [ 14  11  10]
 [ 14   5   5]
 [ 12   8   7]
 [  8   8  10]
 [  9  14   8]
 [ 12  12  10]
 [ 18  10   6]
 [  6  12  11]
 [  4  12  15]
 [  5  13  10]
 [  8   9   8]
 [106 116  97]
 [ 90 116 114]]

Covariance Matrix:
 [[ 980.17142857 1143.62857143 1035.6       ]
 [1143.62857143 1385.11428571 1263.12857143]
 [1035.6        1263.12857143 1170.74285714]]

Inverse of Covariance Matrix:
 [[ 0.03021777 -0.03563241  0.0117146 ]
 [-0.03563241  0.08684092 -0.06217448]
 [ 0.0117146  -0.06217448  0.05757261]]

Variables Mean Vector:
 [[21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8]]

Variables - Variables Mean Vector:
 [[ -9.8 -17.6 -12.8]
 [-12.8  -8.6 -14.8]
 [ -7.8 -13.6 -11.8]
 [ -7.8 -19.6 -16.8]
 [ -9.8 -16.6 -14.8]
 [-13.8 -16.6 -11.8]
 [-12.8 -10.6 -13.8]
 [ -9.8 -12.6 -11.8]
 [ -3.8 -14.6 -15.8]
 [-15.8 -12.6 -10.8]
 [-17.8 -12.6  -6.8]
 [-16.8 -11.6 -11.8]
 [-13.8 -15.6 -13.8]
 [ 84.2  91.4  75.2]
 [ 68.2  91.4  92.2]]

Mahalanobis Distance:
 [1.3669401667524865, 2.1796331318432967, 0.7470525416547134, 1.6364973119931507, 0.8351423113609481, 0.9128858131134882, 1.397144258271586, 0.35603382066414996, 1.4449501739129382, 0.9668775289588046, 1.490503433100514, 1.4021488309805878, 0.4500345257064412, 3.239353067840299, 3.260149280200771]

Outliers Indices: [13 14]

Outliers:
[106 116  97]
[ 90 116 114]

回答2:

In multivariate data, Euclidean distance fails if there exists covariance between variables (i.e. in your case X, Y, Z).

Therefore, what Mahalanobis Distance does is,

It transforms the variables into uncorrelated space.
Make each variables varience equals to 1.
Then calculate the simple Euclidean distance.

We can calculate the Mahalanobis Distance for each data sample as follows,

Here, I have provided the python code and added the comments so that you can understand the code.

import numpy as np

data= np.matrix([[1, 2, 3, 4, 5, 6, 7, 8],[1, 4, 9, 16, 25, 36, 49, 64],[1, 4, 9, 16, 25, 16, 49, 64]])

def MahalanobisDist(data):
    covariance_xyz = np.cov(data) # calculate the covarince matrix
    inv_covariance_xyz = np.linalg.inv(covariance_xyz) #take the inverse of the covarince matrix
    xyz_mean = np.mean(data[0]),np.mean(data[1]),np.mean(data[2])
    x_diff = np.array([x_i - xyz_mean[0] for x_i in x]) # take the diffrence between the mean of X variable the sample
    y_diff = np.array([y_i - xyz_mean[1] for y_i in y]) # take the diffrence between the mean of Y variable the sample
    z_diff = np.array([z_i - xyz_mean[2] for z_i in z]) # take the diffrence between the mean of Z variable the sample
    diff_xyz = np.transpose([x_diff, y_diff, z_diff])

    md = []
    for i in range(len(diff_xyz)):
        md.append(np.sqrt(np.dot(np.dot(np.transpose(diff_xyz[i]),inv_covariance_xyz),diff_xyz[i]))) #calculate the Mahalanobis Distance for each data sample
    return md

def MD_removeOutliers(data):
    MD = MahalanobisDist(data)
    threshold = np.mean(MD) * 1.5 # adjust 1.5 accordingly
    outliers = []
    for i in range(len(MD)):
        if MD[i] > threshold:
            outliers.append(i) # index of the outlier
    return np.array(outliers)

print(MD_removeOutliers(data))

Hope this helps.

References,