Implementation of EM algorithm for Gaussian Mixtur

Using the EM algorithm, I want to train a Gaussian Mixture model with four components on a given dataset. The set is three dimensional and contains 300 samples.

The problem is that after about 6 rounds of the EM algorithm, the covariance matrices sigma become close to singular according to matlab (rank(sigma) = 2 instead of 3). This in turn leads to undesired results like complex values evaluating the gaussian distribution gm(k,i).

Furthermore I used the log of the gaussian to account for underflow troubles - see E-step. I am not sure if this is correct and if I have to take the exp of the responsibilites p(w_k | x^(i), theta) somewhere else?

Can you tell me if my implementation of the EM algorithm is correct so far? And how to account for the problem with the close to singular covariance sigma?

Here is my implementation of the EM algorithm:

First I initialized the means and the covariance of the components using kmeans:

load('data1.mat');

X = Data'; % 300x3 data set
D = size(X,2); % dimension
N = size(X,1); % number of samples
K = 4; % number of Gaussian Mixture components

% Initialization
p = [0.2, 0.3, 0.2, 0.3]; % arbitrary pi
[idx,mu] = kmeans(X,K); % initial means of the components

% compute the covariance of the components
sigma = zeros(D,D,K);
for k = 1:K
    sigma(:,:,k) = cov(X(idx==k,:));
end

For the E-step I am using the following formula to calculate the responsibilities.

w_k are the k gaussian components.

x^(i) is a single datapoint (sample)

theta stands for the parameters of the gaussian mixture model: mu, Sigma, pi.

Here is the corresponding code:

% variables for convergence 
converged = 0;
prevLoglikelihood = Inf;
prevMu = mu;
prevSigma = sigma;
prevPi = p;
round = 0;
while (converged ~= 1)
    round = round +1
    gm = zeros(K,N); % gaussian component in the nominator
    sumGM = zeros(N,1); % denominator of responsibilities
    % E-step:  Evaluate the responsibilities using the current parameters
    % compute the nominator and denominator of the responsibilities
    for k = 1:K
        for i = 1:N
             Xmu = X-mu;
             % I am using log to prevent underflow of the gaussian distribution (exp("small value"))
             logPdf = log(1/sqrt(det(sigma(:,:,k))*(2*pi)^D)) + (-0.5*Xmu*(sigma(:,:,k)\Xmu'));
             gm(k,i) = log(p(k)) * logPdf;
             sumGM(i) = sumGM(i) + gm(k,i);
         end
    end

    % calculate responsibilities
    res = zeros(K,N); % responsibilities
    Nk = zeros(4,1);
    for k = 1:K
        for i = 1:N
            % I tried to use the exp(gm(k,i)/sumGM(i)) to compute res but this leads to sum(pi) > 1.
            res(k,i) = gm(k,i)/sumGM(i);
        end
        Nk(k) = sum(res(k,:));
    end

Nk(k) is computed using the formula given in the M-step and is used in the M-step to calculate the new probabilities p(k).

M-step

    % M-step: Re-estimate the parameters using the current responsibilities
    for k = 1:K
        for i = 1:N
            mu(k,:) = mu(k,:) + res(k,i).*X(k,:);
            sigma(:,:,k) = sigma(:,:,k) + res(k,i).*(X(k,:)-mu(k,:))*(X(k,:)-mu(k,:))';
        end
        mu(k,:) = mu(k,:)./Nk(k);
        sigma(:,:,k) = sigma(:,:,k)./Nk(k);
        p(k) = Nk(k)/N;
    end

Now in order to check for convergence the log-likelihood is computed using this formula:

    % Evaluate the log-likelihood and check for convergence of either 
    % the parameters or the log-likelihood. If not converged, go to E-step.
    loglikelihood = 0;
    for i = 1:N
        loglikelihood = loglikelihood + log(sum(gm(:,i)));
    end


    % Check for convergence of parameters
    errorLoglikelihood = abs(loglikelihood-prevLoglikelihood);
    if (errorLoglikelihood <= eps)
        converged = 1; 
    end

    errorMu = abs(mu(:)-prevMu(:));
    errorSigma = abs(sigma(:)-prevSigma(:));
    errorPi = abs(p(:)-prevPi(:));

    if (all(errorMu <= eps) && all(errorSigma <= eps) && all(errorPi <= eps))
        converged = 1;
    end

    prevLoglikelihood = loglikelihood;
    prevMu = mu;
    prevSigma = sigma;
    prevPi = p;

end % while

Is there something wrong with my Matlab implementation of EM algorithm for Gaussian Mixture Models?

Previous troubles:

The problem is that I cannot check for convergence using the log-likelihood because it is -Inf. This results from rounded zero values while evaluating the gaussian in the formula of the responsibilities (see E-step).

Can you tell me if my implementation of the EM algorithm is correct so far? And how to account for the problem with the rounded zero values?

Here is my implementation of the EM algorithm:

First I initialized the means and the covariance of the components using kmeans:

load('data1.mat');

X = Data'; % 300x3 data set
D = size(X,2); % dimension
N = size(X,1); % number of samples
K = 4; % number of Gaussian Mixture components

% Initialization
p = [0.2, 0.3, 0.2, 0.3]; % arbitrary pi
[idx,mu] = kmeans(X,K); % initial means of the components

% compute the covariance of the components
sigma = zeros(D,D,K);
for k = 1:K
    sigma(:,:,k) = cov(X(idx==k,:));
end

For the E-step I am using the following formula to calculate the responsibilities

Here is the corresponding code:

% variables for convergence 
converged = 0;
prevLoglikelihood = Inf;
prevMu = mu;
prevSigma = sigma;
prevPi = p;
round = 0;
while (converged ~= 1)
    round = round +1
    gm = zeros(K,N); % gaussian component in the nominator - 
                     % some values evaluate to zero
    sumGM = zeros(N,1); % denominator of responsibilities
    % E-step:  Evaluate the responsibilities using the current parameters
    % compute the nominator and denominator of the responsibilities
    for k = 1:K
        for i = 1:N
             % HERE values evalute to zero e.g. exp(-746.6228) = -Inf
             gm(k,i) = p(k)/sqrt(det(sigma(:,:,k))*(2*pi)^D)*exp(-0.5*(X(i,:)-mu(k,:))*inv(sigma(:,:,k))*(X(i,:)-mu(k,:))');
             sumGM(i) = sumGM(i) + gm(k,i);
         end
    end

    % calculate responsibilities
    res = zeros(K,N); % responsibilities
    Nk = zeros(4,1);
    for k = 1:K
        for i = 1:N
            res(k,i) = gm(k,i)/sumGM(i);
        end
        Nk(k) = sum(res(k,:));
    end

Nk(k) is computed using the formula given in the M-step.

M-step

    % M-step: Re-estimate the parameters using the current responsibilities
    mu = zeros(K,3);
    for k = 1:K
        for i = 1:N
            mu(k,:) = mu(k,:) + res(k,i).*X(k,:);
            sigma(:,:,k) = sigma(:,:,k) + res(k,i).*(X(k,:)-mu(k,:))*(X(k,:)-mu(k,:))';
        end
        mu(k,:) = mu(k,:)./Nk(k);
        sigma(:,:,k) = sigma(:,:,k)./Nk(k);
        p(k) = Nk(k)/N;
    end

Now in order to check for convergence the log-likelihood is computed using this formula:

    % Evaluate the log-likelihood and check for convergence of either 
    % the parameters or the log-likelihood. If not converged, go to E-step.
    loglikelihood = 0;
    for i = 1:N
        loglikelihood = loglikelihood + log(sum(gm(:,i)));
    end


    % Check for convergence of parameters
    errorLoglikelihood = abs(loglikelihood-prevLoglikelihood);
    if (errorLoglikelihood <= eps)
        converged = 1; 
    end

    errorMu = abs(mu(:)-prevMu(:));
    errorSigma = abs(sigma(:)-prevSigma(:));
    errorPi = abs(p(:)-prevPi(:));

    if (all(errorMu <= eps) && all(errorSigma <= eps) && all(errorPi <= eps))
        converged = 1;
    end

    prevLoglikelihood = loglikelihood;
    prevMu = mu;
    prevSigma = sigma;
    prevPi = p;

end % while

After the first round the loglikelihood is around 700. In the second round it is -Inf because some gm(k,i) values in the E-step are zero. Therefore the log is obviously negative infinity.

The zero values also lead to sumGM equals to zero and therefore leading to all NaN entries inside the mu and sigma matrices.

How can I solve this problem? Can you tell me if there is something wrong with my implementation? Could it be solved by increasing Matlab's precision somehow?

EDIT:

I added a scaling for the exp() term in gm(k,i). Unfortunately this doesn't help much. After some more rounds I still get the underflow problem.

scale = zeros(N,D);
for i = 1:N
    max = 0;
    for k = 1:K
        Xmu = X(i,:)-mu(k,:);
        if (norm(scale(i,:) - Xmu) > max)
            max = norm(scale(i,:) - Xmu);
            scale(i,:) = Xmu;
        end
    end
end


for k = 1:K
    for i = 1:N
        Xmu = X(i,:)-mu(k,:);
        % scale gm to prevent underflow
        Xmu = Xmu - scale(i,:);
        gm(k,i) = p(k)/sqrt(det(sigma(:,:,k))*(2*pi)^D)*exp(-0.5*Xmu*inv(sigma(:,:,k))*Xmu');
        sumGM(i) = sumGM(i) + gm(k,i);
    end
end

Further I noticed that kmeans initializes the means completely different compared to the following rounds where the means are computed in the M-step.

kmeans:

mu =   13.500000000000000   0.026602138870044   0.062415945993735
       88.500000000000000  -0.009869960132085  -0.075177888210981
       39.000000000000000  -0.042569305020309   0.043402772876513
       64.000000000000000  -0.024519281362918  -0.012586980924762

after M-step:

round = 2

mu = 1.000000000000000   0.077230046948357   0.024498886414254
     2.000000000000000   0.074260118474053   0.026484346404660
     3.000000000000002   0.070944016105476   0.029043085983168
     4.000000000000000   0.067613431480832   0.031641849205021

In the next rounds mu doesn't change at all. It stays the same as in round 2.

I guess this is caused because of the underflow in gm(k,i)? Either my implementation of the scaling is incorrect or the whole implementation of the algorithm is wrong somewhere :(

EDIT 2

After four rounds I got NaN values and looked into gm in more detail. Looking at only one sample (and without the 0.5 factor), gm becomes zero in all components. Put in matlab gm(:,1) = [0 0 0 0]. This in turn leads to sumGM equal to zero -> NaN because I divided by zero. I have given more details in

round = 1

mu = 62.0000   -0.0298   -0.0078
     37.0000   -0.0396    0.0481
     87.5000   -0.0083   -0.0728
     12.5000    0.0303    0.0614

gm(:,1) = [11.7488, 0.0000, 0.0000, 0.0000]


round = 2

mu = 1.0000    0.0772    0.0245
     2.0000    0.0743    0.0265
     3.0000    0.0709    0.0290
     4.0000    0.0676    0.0316


gm(:,1) = [0.0000, 0.0000, 0.0000, 0.3128]

round = 3

mu = 1.0000    0.0772    0.0245
     2.0000    0.0743    0.0265
     3.0000    0.0709    0.0290
     4.0000    0.0676    0.0316


gm(:,1) = [0, 0, 0.0000, 0.2867]


round = 4


mu = 1.0000    0.0772    0.0245
        NaN       NaN       NaN
     3.0000    0.0709    0.0290
     4.0000    0.0676    0.0316

gm(:,1) = 1.0e-105 * [0, NaN, 0, 0.5375]

First of all the means doesn't seem to change and are completely different compared to the initializaiton of kmeans.

And every sample (not just for the first one like here) corresponds only to one gaussian component according to the output of gm(:,1). Shouldn't the sample be "partially distributed" among every gaussian component?

EDIT3:

So I guess the problem with mu not changing was the first line in the M-step: mu = zeros(K,3);.

To account the underflow problem I am currently trying to use the log of the gaussian:

function logPdf = logmvnpdf(X, mu, sigma, D)
    Xmu = X-mu;
    logPdf = log(1/sqrt(det(sigma)*(2*pi)^D)) + (-0.5*Xmu*inv(sigma)*Xmu');
end

The new problem is the covariance matrix sigma. Matlab claims: Warning: Matrix is close to singular or badly scaled. Results may be inaccurate.

After 6 rounds I get imaginary values for gm (gaussian distribution).

The updated E-Step looks like this now:

gm = zeros(K,N); % gaussian component in the nominator
sumGM = zeros(N,1); % denominator of responsibilities


for k = 1:K
    for i = 1:N
        %gm(k,i) = p(k)/sqrt(det(sigma(:,:,k))*(2*pi)^D)*exp(-0.5*Xmu*inv(sigma(:,:,k))*Xmu');
        %gm(k,i) = p(k)*mvnpdf(X(i,:),mu(k,:),sigma(:,:,k));
        gm(k,i) = log(p(k)) + logmvnpdf(X(i,:), mu(k,:), sigma(:,:,k), D);
        sumGM(i) = sumGM(i) + gm(k,i);
    end
end

It looks like you should be able to use a scale factor scale(i) to bring gm(k, i) into a representable range, because if you multiply gm(k, i) by scale(i) this will end up multiplying sumGM(i) as well, and be cancelled away when you work out res(k, i) = gm(k, i) / sumGM(i).

I would make scale(i) = 1 / max_k(exp(-0.5*(X(i,:)-mu(k,:))) in theory, and actually calculate it without doing the exponentiation, so you end up dealing with its log, max_k(-0.5*(X(i,:)-mu(k,:)) - this gives you a common term you can add to each -0.5*(X(i,:)-mu(k,:) before using exp() and will keep at least the maximum within a representable range - anything that still underflows to zero after this correction you don't care about anyway, because it is vanishingly small compared to the other contributions.