Expectation Maximization seems to be a kind of probabilistic method to classify data. Please correct me if I am wrong and it is not actually a classifier.
What is an intuitive explanation of this EM technique? What is the "expectation" here, and what is being maximized?
EM is used to maximize the likelihood of a model with parameters θ in the presence of latent variables Z.
It's an iterative optimization:
E-step: given the current parameter estimate, compute the expected log-likelihood function Q over the latent variables Z.
M-step: find the θ that maximizes this Q.
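In symbols (a standard formulation; $X$ is the observed data, $Z$ the latent variables, $\theta$ the parameters):

$$Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X,\, \theta^{(t)}}\big[\log p(X, Z \mid \theta)\big], \qquad \theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(t)})$$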
GMM example (a code sketch follows below):
E-step: estimate the soft label assignments (responsibilities) for each data point given the current GMM parameter estimate.
M-step: find the new θ that maximizes the expected log-likelihood given these new label assignments.
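As a rough illustration, here is a minimal sketch of EM for a two-component 1-D GMM (plain NumPy; the function name, the fixed two components, and the toy data are my own simplifications, not from any particular library):

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """Minimal EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    # Crude initialization of weights, means and variances.
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: soft label assignments (responsibilities) under current parameters.
        dens = w / np.sqrt(2 * np.pi * var) * np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the soft assignments.
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

# Toy data: two overlapping Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 200)])
print(em_gmm_1d(x))
```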
K-means can also be seen as an EM-style algorithm (with hard assignments instead of soft ones), and there are a lot of animations explaining K-means.
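For comparison, a minimal K-means sketch (again plain NumPy, names my own), where the assignment step plays the role of the E-step and the centroid update plays the role of the M-step:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Lloyd's algorithm: the hard-assignment counterpart of EM for GMMs."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # "E-step": assign each point to its nearest center (hard labels).
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        # "M-step": move each center to the mean of its assigned points.
        # (Empty clusters are not handled in this sketch.)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers
```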
Technically, the term "EM" is a bit underspecified, but I assume you are referring to the Gaussian mixture model cluster analysis technique, which is an instance of the general EM principle.
Actually, EM cluster analysis is not a classifier. I know that some people consider clustering to be "unsupervised classification", but cluster analysis is actually something quite different.
The key difference, and the big misunderstanding that people coming from classification always have with cluster analysis, is this: in cluster analysis, there is no "correct solution". It is a knowledge discovery method; it is actually meant to find something new! This makes evaluation very tricky. It is often evaluated using a known classification as reference, but that is not always appropriate: the classification you have may or may not reflect what is in the data.
Let me give you an example: you have a large data set of customers, including gender data. A method that splits this data set into "male" and "female" is optimal when you compare it with the existing classes. In a "prediction" way of thinking this is good, as for new users you could now predict their gender. In a "knowledge discovery" way of thinking this is actually bad, because you wanted to discover some new structure in the data. A method that would e.g. split the data into elderly people and kids would score as badly as possible with respect to the male/female classes. However, that would be an excellent clustering result (if age wasn't given).
Now back to EM. Essentially, it assumes that your data is composed of multiple multivariate normal distributions (note that this is a very strong assumption, in particular when you fix the number of clusters!). It then tries to find a locally optimal model by alternating between improving the model and improving the assignment of objects to the model.
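If you want to try this without implementing it yourself, scikit-learn's GaussianMixture fits exactly such a model with EM (a minimal sketch; the toy data and the choice of 2 components are arbitrary):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two blobs, pretending we don't know the structure.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (150, 2))])

# Note the strong assumption: exactly 2 Gaussian components.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)        # hard cluster assignments
probs = gmm.predict_proba(X)   # soft assignments from the E-step
print(gmm.means_)
```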
For best results in a classification context, choose the number of clusters larger than the number of classes, or even apply the clustering to single classes only (to find out whether there is some structure within the class!).
Say you want to train a classifier to tell apart "cars", "bikes" and "trucks". There is little use in assuming the data consists of exactly 3 normal distributions. However, you may assume that there is more than one type of car (and of trucks and bikes). So instead of training a classifier for these three classes, you cluster cars, trucks and bikes into 10 clusters each (or maybe 10 cars, 3 trucks and 3 bikes, whatever), then train a classifier to tell these 30 classes apart, and then merge the class result back to the original classes (a sketch of this workflow follows below). You may also discover that there is one cluster that is particularly hard to classify, for example trikes: they are somewhat like cars and somewhat like bikes. Or delivery trucks, which are more like oversized cars than trucks.
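A sketch of that cluster-then-classify workflow (scikit-learn; the choice of 10 subclusters per class and logistic regression as the classifier are arbitrary placeholders, and each class is assumed to have at least n_sub samples):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

def cluster_then_classify(X, y, n_sub=10):
    """Cluster each class into subclusters, train on the fine labels,
    then merge predictions back to the original classes."""
    fine = np.empty(len(y), dtype=int)
    fine_to_class = {}
    next_id = 0
    for c in np.unique(y):
        mask = (y == c)
        # Cluster this class on its own into n_sub subclusters.
        sub = GaussianMixture(n_components=n_sub, random_state=0).fit_predict(X[mask])
        fine[mask] = sub + next_id
        for j in range(n_sub):
            fine_to_class[next_id + j] = c
        next_id += n_sub
    # Train the classifier on the fine-grained labels.
    clf = LogisticRegression(max_iter=1000).fit(X, fine)
    # Predict fine labels, then map them back to the original classes.
    pred_fine = clf.predict(X)
    return np.array([fine_to_class[f] for f in pred_fine])
```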