Evaluating K-means accuracy

2019-01-18 16:01发布

问题:

I created a 3-dimensional random data sets with 4 defined patterns/classes in MATLAB. I applied the K-means algorithm on the data to see how well K-means can classify my samples based on created 4 patterns/classes.

I need help with the following;

  1. What function/code can I use to evaluate how well the K-means algorithm has identified the classes of my samples correctly? Assuming I set K=4 as illustrated in the image below:

  1. How can I automatically identify the number of classes (K)? Assuming the classes in my data is unknown?

My aim is to evaluate K-mean's accuracy and how changes to the data (by pre-processing) affects the algorithm’s ability to identify classes. Examples with MATLAB code would be helpful!

回答1:

One basic metric to measure how "good" your clustering in comparison to your known class labels is called purity. Now this is an example of supervised learning where you have some idea of an external metric that is a labeling of instances based on real world data.

The mathematical definition of purity is as follows:

In words what this means is, quoting from a professor at Stanford university here,

To compute purity , each cluster is assigned to the class which is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned documents and dividing by N.

A simple example would be if you had a very naive clustering that was produced via Kmeans with k=2 that looked like:

Cluster1    Label
  1           A         
  5           B
  7           B
  3           B
  2           B

Cluster2    Label
  4           A
  6           A
  8           A
  9           B

In Cluster1 there are 4 instances of label B and 1 instance of label A and Cluster2 has 3 instances with label A and 1 instance of cluster B. Now you are looking for the total purity so that would be the sum of the purities of each cluster, in this case k=2. So the purity of Cluster1 is the maximum number of instances in respect to the given labels divided by the total number of instances in Cluster1.

Therefore the purity of Cluster1 is:

4/5 = 0.80

The four comes from the fact that the label that occurs the most (B) occurs 4 times and there are 5 total instances in the cluster.

So this follows that the purity of Cluster2 is:

3/4 = 0.75

Now the total purity is just the sum of the purities which is 1.55. So what does this tell us? A cluster is considered to be "pure" if it has a purity of 1 since that indicates that all of the instances in that cluster are of the same label. This means your original label classification was pretty good and that your Kmeans did a pretty good job. The "best" purity score for an entire data set would be equal to your original K-number of clusters since that would imply that every cluster has an individual purity score of 1.

However, you need to be aware that purity is not always the best or most telling metric. For example, if you had 10 points and you chose k=10 then every cluster would have a purity of 1 and therefore an overall purity of 10 which equal k. In that instance it would be better to use different external metrics such as precision, recall, and F-measure. I would suggest looking into those if you can. And again to reiterate, this is only useful with supervised learning where you have pre-knowledge of a labeling system which I believe is the case from your question.

To answer your second question... choosing your K number of clusters is the most difficult part for Kmeans without any prior knowledge of the data. There are techniques as to mitigate the problems presented by choosing the initial K-number of clusters and centroids. Probably the most common is an algorithm called Kmeans++. I would suggest looking into that for further info.



回答2:

In addition to purity scores, consider using the following clustering metrics: Normalized Mutual Information (NMI), Variation of Information (VI) and Adjusted Rand Index (ARI). Given the predicted label assignments X and the ground truth labels Y, the NMI is defined as:

NMI(X;Y) = I(X;Y) / ((H(X)+H(Y))/2

where H(X) is the entropy and I(X;Y) is the mutual information. As the overlap between X and Y increases the NMI approaches 1. See Matlab implementation here. Variation of Information is defined as:

VI(X;Y) = H(X)+H(Y)-2I(X;Y) = H(X|Y) + H(Y|X)

Thus, VI decreases as the overlap between label assignments X and Y increases. See Matlab implementation here. Finally, the adjusted Rand index is defined as:

ARI = RI-E[RI] / (max RI - E[RI])
RI = TP + TN / (TP + FP + FN + TN)

Thus, ARI approaches 1 for cluster assignments that are similar to each other. See Python implementation here.

If you are interested in choosing the number of clusters K automatically based on data, consider using Dirichlet Process (DP) K-means. See paper and code for more information.