Simplified example of what I'm trying to do:
Let's say I have 3 data points A, B, and C. I run KMeans clustering on this data and get 2 clusters: [(A,B),(C)]. Then I run MeanShift clustering on the same data and get 2 clusters: [(A),(B,C)]. So clearly the two clustering methods have clustered the data in different ways. I want to be able to quantify this difference. In other words, what metric can I use to determine the percent similarity/overlap between the two cluster groupings obtained from the two algorithms? Here is a range of scores that might be given:
- 100% score for [(A,B),(C)] vs. [(A,B),(C)]
- ~50% score for [(A,B),(C)] vs. [(A),(B,C)]
- ~20% score for [(A,B),(C)] vs. [(A,B,C)]
These scores are a bit arbitrary because I'm not sure how to measure similarity between two different cluster groupings. Keep in mind that this is a simplified example; in real applications you can have many data points and more than 2 clusters per grouping. Such a metric would also be useful for comparing a cluster grouping against a labeled grouping (when you have labeled data).
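For reference, this is roughly how I'm getting the groupings. The coordinates and the MeanShift bandwidth below are made up just to keep the example runnable; the actual clusters each algorithm returns depend on the real data:

```python
import numpy as np
from sklearn.cluster import KMeans, MeanShift

# Made-up 1-D coordinates for the points A, B, C (illustration only).
X = np.array([[1.0], [1.5], [5.0]])

# Each clusterer returns one integer label per data point, so e.g.
# [0, 0, 1] encodes the grouping [(A,B),(C)].
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
meanshift_labels = MeanShift(bandwidth=2.0).fit_predict(X)

print(kmeans_labels)
print(meanshift_labels)
```

One wrinkle this representation makes obvious: the label numbers themselves are arbitrary ([0,0,1] and [1,1,0] describe the same grouping), so whatever metric I use has to be invariant to permuting the labels within either grouping.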
Edit: One idea I have is to take every cluster in the first grouping and compute its percent overlap with every cluster in the second grouping. This would give you a similarity matrix of the clusters in the first grouping against the clusters in the second grouping. But then I'm not sure what you would do with this matrix. Maybe take the highest similarity score in each row or column and combine those somehow?
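To make that concrete, here is a rough sketch of the idea, assuming each grouping is given as an array of per-point labels as above. I used intersection-over-union as the "percent overlap", but that's just one choice (intersection over the smaller cluster's size would be another):

```python
import numpy as np

def overlap_matrix(labels_a, labels_b):
    """Overlap between every cluster in grouping A and every cluster
    in grouping B, as |intersection| / |union| (Jaccard).

    labels_a, labels_b: 1-D arrays of cluster labels, one per data
    point, in the same point order for both groupings.
    """
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    clusters_a = np.unique(labels_a)
    clusters_b = np.unique(labels_b)
    M = np.zeros((len(clusters_a), len(clusters_b)))
    for i, ca in enumerate(clusters_a):
        members_a = set(np.flatnonzero(labels_a == ca))
        for j, cb in enumerate(clusters_b):
            members_b = set(np.flatnonzero(labels_b == cb))
            M[i, j] = len(members_a & members_b) / len(members_a | members_b)
    return M

# [(A,B),(C)] vs. [(A),(B,C)] with points ordered A, B, C:
M = overlap_matrix([0, 0, 1], [0, 1, 1])
print(M)                     # rows = clusters of grouping 1, cols = grouping 2
print(M.max(axis=1).mean())  # "best match per row, averaged" -> 0.5 here
```

Interestingly, averaging the best match per row gives exactly 0.5 for the [(A,B),(C)] vs. [(A),(B,C)] case, in line with the ~50% above, and 1.0 for identical groupings. But I haven't checked whether it behaves sensibly in general, and it isn't symmetric in the two groupings unless you also average over the columns.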