I am reading about the difference between k-means clustering and k-medoid clustering.
Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm, instead of the more familiar sum of squared Euclidean distance-type metric to evaluate variance that we find with k-means. And apparently this different distance metric somehow reduces noise and outliers.
I have seen this claim but I have yet to see any good reasoning as to the mathematics behind this claim.
What makes the pairwise distance measure commonly used in k-medoid better? More exactly, how does the lack of a squared term allow k-medoids to have the desirable properties associated with the concept of taking a median?
1. K-medoid is more flexible
First of all, you can use k-medoids with any similarity measure. K-means however, may fail to converge - it really must only be used with distances that are consistent with the mean. So e.g. Absolute Pearson Correlation must not be used with k-means, but it works well with k-medoids.
2. Robustness of medoid
Secondly, the medoid as used by k-medoids is roughly comparable to the median (in fact, there also is k-medians, which is like K-means but for Manhattan distance). If you look up literature on the median, you will see plenty of explanations and examples why the median is more robust to outliers than the arithmetic mean. Essentially, these explanations and examples will also hold for the medoid. It is a more robust estimate of a representative point than the mean as used in k-means.
Consider this 1-dimensional example:
[1, 2, 3, 4, 100000]
Both the median and medoid of this set are 3. The mean is 20002.
Which do you think is more representative of the data set? The mean has the lower squared error, but assuming that there might be a measurement error in this data set ...
Technically, the notion of breakdown point is used in statistics. The median has a breakdown point of 50% (i.e. half of the data points can be incorrect, and the result is still unaffected), whereas the mean has a breakdown point of 0 (i.e. a single large observation can yield a bad estimate).
I do not have a proof, but I assume the medoid will have a similar breakdown point as the median.
3. k-medoids is much more expensive
That's the main drawback. Usually, PAM takes much longer to run than k-means. As it involves computing all pairwise distances, it is O(n^2*k*i)
; whereas k-means runs in O(n*k*i)
where usually, k times the number of iterations is k*i << n
.
I think this has to do with the selection of the center for the cluster. k-means will select the "center" of the cluster, while k-medoid will select the "most centered" member of the cluster.
In a cluster with outliers (i.e. points far away from the other members of the cluster) k-means will place the center of the cluster towards the outliers, whereas k-medoid will select one of the more clustered members (the medoid) as the center.
It now depends on what you use clustering for. If you just wanted to classify a bunch of objects then you don't really care about where the center is; but if the clustering was used to train a decider which will now classify new objects based on those center points, then k-medoid will give you a center closer to where a human would place the center.
In wikipedia's words:
"It [k-medoid] is more robust to noise and outliers as compared to k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances."
Here's an example:
Suppose you want to cluster on one dimension with k=2. One cluster has most of its members around 1000 and the other around -1000; but there is an outlier (or noise) at 100000.
It obviously belongs to the cluster around 1000 but k-means will put the center point away from 1000 and towards 100000. This may even make some of the members of the 1000 cluster (say a member with value 500) to be assigned to the -1000 cluster.
k-medoid will select one of the members around 1000 as the medoid, it'll probably select one that is bigger than 1000, but it will not select an outlier.
Just a tiny note added to @Eli's answer, K-medoid is more robust to noise and outliers than k-means because the latter selects the cluster center, which is mostly just a "virtue point", on the other hand the former chooses the "actual object" from the cluster.
Suppose you have five 2D points in one cluster with the coordinates of (1,1),(1,2),(2,1),(2,2), and (100,100). If we don't consider the object exchanges among the clusters, with k-means you will get the center of cluster (21.2,21.2) which is pretty distracted by the point (100,100). However, with k-medoid will choose the center among (1,1),(1,2),(2,1),and (2,2) according to its algorithm.
Here is a fun applet ( E.M. Mirkes, K-means and K-medoids applet. University of Leicester, 2011 ) that you can randomly generate dataset in the 2D plane and compare k-medoid and k-means learning process.