I wonder if Triangle inequality is necessary for the distance measure used in kmeans.
问题:
回答1:
k-means is designed for Euclidean distance, which happens to satisfy triangle inequality.
Using other distance functions is risky, as it may stop converging. The reason however is not the triangle inequality, but the mean might not minimize the distance function. (The arithmetic mean minimizes the sum-of-squares, not arbitrary distances!)
There are faster methods for k-means that exploit the triangle inequality to avoid recomputations. But if you stick to classic MacQueen or Lloyd k-means, then you do not need the triangle inequality.
Just be careful with using other distance functions to not run into an infinite loop. You need to prove that the mean minimizes your distances to the cluster centers. If you cannot prove this, it may fail to converge, as the objective function no longer decreases monotonically! So you really should try to prove convergence for your distance function!
回答2:
Well, classical kmeans is defined on Euclidean space with L2 distance, so you get triangle inequality automatically from that (triangle inequality is part of how a distance/metric is defined). If you are using a non-euclidean metric, you would need to define what is the meaning of the "mean", amongst other things.
If you don't have triangle inequality, it means that two points could be very far from each other, but both can be close to a third point. You need to think how you would like to interpret this case.
Having said all that, I have in the past used average linkage hierarchical clustering with a distance measure that did not fulfill triangle inequality amongst other things, and it worked great for my needs.