I wrote a k-Means clustering algorithm in MATLAB, and I thought I'd try it against MATLABs built in kmeans(X,k)
.
However, for the very easy four cluster setup (see picture), MATLAB kMeans does not always converge to the optimum solution (left) but to (right).
The one I wrote does not always do that either, but should not the built-in function be able to solve such an easy problem, always finding the optimal solution?
Although K-Means++ won't solve the problem in a single run, it tends to give better results when it is run N times (compared to running the original K-Means algorithm N times).
I would not call it an easy problem.:) In fact, the Wikipedia article on "k-means clustering" gives a pretty gloomy picture of the computational complexity.
If you want to be free of the random restarts (dependence on the initial guess), a compromise is the "global k-means" algorithm; the paper and matlab code can be found here: http://lear.inrialpes.fr/~verbeek/software.php
As @Alexandre C. explained, the K-means algorithm depends on the initial cluster centroid positions, and there is no guarantee that it will converge to the optimal solution.
The best you can do is to repeat the experiment several times with random starting points.
MATLAB's implementation offers such an option:
replicates
which repeats the clustering N times and pick the one with the lowest total within-cluster point-to-centroid distances. You get also to control how the initial centroids are chosen with thestart
option.In addition, MATLAB provides the choice among a number of distance measures (Euclidean, Manhattan, Cosine, ...). One neat option
emptyaction
allows you to control what happens when a cluster loses all its assigned member during the iterations.But the real advantage is that it employs a two-phase algorithm: the usual assign-recompute iterations, followed by an online update phase. Be sure to read the algorithm section of the documentation page for more information.
The k-means algorithm is quite sensitive to initial guess for the cluster centers. Did you try both codes with the same initial mass centers ?
The algorithm is simple, and I doubt there is much variation between your implementation and Matlab's.
You will probably often be disappointed by the solution that any particular run of "the k-means algorithm" (i.e. Lloyd's algorithm) comes up with. That's because Lloyd's algorithm often gets stuck in bad local minima.
Luckily, Lloyd's is just one way of solving k-means. And there's an approach that almost always finds a better local minima.
The trick is to update the cluster assignments of the data points one at a time. You can do this efficiently by keeping a count of the number of points
n
assigned to each mean. So that you can re-calculate the cluster meanm
after removing the pointx
as follows:And add
x
to cluster meanm
using:Of course because it can't be vectorized, it's a bit painful to run in MATLAB but not too bad in other languages.
If you are really keen on getting the best possible local minima, and you don't mind using exemplar-based clustering, you should look at affinity propagation. MATLAB implementations are available on the Frey lab affinity propagation page.