MATLAB kMeans does not always converge to global m

2019-02-13 10:03发布

I wrote a k-Means clustering algorithm in MATLAB, and I thought I'd try it against MATLABs built in kmeans(X,k).

However, for the very easy four cluster setup (see picture), MATLAB kMeans does not always converge to the optimum solution (left) but to (right).

The one I wrote does not always do that either, but should not the built-in function be able to solve such an easy problem, always finding the optimal solution?

alt text

5条回答
男人必须洒脱
2楼-- · 2019-02-13 10:39

Although K-Means++ won't solve the problem in a single run, it tends to give better results when it is run N times (compared to running the original K-Means algorithm N times).

查看更多
Explosion°爆炸
3楼-- · 2019-02-13 10:43

I would not call it an easy problem.:) In fact, the Wikipedia article on "k-means clustering" gives a pretty gloomy picture of the computational complexity.

If you want to be free of the random restarts (dependence on the initial guess), a compromise is the "global k-means" algorithm; the paper and matlab code can be found here: http://lear.inrialpes.fr/~verbeek/software.php

查看更多
Melony?
4楼-- · 2019-02-13 10:50

As @Alexandre C. explained, the K-means algorithm depends on the initial cluster centroid positions, and there is no guarantee that it will converge to the optimal solution.

The best you can do is to repeat the experiment several times with random starting points.

MATLAB's implementation offers such an option: replicates which repeats the clustering N times and pick the one with the lowest total within-cluster point-to-centroid distances. You get also to control how the initial centroids are chosen with the start option.

In addition, MATLAB provides the choice among a number of distance measures (Euclidean, Manhattan, Cosine, ...). One neat option emptyaction allows you to control what happens when a cluster loses all its assigned member during the iterations.

But the real advantage is that it employs a two-phase algorithm: the usual assign-recompute iterations, followed by an online update phase. Be sure to read the algorithm section of the documentation page for more information.

查看更多
ら.Afraid
5楼-- · 2019-02-13 10:52

The k-means algorithm is quite sensitive to initial guess for the cluster centers. Did you try both codes with the same initial mass centers ?

The algorithm is simple, and I doubt there is much variation between your implementation and Matlab's.

查看更多
太酷不给撩
6楼-- · 2019-02-13 10:57

You will probably often be disappointed by the solution that any particular run of "the k-means algorithm" (i.e. Lloyd's algorithm) comes up with. That's because Lloyd's algorithm often gets stuck in bad local minima.

Luckily, Lloyd's is just one way of solving k-means. And there's an approach that almost always finds a better local minima.

The trick is to update the cluster assignments of the data points one at a time. You can do this efficiently by keeping a count of the number of points n assigned to each mean. So that you can re-calculate the cluster mean m after removing the point x as follows:

m_new = (n * m - x) / (n - 1)

And add x to cluster mean m using:

m_new = (n * m + x) / (n + 1)

Of course because it can't be vectorized, it's a bit painful to run in MATLAB but not too bad in other languages.

If you are really keen on getting the best possible local minima, and you don't mind using exemplar-based clustering, you should look at affinity propagation. MATLAB implementations are available on the Frey lab affinity propagation page.

查看更多
登录 后发表回答