K-Means Algorithm [duplicate]

2019-01-26 01:31发布

问题:

Possible Duplicates:
How to optimal K in K - Means Algorithm
How do I determine k when using k-means clustering?

Depending on the statistical measures can we decide on the K. Like Standard Deviation, Mean, Variance etc., Or

Is there any simple method to choose the K in K-means Algorithm?

Thanks in advance Navin

回答1:

If you explicitly want to use k-means you could study the article describing x-means. When using an implementation of x-means the only difference compared to k-means, is that rather than specifying a single k, you specify a range for k. The "best" choice, wrt. some measure, in the range will be part of the output from x-means. You can also look into the Mean Shift clustering algorithm.

If it is computationally feasible with your given data (possibly using sampling as yura suggests), you could do clustering with various k's and evalute the quality of the resulting clusters using some of the standard cluster validity measures. Some of the classic measures are described here: measures.

@doug It is not correct that k-means++ determines an optimal k for the number of clusters before cluster assignments start. k-means++ differs from k-means only by instead of randomly choosing the initial k centroids, it chooses one initial centroid randomly and successively chooses centers until k has been chosen. After the initial completely random choice, data points are chosen as a new centroid with a probability that is determined by a potential function which depends on the datapoint's distance to the already chosen centers. The standard reference for k-means++ is k-means++: The Advantages of Careful Seeding by Arthur and Vassilvitskii.

Also, I don't think that in general choosing k to be the number of principal components will improve your clustering. Imagine data points in three-dimensional space all lying in a plane passing through the origo. You will then get 2 principal components, but the "natural" clustering of the points could have any number of clusters.



回答2:

Unfortunately not. There isn't a principled statistical method, simple or complex that can set the "right K". There are heuristics, rules of thumb that sometimes work, sometimes don't.

The situation is more general as many clustering methods have these type of parameters.



回答3:

Well there are two practical solutions to the the problem of intelligent selection of the number of centroids (k) in common use.

The first is to PCA your data, and the output from PCA--which is the principal components (eigenvectors) and their cumulate contribution to the variation observed in the data--obviously suggests an optimal number of centroids. (E.g., if 95% of the variability in your data is explained by the first three principal components, then k=3 is a wise choice for k-means.)

The second commonly used practical solution to intelligently estimate k is is a revised implementation of the k-means algorithm, called k-means++. In essence, k-means++ just differs from the original k-means by the additional of a pre-processing step. During this step, the number and initial position of the centroids and estimated.

The algorithm that k-means++ relies on to do this is straightforward to understand and to implement in code. A good source for both is a 2007 Post in the LingPipe Blog, which offers an excellent explanation of k-means++ as well as includes a citation to the original paper that first introduced this technique.

Aside from providing an optimal choice for k, k-means++ is apparently superior to the original k-means in both performance (roughly 1/2 processing time compared with k-means in one published comparison) and accuracy (three orders of magnitude improvement in error in the same comparison study).



回答4:

Bayesian k-means may be a solution when you don't know the number of clusters. There's a related paper given in the website and the corresponding MATLAB code is also given.



回答5:

The best solution for unkown(by statistical paramters model etc) ML problem is to sample data and find parameters thet best for sub problem, then use them on full problem. In that case select best K for 5% of data.