I've been studying k-means clustering, and one thing that's not clear is how you choose the value of k. Is it just a matter of trial and error, or is there more to it?
You can maximize the Bayesian Information Criterion (BIC):
BIC(C | X) = L(X | C) - (p / 2) * log n

where L(X | C) is the log-likelihood of the dataset X according to model C, p is the number of parameters in the model C, and n is the number of points in the dataset. See "X-means: extending K-means with efficient estimation of the number of clusters" by Dan Pelleg and Andrew Moore in ICML 2000.

Another approach is to start with a large value for k and keep removing centroids (reducing k) until it no longer reduces the description length. See "MDL principle for robust vector quantisation" by Horst Bischof, Ales Leonardis, and Alexander Selb in Pattern Analysis and Applications, vol. 2, pp. 59-72, 1999.

Finally, you can start with one cluster, then keep splitting clusters until the points assigned to each cluster have a Gaussian distribution. In "Learning the k in k-means" (NIPS 2003), Greg Hamerly and Charles Elkan show some evidence that this works better than BIC, and that BIC does not penalize the model's complexity strongly enough.
Another approach is to use Self-Organizing Maps (SOM) to find the optimal number of clusters. The SOM is an unsupervised neural network methodology that needs only the input data and can be used for clustering. This approach is used in a paper about customer segmentation.
The reference for the paper is: Abdellah Amine et al., "Customer Segmentation Model in E-commerce Using Clustering Techniques and LRFM Model: The Case of Online Stores in Morocco," World Academy of Science, Engineering and Technology, International Journal of Computer and Information Engineering, Vol. 9, No. 8, 2015, pp. 1999-2010.
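As a rough illustration of the SOM idea (not the exact procedure from the paper), here is a sketch that assumes the third-party minisom package and made-up blob data; mapping points to their best-matching units only gives a coarse hint at the grouping, not k itself, and the grid size and training length are arbitrary choices.

```python
# Rough sketch, assuming `pip install minisom`; grid size chosen arbitrarily.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

som = MiniSom(4, 4, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(X)
som.train_random(X, 1000)

# Map every point to its best-matching unit; the set of occupied units gives
# a coarse picture of how the data groups on the map.
winners = {som.winner(x) for x in X}
print("occupied SOM nodes:", len(winners))
```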
Basically, you want to find a balance between two variables: the number of clusters (k) and the average variance of the clusters. You want to minimize the former while also minimizing the latter. Of course, as the number of clusters increases, the average variance decreases (up to the trivial case of k=n and variance=0).
As always in data analysis, there is no one true approach that works better than all others in all cases. In the end, you have to use your own best judgement. For that, it helps to plot the number of clusters against the average variance (which assumes that you have already run the algorithm for several values of k). Then you can use the number of clusters at the knee of the curve.
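If it helps, here is one way such a knee plot might look in code, using scikit-learn's KMeans on made-up blob data and taking inertia_ / n as the average within-cluster variance.

```python
# Sketch of the knee/elbow plot: average within-cluster variance vs. k.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

ks = range(1, 11)
avg_variance = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    avg_variance.append(km.inertia_ / len(X))  # mean squared distance to centroid

plt.plot(list(ks), avg_variance, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("average within-cluster variance")
plt.show()  # pick k at the knee of the curve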
Maybe a beginner like me is looking for a code example; information on silhouette_score is available here.
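Something along these lines might serve as that example: a sketch using KMeans and silhouette_score from scikit-learn on made-up blob data, picking the k with the highest score (the silhouette is only defined for k >= 2).

```python
# Sketch: pick k by maximizing the silhouette score (higher is better).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

print(scores)
print("k with best silhouette:", max(scores, key=scores.get))
```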
First build a minimum spanning tree of your data. Removing the K-1 most expensive edges splits the tree into K clusters,
so you can build the MST once, look at cluster spacings / metrics for various K, and take the knee of the curve.
This works only for single-linkage clustering, but for that it's fast and easy. Plus, MSTs make good visuals.
See, for example, the MST plot under the stats.stackexchange question on visualization software for clustering.
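Here is a rough sketch of that idea using SciPy's minimum_spanning_tree and connected_components on made-up blob data; mst_clusters is just a helper name invented for the example.

```python
# Sketch: build the MST once, then for each K drop the K-1 heaviest edges
# and read off the resulting connected components as clusters.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

dist = squareform(pdist(X))                   # dense pairwise distance matrix
mst = minimum_spanning_tree(dist).toarray()   # (n, n) array, nonzero = MST edge

def mst_clusters(tree, k):
    """Remove the k-1 most expensive MST edges and label the components."""
    pruned = tree.copy()
    i, j = np.nonzero(pruned)
    if k > 1:
        order = np.argsort(pruned[i, j])      # edges sorted by weight, ascending
        heaviest = order[-(k - 1):]           # indices of the k-1 costliest edges
        pruned[i[heaviest], j[heaviest]] = 0.0
    n_comp, labels = connected_components(pruned, directed=False)
    return labels

for k in (2, 3, 4):
    labels = mst_clusters(mst, k)
    print(k, np.bincount(labels))             # cluster sizes for this K
```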