Question:
I've been studying k-means clustering, and one thing that's not clear is how you choose the value of k. Is it just a matter of trial and error, or is there more to it?
Answer 1:
You can maximize the Bayesian Information Criterion (BIC):
BIC(C | X) = L(X | C) - (p / 2) * log n
where L(X | C) is the log-likelihood of the dataset X according to model C, p is the number of parameters in model C, and n is the number of points in the dataset.
See \"X-means: extending K-means with efficient estimation of the number of clusters\" by Dan Pelleg and Andrew Moore in ICML 2000.
Another approach is to start with a large value for k and keep removing centroids (reducing k) until it no longer reduces the description length. See "MDL principle for robust vector quantisation" by Horst Bischof, Ales Leonardis, and Alexander Selb in Pattern Analysis and Applications, vol. 2, pp. 59-72, 1999.
Finally, you can start with one cluster, then keep splitting clusters until the points assigned to each cluster have a Gaussian distribution. In "Learning the k in k-means" (NIPS 2003), Greg Hamerly and Charles Elkan show some evidence that this works better than BIC, and that BIC does not penalize the model's complexity strongly enough.
Answer 2:
Basically, you want to find a balance between two variables: the number of clusters (k) and the average variance of the clusters. You want to keep both as small as possible, but they pull in opposite directions: as the number of clusters increases, the average variance decreases (down to the trivial case of k = n and variance = 0).
As always in data analysis, there is no single approach that works better than all others in every case. In the end, you have to use your own best judgement. For that, it helps to plot the number of clusters against the average variance (which assumes that you have already run the algorithm for several values of k), and then pick the number of clusters at the knee of the curve.
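A minimal Python sketch of that plot (my own illustration, not part of the original answer), using scikit-learn's inertia_ (within-cluster sum of squares) as the variance measure:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)  # toy data
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, "bo-")
plt.xlabel("k (number of clusters)")
plt.ylabel("within-cluster sum of squares")
plt.show()  # pick the k at the knee of this curve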
Answer 3:
Yes, you can find the best number of clusters with the elbow method, but I found it troublesome to locate the elbow point from a script. You can look at the elbow graph and pick the elbow point yourself, but it is a lot of work to do that programmatically.
Another option is to use the silhouette method. In my case the result from the silhouette method agreed completely with the result from the elbow method in R.
Here's what I did.
#Dataset for Clustering
n = 150
g = 6
set.seed(g)
d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))),
y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
mydata<-d
#Plot 3X2 plots
par(mfrow=c(3,2))
#Plot the original dataset
plot(mydata$x, mydata$y, main="Original Dataset")
#Scree plot to determine the number of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) {
  wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
}
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit <- hclust(d, method = "ward.D2")    # "ward" was renamed; newer R expects "ward.D" or "ward.D2"
plot(fit) # display dendrogram
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")
#Silhouette analysis for determining the number of clusters
library(fpc)
library(cluster) # pam() comes from the cluster package
asw <- numeric(20)
for (k in 2:20)
  asw[k] <- pam(mydata, k)$silinfo$avg.width
k.best <- which.max(asw)
cat("silhouette-optimal number of clusters:", k.best, "\n")
plot(pam(mydata, k.best))
# K-Means Cluster Analysis
fit <- kmeans(mydata,k.best)
mydata
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata <- data.frame(mydata, clusterid=fit$cluster)
plot(mydata$x, mydata$y, col = fit$cluster, main="K-means Clustering results")
Hope it helps!!
Answer 4:
Look at this paper, "Learning the k in k-means" by Greg Hamerly and Charles Elkan. It uses a Gaussian test to determine the right number of clusters. The authors also claim that this method is better than the BIC criterion mentioned in the accepted answer.
Answer 5:
Maybe a beginner like me is looking for a code example; information on silhouette_score is available here.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

range_n_clusters = [2, 3, 4]  # clusters range you want to select
dataToFit = [[12, 23], [112, 46], [45, 23], [3, 78], [56, 12], [80, 80]]  # sample data (needs more points than the largest n_clusters)
best_clusters = 0             # best cluster number which you will get
previous_silh_avg = 0.0

for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters)
    cluster_labels = clusterer.fit_predict(dataToFit)
    silhouette_avg = silhouette_score(dataToFit, cluster_labels)
    if silhouette_avg > previous_silh_avg:
        previous_silh_avg = silhouette_avg
        best_clusters = n_clusters

# Final KMeans for best_clusters
kmeans = KMeans(n_clusters=best_clusters, random_state=0).fit(dataToFit)
Answer 6:
There is something called the rule of thumb. It says that the number of clusters can be estimated as k = (n/2)^0.5, where n is the total number of elements in your sample; for example, with n = 200 points this gives k = sqrt(100) = 10. You can check the veracity of this rule in the following paper:
http://www.ijarcsms.com/docs/paper/volume1/issue6/V1I6-0015.pdf
There is also another method called G-means, which assumes that your data follow a Gaussian (normal) distribution. It consists of increasing k until every one of your k groups follows a Gaussian distribution. It requires a fair amount of statistics, but it can be done; a rough sketch is given below the reference. Here is the source:
http://papers.nips.cc/paper/2526-learning-the-k-in-k-means.pdf
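A rough, hedged Python sketch of that G-means-style loop (my own simplification, not the paper's exact algorithm): a cluster is split in two, and the split is kept only if an Anderson-Darling test rejects normality of its points projected onto the axis joining the two child centroids.
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def g_means(X, alpha_index=3, max_k=20):
    # alpha_index=3 picks the 2.5% significance level from anderson()'s critical values
    centers = [X.mean(axis=0)]
    while len(centers) < max_k:
        kmeans = KMeans(n_clusters=len(centers), init=np.array(centers), n_init=1).fit(X)
        new_centers, split = [], False
        for j, c in enumerate(kmeans.cluster_centers_):
            pts = X[kmeans.labels_ == j]
            if len(pts) < 8:                      # too few points to test
                new_centers.append(c)
                continue
            child = KMeans(n_clusters=2, n_init=5).fit(pts)
            v = child.cluster_centers_[0] - child.cluster_centers_[1]
            proj = pts @ v / (v @ v)              # 1-D projection onto the split axis
            result = anderson(proj, dist='norm')
            if result.statistic > result.critical_values[alpha_index]:
                new_centers.extend(child.cluster_centers_)  # normality rejected: keep the split
                split = True
            else:
                new_centers.append(c)
        centers = new_centers
        if not split:
            break
    return len(centers)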
I hope this helps!
Answer 7:
First build a minimum spanning tree of your data.
Removing the K-1 most expensive edges splits the tree into K clusters,
so you can build the MST once, look at cluster spacings / metrics for various K,
and take the knee of the curve.
This works only for single-linkage clustering,
but for that it's fast and easy. Plus, MSTs make good visuals.
See for example the MST plot under the stats.stackexchange question "visualization software for clustering".
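A minimal Python sketch of that idea (mine, not the answerer's code), using SciPy's minimum-spanning-tree and connected-components routines; the function name and the toy data are just for illustration:
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_clusters(X, k):
    dist = squareform(pdist(X))                  # dense pairwise distance matrix
    mst = minimum_spanning_tree(dist).toarray()  # n x n array of MST edge weights
    if k > 1:
        edges = np.argwhere(mst > 0)             # MST edges, row-major order
        weights = mst[mst > 0]                   # matching weights, same order
        for i, j in edges[np.argsort(weights)[-(k - 1):]]:
            mst[i, j] = 0                        # cut the k-1 most expensive edges
    n_components, labels = connected_components(mst, directed=False)
    return labels                                # n_components should equal k

X = np.random.rand(100, 2)
print(mst_clusters(X, 4))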
Answer 8:
If you use MATLAB (any version since 2013b, that is), you can make use of the function evalclusters to find out what the optimal k should be for a given dataset.
This function lets you choose from among 3 clustering algorithms - kmeans, linkage and gmdistribution.
It also lets you choose from among 4 clustering evaluation criteria - CalinskiHarabasz, DaviesBouldin, gap and silhouette.
Answer 9:
I'm surprised nobody has mentioned this excellent article: http://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf
After following several other suggestions, I finally came across this article while reading this blog: https://datasciencelab.wordpress.com/2014/01/21/selection-of-k-in-k-means-clustering-reloaded/
After that I implemented it in Scala, an implementation which for my use cases provides really good results. Here's the code:
import breeze.linalg.DenseVector
import Kmeans.{Features, _}
import nak.cluster.{Kmeans => NakKmeans}
import scala.collection.immutable.IndexedSeq
import scala.collection.mutable.ListBuffer
/*
https://datasciencelab.wordpress.com/2014/01/21/selection-of-k-in-k-means-clustering-reloaded/
*/
class Kmeans(features: Features) {

  // f(K), alpha(K), dispersion and centroids for a given k (Pham et al. method)
  def fkAlphaDispersionCentroids(k: Int, dispersionOfKMinus1: Double = 0d, alphaOfKMinus1: Double = 1d): (Double, Double, Double, Features) = {
    if (1 == k || 0d == dispersionOfKMinus1) (1d, 1d, 1d, Vector.empty)
    else {
      val featureDimensions = features.headOption.map(_.size).getOrElse(1)
      val (dispersion, centroids: Features) = new NakKmeans[DenseVector[Double]](features).run(k)
      val alpha =
        if (2 == k) 1d - 3d / (4d * featureDimensions)
        else alphaOfKMinus1 + (1d - alphaOfKMinus1) / 6d
      val fk = dispersion / (alpha * dispersionOfKMinus1)
      (fk, alpha, dispersion, centroids)
    }
  }

  def fks(maxK: Int = maxK): List[(Double, Double, Double, Features)] = {
    val fadcs = ListBuffer[(Double, Double, Double, Features)](fkAlphaDispersionCentroids(1))
    var k = 2
    while (k <= maxK) {
      val (fk, alpha, dispersion, features) = fadcs(k - 2)
      fadcs += fkAlphaDispersionCentroids(k, dispersion, alpha)
      k += 1
    }
    fadcs.toList
  }

  // the k with the smallest f(K) wins; returns its dispersion and centroids
  def detK: (Double, Features) = {
    val vals = fks().minBy(_._1)
    (vals._3, vals._4)
  }
}

object Kmeans {
  val maxK = 10
  type Features = IndexedSeq[DenseVector[Double]]
}
Answer 10:
My idea is to use the silhouette coefficient to find the optimal number of clusters (K). A detailed explanation is here.
Answer 11:
Assuming you have a matrix of data called DATA, you can perform partitioning around medoids with an estimate of the number of clusters (by silhouette analysis) like this:
library(fpc)
maxk <- 20 # arbitrary here, you can set this to whatever you like
estimatedK <- pamk(dist(DATA), krange=1:maxk)$nc
Answer 12:
One possible answer is to use a metaheuristic algorithm, like a genetic algorithm, to find k. It is simple: evaluate random values of k (in some range) with a fitness function based on a measurement such as the silhouette, and pick the best k according to that fitness function. A toy sketch follows the link.
https://en.wikipedia.org/wiki/Silhouette_(clustering)
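A toy, hedged sketch of that idea in Python (my own; the population size, mutation step and other parameters are arbitrary), treating k as the genome and the mean silhouette score as the fitness function:
import random
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def fitness(X, k):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    return silhouette_score(X, labels)

def evolve_k(X, k_min=2, k_max=20, pop_size=6, generations=10):
    population = random.sample(range(k_min, k_max + 1), pop_size)
    for _ in range(generations):
        scored = sorted(population, key=lambda k: fitness(X, k), reverse=True)
        parents = scored[:pop_size // 2]                       # selection
        children = [min(k_max, max(k_min, k + random.choice([-2, -1, 1, 2])))
                    for k in parents]                          # mutation
        population = parents + children
    return max(set(population), key=lambda k: fitness(X, k))

X = np.random.rand(200, 2)  # toy data
print(evolve_k(X))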
Answer 13:
# Assumes (not shown in the answer): num_data is a pandas DataFrame of numeric
# columns, ncluster[i] holds the chosen number of clusters for column i,
# missing_num counts the NaNs per column, and count_row is the total row count.
import pandas as pd
from sklearn.cluster import KMeans

km = []
for i in range(num_data.shape[1]):
    kmeans = KMeans(n_clusters=ncluster[i])  # number of clusters taken from bandwidth theory
    ndata = num_data[[i]].dropna()
    ndata['labels'] = kmeans.fit_predict(ndata.values)
    cluster = ndata
    co = cluster.groupby(['labels'])[cluster.columns[0]].count()   # frequency
    me = cluster.groupby(['labels'])[cluster.columns[0]].median()  # median
    ma = cluster.groupby(['labels'])[cluster.columns[0]].max()     # maximum
    mi = cluster.groupby(['labels'])[cluster.columns[0]].min()     # minimum
    stat = pd.concat([mi, ma, me, co], axis=1)  # combine all summary columns
    stat['variable'] = stat.columns[1]          # keep the original column name
    stat.columns = ['Minimum', 'Maximum', 'Median', 'count', 'variable']
    l = []
    for j in range(ncluster[i]):
        n = [mi.loc[j], ma.loc[j]]
        l.append(n)
    stat['Class'] = l
    stat = stat.sort_values(['Minimum'])        # .sort() was removed from pandas
    stat = stat[['variable', 'Class', 'Minimum', 'Maximum', 'Median', 'count']]
    if missing_num.iloc[i] > 0:
        stat.loc[ncluster[i]] = 0
        if stat.iloc[ncluster[i], 5] == 0:
            stat.iloc[ncluster[i], 5] = missing_num.iloc[i]
            stat.iloc[ncluster[i], 0] = stat.iloc[0, 0]
    stat['Percentage'] = stat['count'] * 100 / count_row  # frequency percentage
    stat['Cumulative Percentage'] = stat['Percentage'].cumsum()
    km.append(stat)
cluster = pd.concat(km, axis=0)  # see the pandas documentation for more info
cluster = cluster.round({'Minimum': 2, 'Maximum': 2, 'Median': 2, 'Percentage': 2, 'Cumulative Percentage': 2})
Answer 14:
Another approach is to use Self-Organizing Maps (SOM) to find the optimal number of clusters. The SOM (Self-Organizing Map) is an unsupervised neural network methodology which needs only the input data and is used for clustering. This approach is used in a paper about customer segmentation.
The reference of the paper is:
Abdellah Amine et al., "Customer Segmentation Model in E-commerce Using Clustering Techniques and LRFM Model: The Case of Online Stores in Morocco", World Academy of Science, Engineering and Technology, International Journal of Computer and Information Engineering, Vol. 9, No. 8, 2015, pp. 1999-2010.
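The answer does not spell out how the SOM yields k; one speculative sketch (my assumption, using the MiniSom package) trains a small SOM and then picks k by clustering its codebook vectors with k-means and the silhouette score:
import numpy as np
from minisom import MiniSom
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(500, 4)  # toy data

som = MiniSom(6, 6, X.shape[1], sigma=1.0, learning_rate=0.5)
som.random_weights_init(X)
som.train_random(X, 1000)

codebook = som.get_weights().reshape(-1, X.shape[1])  # 36 prototype vectors
scores = {k: silhouette_score(codebook, KMeans(n_clusters=k, n_init=10).fit_predict(codebook))
          for k in range(2, 10)}
print(max(scores, key=scores.get))  # k with the best silhouette on the codebook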
Answer 15:
I used the solution I found here: http://efavdb.com/mean-shift/ and it worked very well for me:
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs  # the samples_generator module was removed in newer scikit-learn
import matplotlib.pyplot as plt
from itertools import cycle

#%% Generate sample data
centers = [[1, 1], [-.75, -1], [1, -1], [-3, 2]]
X, _ = make_blobs(n_samples=10000, centers=centers, cluster_std=0.6)

#%% Compute clustering with MeanShift
# The bandwidth can be automatically estimated
bandwidth = estimate_bandwidth(X, quantile=.1,
                               n_samples=500)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
n_clusters_ = labels.max() + 1

#%% Plot result
plt.figure(1)
plt.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    my_members = labels == k
    cluster_center = cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1],
             'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()