How do I determine k when using k-means clustering?

Posted 2019-01-01 16:36

I've been studying about k-means clustering, and one thing that's not clear is how you choose the value of k. Is it just a matter of trial and error, or is there more to it?

15 answers
泛滥B
#2 · 2019-01-01 16:56
# Assumes the following are already defined:
#   num_data     - DataFrame of numeric columns to profile
#   ncluster     - number of clusters chosen for each column (e.g. from a bandwidth heuristic)
#   missing_num  - count of missing values per column
#   count_row    - total number of rows in the original data
import pandas as pd
from sklearn.cluster import KMeans

km = []
for i in range(num_data.shape[1]):
    kmeans = KMeans(n_clusters=ncluster[i])           # number of clusters for this column
    ndata = num_data.iloc[:, [i]].dropna()
    ndata['labels'] = kmeans.fit_predict(ndata.values)
    cluster = ndata
    co = cluster.groupby(['labels'])[cluster.columns[0]].count()   # frequency
    me = cluster.groupby(['labels'])[cluster.columns[0]].median()  # median
    ma = cluster.groupby(['labels'])[cluster.columns[0]].max()     # maximum
    mi = cluster.groupby(['labels'])[cluster.columns[0]].min()     # minimum
    stat = pd.concat([mi, ma, me, co], axis=1)        # combine the per-cluster statistics
    stat['variable'] = stat.columns[1]                # remember which variable this is
    stat.columns = ['Minimum', 'Maximum', 'Median', 'count', 'variable']
    l = []
    for j in range(ncluster[i]):
        l.append([mi.loc[j], ma.loc[j]])              # [min, max] range of each cluster

    stat['Class'] = l
    stat = stat.sort_values('Minimum')                # DataFrame.sort() no longer exists; use sort_values
    stat = stat[['variable', 'Class', 'Minimum', 'Maximum', 'Median', 'count']]
    if missing_num.iloc[i] > 0:                       # add an extra row for missing values
        stat.loc[ncluster[i]] = 0
        if stat.iloc[ncluster[i], 5] == 0:
            stat.iloc[ncluster[i], 5] = missing_num.iloc[i]
            stat.iloc[ncluster[i], 0] = stat.iloc[0, 0]
    stat['Percentage'] = stat['count'] * 100 / count_row   # frequency as a percentage
    stat['Cumulative Percentage'] = stat['Percentage'].cumsum()
    km.append(stat)
cluster = pd.concat(km, axis=0)
cluster = cluster.round({'Minimum': 2, 'Maximum': 2, 'Median': 2,
                         'Percentage': 2, 'Cumulative Percentage': 2})
时光乱了年华
#3 · 2019-01-01 16:59

You can maximize the Bayesian Information Criterion (BIC):

BIC(C | X) = L(X | C) - (p / 2) * log n

where L(X | C) is the log-likelihood of the dataset X according to model C, p is the number of parameters in the model C, and n is the number of points in the dataset. See "X-means: extending K-means with efficient estimation of the number of clusters" by Dan Pelleg and Andrew Moore in ICML 2000.

Another approach is to start with a large value for k and keep removing centroids (reducing k) until it no longer reduces the description length. See "MDL principle for robust vector quantisation" by Horst Bischof, Ales Leonardis, and Alexander Selb in Pattern Analysis and Applications vol. 2, p. 59-72, 1999.

Finally, you can start with one cluster, then keep splitting clusters until the points assigned to each cluster have a Gaussian distribution. In "Learning the k in k-means" (NIPS 2003), Greg Hamerly and Charles Elkan show some evidence that this works better than BIC, and that BIC does not penalize the model's complexity strongly enough.

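For illustration, here is a minimal sketch of scoring KMeans fits with a BIC-style criterion, assuming spherical Gaussian clusters with a single shared variance. The kmeans_bic helper and the variance estimate are simplifications of my own, not the exact X-means procedure:

from sklearn.cluster import KMeans
import numpy as np

def kmeans_bic(X, labels, centers):
    """BIC-style score under a shared-variance spherical Gaussian model (simplified)."""
    n, d = X.shape
    k = centers.shape[0]
    # pooled within-cluster sum of squared distances
    sse = sum(np.sum((X[labels == j] - centers[j]) ** 2) for j in range(k))
    variance = sse / (d * max(n - k, 1))
    # log-likelihood of the data under the model
    log_likelihood = -0.5 * n * d * np.log(2 * np.pi * variance) - 0.5 * sse / variance
    # free parameters: k centers (k*d), k-1 mixing weights, 1 shared variance
    p = k * d + (k - 1) + 1
    return log_likelihood - 0.5 * p * np.log(n)

X = np.random.rand(200, 2)   # toy data
scores = {}
for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = kmeans_bic(X, km.labels_, km.cluster_centers_)
best_k = max(scores, key=scores.get)   # pick the k with the highest BIC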
高级女魔头
#4 · 2019-01-01 16:59

Another approach is to use Self-Organizing Maps (SOM) to find the optimal number of clusters. A SOM is an unsupervised neural network method that needs only the input data and can be used for clustering. This approach is used in a paper about customer segmentation.

The reference of the paper is

Abdellah Amine et al., Customer Segmentation Model in E-commerce Using Clustering Techniques and LRFM Model: The Case of Online Stores in Morocco, World Academy of Science, Engineering and Technology International Journal of Computer and Information Engineering Vol:9, No:8, 2015, 1999 - 2010

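As a rough illustration (not taken from the paper), one way to turn a SOM into a cluster-count estimate is to train a small map and count how many units actually attract data. This sketch assumes the third-party minisom package and a hypothetical occupancy threshold min_hits:

import numpy as np
from minisom import MiniSom   # third-party package: pip install minisom

data = np.random.rand(300, 4)          # toy data with 4 features
som = MiniSom(5, 5, data.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(data)
som.train_random(data, 1000)

# Count how many samples each unit wins; units that attract a reasonable
# share of the data give a rough estimate of the number of clusters.
hits = {}
for x in data:
    w = som.winner(x)                  # coordinates of the best-matching unit
    hits[w] = hits.get(w, 0) + 1

min_hits = 10                          # hypothetical occupancy threshold
estimated_k = sum(1 for c in hits.values() if c >= min_hits)
print("estimated number of clusters:", estimated_k)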
骚的不知所云
#5 · 2019-01-01 17:01

Basically, you want to find a balance between two variables: the number of clusters (k) and the average variance of the clusters. You want to minimize the former while also minimizing the latter. Of course, as the number of clusters increases, the average variance decreases (up to the trivial case of k=n and variance=0).

As always in data analysis, there is no one true approach that works better than all others in all cases. In the end, you have to use your own best judgement. For that, it helps to plot the number of clusters against the average variance (which assumes that you have already run the algorithm for several values of k). Then you can use the number of clusters at the knee of the curve.

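A minimal sketch of that elbow plot, using scikit-learn's inertia_ (the within-cluster sum of squares) as the variance measure:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)             # toy data

ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)       # within-cluster sum of squares

plt.plot(ks, inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('within-cluster sum of squares')
plt.title('Elbow plot: pick k at the knee of the curve')
plt.show()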
明月照影归
#6 · 2019-01-01 17:01

Maybe a beginner like me is looking for a code example. Documentation for silhouette_score is available in the scikit-learn documentation.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

range_n_clusters = [2, 3, 4]                  # range of cluster counts to try
dataToFit = [[12, 23], [112, 46], [45, 23],   # sample data (needs more points
             [18, 30], [120, 40], [50, 20]]   # than the largest k being tested)
best_clusters = 0                             # best number of clusters found
previous_silh_avg = 0.0

for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters, random_state=0)
    cluster_labels = clusterer.fit_predict(dataToFit)
    silhouette_avg = silhouette_score(dataToFit, cluster_labels)
    if silhouette_avg > previous_silh_avg:    # keep the k with the highest silhouette score
        previous_silh_avg = silhouette_avg
        best_clusters = n_clusters

# Final KMeans fit with the best number of clusters
kmeans = KMeans(n_clusters=best_clusters, random_state=0).fit(dataToFit)
梦该遗忘
#7 · 2019-01-01 17:02

First build a minimum spanning tree of your data. Removing the K-1 most expensive edges splits the tree into K clusters,
so you can build the MST once, look at cluster spacings / metrics for various K, and take the knee of the curve.

This works only for single-linkage clustering, but for that it's fast and easy. Plus, MSTs make good visuals.
See for example the MST plot under the stats.stackexchange question on visualization software for clustering.

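A minimal sketch of this idea with SciPy: build the MST once, then drop the K-1 heaviest edges and read off the connected components (the plotting and knee-finding steps are left to you):

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

X = np.random.rand(50, 2)                       # toy data
dist = squareform(pdist(X))                     # pairwise distance matrix
mst = minimum_spanning_tree(csr_matrix(dist))   # MST as a sparse matrix of edge weights

def cut_into_k_clusters(mst, k):
    """Remove the k-1 heaviest MST edges and return the component label of each point."""
    mst = mst.copy().tocoo()
    order = np.argsort(mst.data)[::-1]          # edges sorted heaviest first
    keep = np.ones(len(mst.data), dtype=bool)
    keep[order[:k - 1]] = False                 # drop the k-1 most expensive edges
    pruned = csr_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])), shape=mst.shape)
    n_components, labels = connected_components(pruned, directed=False)
    return labels

for k in range(2, 6):
    labels = cut_into_k_clusters(mst, k)
    # inspect cluster spacings / metrics here for each k and look for the knee
    print(k, np.bincount(labels))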