On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation, but I am not sure I understand how the distortion, as they call it, is calculated.
More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph.
Assuming that I have the following points with their associated centroids, what is a good way of calculating this measure?
import numpy
from scipy.cluster.vq import kmeans

points = numpy.array([[ 0,  0],
                      [ 0,  1],
                      [ 0, -1],
                      [ 1,  0],
                      [-1,  0],
                      [ 9,  9],
                      [ 9, 10],
                      [ 9,  8],
                      [10,  9],
                      [10,  8]])

kmeans(points, 2)
(array([[9, 8],
        [0, 0]]), 0.9414213562373096)
I am specifically looking at computing the 0.94... measure given just the points and the centroids. I am not sure whether any of the built-in methods of scipy can be used or whether I have to write my own. Any suggestions on how to do this efficiently for a large number of points?
In short, my questions (all related) are the following:
- Given a distance matrix and a mapping of which point belongs to which cluster, what is a good way of computing a measure that can be used to draw the elbow plot?
- How would the methodology change if a different distance function such as cosine similarity is used?
EDIT 2: Distortion
from scipy.spatial.distance import cdist
D = cdist(points, centroids, 'euclidean')
sum(numpy.min(D, axis=1))
The output for the first set of points is accurate. However, when I try a different set:
>>> pp = numpy.array([[1,2], [2,1], [2,2], [1,3], [6,7], [6,5], [7,8], [8,8]])
>>> kmeans(pp, 2)
(array([[6, 7],
[1, 2]]), 1.1330618877807475)
>>> centroids = numpy.array([[6,7], [1,2]])
>>> D = cdist(pp, centroids, 'euclidean')
>>> sum(numpy.min(D, axis=1))
9.0644951022459797
I guess the last value does not match because kmeans seems to be dividing the value by the total number of points in the dataset.
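Checking that guess against the pp example above (dividing the summed minimum distances by the number of points):

>>> numpy.min(D, axis=1).sum() / len(pp)
1.1330618877807475

which matches the value returned by kmeans(pp, 2).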
EDIT 1: Percent Variance
My code so far (should be added to Denis's K-means implementation):
centres, xtoc, dist = kmeanssample( points, 2, nsample=2,
        delta=kmdelta, maxiter=kmiter, metric=metric, verbose=0 )

print "Unique clusters: ", set(xtoc)
print ""
cluster_vars = []
for cluster in set(xtoc):
    print "Cluster: ", cluster

    truthcondition = ([x == cluster for x in xtoc])
    distances_inside_cluster = (truthcondition * dist)

    indices = [i for i, x in enumerate(truthcondition) if x == True]
    final_distances = [distances_inside_cluster[k] for k in indices]

    print final_distances
    print np.array(final_distances).var()
    cluster_vars.append(np.array(final_distances).var())
    print ""

print "Sum of variances: ", sum(cluster_vars)
print "Total Variance: ", points.var()
print "Percent: ", (100 * sum(cluster_vars) / points.var())
And the following is the output for k=2:
Unique clusters: set([0, 1])

Cluster: 0
[1.0, 2.0, 0.0, 1.4142135623730951, 1.0]
0.427451660041

Cluster: 1
[0.0, 1.0, 1.0, 1.0, 1.0]
0.16

Sum of variances: 0.587451660041
Total Variance: 21.1475
Percent: 2.77787757437
On my real dataset (does not look right to me!):
Sum of variances: 0.0188124746402
Total Variance: 0.00313754329764
Percent: 599.592510943

Unique clusters: set([0, 1, 2, 3])
Sum of variances: 0.0255808508714
Total Variance: 0.00313754329764
Percent: 815.314672809

Unique clusters: set([0, 1, 2, 3, 4])
Sum of variances: 0.0588210052519
Total Variance: 0.00313754329764
Percent: 1874.74720416

Unique clusters: set([0, 1, 2, 3, 4, 5])
Sum of variances: 0.0672406353655
Total Variance: 0.00313754329764
Percent: 2143.09824556

Unique clusters: set([0, 1, 2, 3, 4, 5, 6])
Sum of variances: 0.0646291452839
Total Variance: 0.00313754329764
Percent: 2059.86465055

Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7])
Sum of variances: 0.0817517362176
Total Variance: 0.00313754329764
Percent: 2605.5970695

Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8])
Sum of variances: 0.0912820650486
Total Variance: 0.00313754329764
Percent: 2909.34837831

Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Sum of variances: 0.102119601368
Total Variance: 0.00313754329764
Percent: 3254.76309585

Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Sum of variances: 0.125549475536
Total Variance: 0.00313754329764
Percent: 4001.52168834

Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
Sum of variances: 0.138469402779
Total Variance: 0.00313754329764
Percent: 4413.30651542

Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
The distortion, as far as k-means is concerned, is used as a stopping criterion: if the change between two iterations is less than some threshold, we assume convergence.

If you want to calculate it from a set of points and the centroids, you can do the following (the code is in MATLAB, using the pdist2 function, but it should be straightforward to rewrite in Python/NumPy/SciPy).
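For instance, a rough NumPy/SciPy sketch of that computation (assuming Euclidean distance, which is what scipy.cluster.vq.kmeans uses; its distortion is the mean distance from each point to its nearest centroid) could be:

import numpy as np
from scipy.spatial.distance import cdist

def distortion(points, centroids, metric='euclidean'):
    # distance from every point to every centroid: shape (n_points, n_centroids)
    D = cdist(points, centroids, metric)
    # mean distance from each point to its nearest centroid
    return np.min(D, axis=1).mean()

With the points and centroids from the first example in the question, this should give back the 0.9414213562373096 reported by kmeans.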
EDIT#1:
I had some time to play around with this. Here is an example of k-means clustering applied to the Fisher Iris dataset (4 features, 150 instances). We iterate over k=1..10, plot the elbow curve, pick K=3 as the number of clusters, and show a scatter plot of the result.

Note that I included a number of ways to compute the within-cluster variances (distortions), given the points and the centroids. The scipy.cluster.vq.kmeans function returns this measure by default (computed with Euclidean as the distance measure). You can also use the scipy.spatial.distance.cdist function to calculate the distances with the function of your choice (provided you obtained the cluster centroids using the same distance measure: @Denis has a solution for that), then compute the distortion from that.
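A rough Python sketch of that k=1..10 loop, using scipy's kmeans distortion directly (loading Iris through scikit-learn here is my own choice; any (n_samples, n_features) array works):

import numpy as np
from scipy.cluster.vq import kmeans, whiten
from sklearn.datasets import load_iris

X = whiten(load_iris().data)   # 150 x 4; whiten() rescales each feature to unit variance

ks = range(1, 11)
distortions = [kmeans(X, k)[1] for k in ks]   # kmeans returns (centroids, distortion)
# plot ks against distortions (e.g. with matplotlib) and look for the elbow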
EDIT#2:

In response to the comments, I give below another complete example using the NIST hand-written digits dataset: it has 1797 images of digits from 0 to 9, each of size 8-by-8 pixels. I repeat the experiment above, slightly modified: Principal Component Analysis is applied to reduce the dimensionality from 64 down to 2.
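A rough Python sketch of the described pipeline (the specific functions here, load_digits, PCA and KMeans from scikit-learn, are my own choice):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

digits = load_digits()                                  # 1797 samples, 64 features (8x8 images)
X2 = PCA(n_components=2).fit_transform(digits.data)    # reduce 64 dimensions down to 2
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X2)
# km.labels_ assigns each digit image to one of the 10 clusters;
# a scatter plot of X2 coloured by km.labels_ gives the picture described below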
You can see how some clusters actually correspond to distinguishable digits, while others don't match a single number.
Note: An implementation of k-means is included in scikit-learn (as well as many other clustering algorithms and various clustering metrics). Here is another similar example.
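For instance, a minimal elbow-style sketch with sklearn.cluster.KMeans, whose inertia_ attribute is the within-cluster sum of squared distances:

from sklearn.cluster import KMeans

# X2 is any (n_samples, n_features) array, e.g. the PCA-reduced digits above
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X2).inertia_
            for k in range(1, 11)]
# plot range(1, 11) against inertias and look for the elbow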
A simple cluster measure:
1) draw "sunburst" rays from each point to its nearest cluster centre,
2) look at the lengths — distance( point, centre, metric=... ) — of all the rays.
For metric="sqeuclidean" and 1 cluster, the average length-squared is the total variance X.var(); for 2 clusters, it's less ... down to N clusters, where all lengths are 0. "Percent of variance explained" is 100 % minus this average.

Code for this is under is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means.
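One way to read that measure as code (a sketch under my own assumptions: X is the data array, centres the cluster centres, xtoc an integer array mapping each point to its cluster, and the 1-cluster average is used as the 100 % baseline):

import numpy as np

def pct_variance_explained(X, centres, xtoc):
    # squared length of each "ray" from a point to its assigned cluster centre
    sq_lengths = ((X - centres[xtoc]) ** 2).sum(axis=1)
    # baseline: average squared distance to the overall mean (the 1-cluster case)
    baseline = ((X - X.mean(axis=0)) ** 2).sum(axis=1).mean()
    return 100.0 * (1.0 - sq_lengths.mean() / baseline)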
Like any long list of numbers, these distances can be looked at in various ways: np.mean(), np.histogram() ... Plotting and visualization are not easy.
See also stats.stackexchange.com/questions/tagged/clustering, in particular
How to tell if data is “clustered” enough for clustering algorithms to produce meaningful results?