Understanding kmeans clustering in R [closed]

The code below (minus my questions) generates this graph:

[image: plot produced by the code below]

I have marked 4 areas of confusion with "->"

> m <- matrix(c(1,1,1) , ncol=3)
> 
> x <- rbind(matrix(c(1,0,1) , ncol=3),
+            matrix(c(1,1,1) , ncol=3),
+            matrix(c(1,1,0) , ncol=3),
+            matrix(c(0,1,1) , ncol=3),
+            matrix(c(0,0,1) , ncol=3),
+            matrix(c(0,0,0) , ncol=3),
+            matrix(c(1,1,1) , ncol=3),
+            matrix(c(1,1,1) , ncol=3),
+            matrix(c(1,1,0) , ncol=3),
+            matrix(c(1,0,0) , ncol=3),
+            matrix(c(0,0,1) , ncol=3),
+            matrix(c(0,0,0) , ncol=3),
+            matrix(c(0,0,1) , ncol=3),
+            matrix(c(0,1,1) , ncol=3),
+            matrix(c(1,0,1) , ncol=3),
+            matrix(c(0,1,0) , ncol=3))
> colnames(x) <- c("google", "stackoverflow", "tester")
> (cl <- kmeans(x, 3))

K-means clustering with 3 clusters of sizes 3, 10, 3
-> Where do the sizes 3, 10, 3 come from?

Cluster means:
     google stackoverflow tester
1 0.6666667           1.0      0
2 0.5000000           0.5      1
3 0.3333333           0.0      0

-> There are three clusters, but what does each number signify?

Clustering vector:
 [1] 2 2 1 2 2 3 2 2 1 3 2 3 2 2 2 1

-> This looks as if it were created by summing the values of each row, but it seems to be unordered: the second element of this vector is '2', but the second element of 'x' is matrix(c(1,1,1) , ncol=3), which sums to '3'.

Within cluster sum of squares by cluster:
[1] 0.6666667 5.0000000 0.6666667
 (between_SS / total_SS =  46.1 %)

-> What are between_SS & total_SS?

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"        
> plot(x, col = cl$cluster)
> points(cl$centers, col = 1:5, pch = 8, cex = 2)
>

Can anyone provide answers to these questions? From reading the description of this algorithm (http://en.wikipedia.org/wiki/K-means_clustering), I fail to see how R computes these values.

Tags: r k-means

1 answer

chillily

Post #2 · 2019-09-22 10:48

1. What do the cluster sizes mean?

You provided 16 records and told kmeans to find 3 clusters. It partitioned those 16 records into 3 groups: cluster 1 with 3 records, cluster 2 with 10 records, and cluster 3 with 3 records.
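As a minimal sketch of where the sizes come from, the labels below are copied from the "Clustering vector" in the question's output (the variable names `cluster` and `sizes` are introduced here for illustration only). Counting how often each label occurs should reproduce the reported sizes.

```r
# Cluster labels copied from the "Clustering vector" in the question's output
cluster <- c(2, 2, 1, 2, 2, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2, 1)

# Count how many records carry each label
sizes <- table(cluster)
print(sizes)
```

With a fitted object, the same counts are stored in `cl$size`. Note that kmeans starts from random centers, so a fresh run may number the clusters differently.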

2. What are the cluster means?

These numbers signify the location in N-dimensional space of the centroid (the "mean") of each cluster. You have three clusters, so you have three means. You have three dimensions ("google", "stackoverflow", "tester") so you get a value in each dimension. Reading the numbers across the row gives the location of a single centroid; each coordinate is the average of that column over the records assigned to the cluster.
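The centroids can be recomputed by hand as a rough check. This sketch assumes the data from the question and the cluster labels printed in its output; `cluster` and `centers` are illustrative names, not part of the kmeans result.

```r
# Data from the question, one record per row
x <- matrix(c(1,0,1, 1,1,1, 1,1,0, 0,1,1, 0,0,1, 0,0,0, 1,1,1, 1,1,1,
              1,1,0, 1,0,0, 0,0,1, 0,0,0, 0,0,1, 0,1,1, 1,0,1, 0,1,0),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("google", "stackoverflow", "tester")))

# Cluster labels copied from the "Clustering vector" in the question's output
cluster <- c(2, 2, 1, 2, 2, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2, 1)

# For each cluster, average every column over the rows assigned to it
centers <- t(sapply(1:3, function(k) colMeans(x[cluster == k, , drop = FALSE])))
print(centers)
```

Each row of `centers` should agree with the corresponding row of "Cluster means" in the output, which is what `cl$centers` holds.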

3. What is the Clustering vector?

This is the cluster label the algorithm assigned to each record you passed it, in the same order as the rows of x. Remember how earlier I said there were 3 clusters of size 3, 10, and 3? These clusters are labeled 1, 2 and 3, and the algorithm stores the cluster label for each record in this vector. Here, you can see that there are 3 "1"s, 10 "2"s, and 3 "3"s. The values are labels, not row sums: the second element is 2 because the second row of x, (1,1,1), was placed in cluster 2. Does that make sense?
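To see that the vector holds labels rather than row sums, it can help to put both next to the data. This is a sketch under the assumption that the labels are those printed in the question; the data frame name `labelled` and the column `row_sum` are made up for illustration.

```r
# Data from the question, one record per row
x <- matrix(c(1,0,1, 1,1,1, 1,1,0, 0,1,1, 0,0,1, 0,0,0, 1,1,1, 1,1,1,
              1,1,0, 1,0,0, 0,0,1, 0,0,0, 0,0,1, 0,1,1, 1,0,1, 0,1,0),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("google", "stackoverflow", "tester")))

# Cluster labels copied from the "Clustering vector" in the question's output
cluster <- c(2, 2, 1, 2, 2, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2, 1)

# One line per record: its values, its row sum, and its cluster label
labelled <- data.frame(x, row_sum = rowSums(x), cluster = cluster)
print(labelled)

# Rows that ended up in cluster 1
print(which(cluster == 1))
```

The i-th label belongs to the i-th row of x, so the `row_sum` and `cluster` columns have no reason to match.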

4. What are between_SS & total_SS?

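This is notation generally used in ANOVA. total_SS is the total sum of squares: the sum of squared distances from every record to the overall mean of the data. between_SS is the part of that total explained by the clustering, i.e. total_SS minus the sum of the within-cluster sums of squares. The ratio (46.1 % here) is the share of the total variation accounted for by the differences between clusters. You might find this helpful: http://www-ist.massey.ac.nz/dstirlin/CAST/CAST/HrandBlock/randBlock7.html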
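The sums of squares can also be recomputed by hand. This is a minimal sketch, assuming the data and the cluster labels from the question's output; the variable names mirror the components `totss`, `withinss` and `betweenss` listed under "Available components", but they are computed here from scratch rather than taken from a fitted object.

```r
# Data from the question, one record per row
x <- matrix(c(1,0,1, 1,1,1, 1,1,0, 0,1,1, 0,0,1, 0,0,0, 1,1,1, 1,1,1,
              1,1,0, 1,0,0, 0,0,1, 0,0,0, 0,0,1, 0,1,1, 1,0,1, 0,1,0),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("google", "stackoverflow", "tester")))

# Cluster labels copied from the "Clustering vector" in the question's output
cluster <- c(2, 2, 1, 2, 2, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2, 1)

# Total SS: squared distances of all records from the overall column means
totss <- sum(scale(x, scale = FALSE)^2)

# Within SS: squared distances of each cluster's records from its own centroid
withinss <- sapply(1:3, function(k) {
  xk <- x[cluster == k, , drop = FALSE]
  sum(scale(xk, scale = FALSE)^2)
})

# Between SS: what is left of the total after removing the within-cluster part
betweenss <- totss - sum(withinss)

print(withinss)
print(round(100 * betweenss / totss, 1))
```

The printed values should agree with "Within cluster sum of squares by cluster" and the 46.1 % figure in the question's output.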
