The code below (minus my questions) generates this graph:
I have marked 4 areas of confusion with "->"
> m <- matrix(c(1,1,1) , ncol=3)
>
> x <- rbind(matrix(c(1,0,1) , ncol=3),
+ matrix(c(1,1,1) , ncol=3),
+ matrix(c(1,1,0) , ncol=3),
+ matrix(c(0,1,1) , ncol=3),
+ matrix(c(0,0,1) , ncol=3),
+ matrix(c(0,0,0) , ncol=3),
+ matrix(c(1,1,1) , ncol=3),
+ matrix(c(1,1,1) , ncol=3),
+ matrix(c(1,1,0) , ncol=3),
+ matrix(c(1,0,0) , ncol=3),
+ matrix(c(0,0,1) , ncol=3),
+ matrix(c(0,0,0) , ncol=3),
+ matrix(c(0,0,1) , ncol=3),
+ matrix(c(0,1,1) , ncol=3),
+ matrix(c(1,0,1) , ncol=3),
+ matrix(c(0,1,0) , ncol=3))
> colnames(x) <- c("google", "stackoverflow", "tester")
> (cl <- kmeans(x, 3))
K-means clustering with 3 clusters of sizes 3, 10, 3
-> Where do the sizes 3, 10, 3 come from?
Cluster means:
     google stackoverflow tester
1 0.6666667           1.0      0
2 0.5000000           0.5      1
3 0.3333333           0.0      0
-> There are three clusters, but what does each number signify?
Clustering vector:
[1] 2 2 1 2 2 3 2 2 1 3 2 3 2 2 2 1
-> This looks like it was created by summing the values of each row, but it seems to be unordered: the second element of this vector is '2', yet the second row of 'x' is matrix(c(1,1,1) , ncol=3), which sums to '3'.
Within cluster sum of squares by cluster:
[1] 0.6666667 5.0000000 0.6666667
(between_SS / total_SS = 46.1 %)
-> What are between_SS & total_SS?
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size"
> plot(x, col = cl$cluster)
> points(cl$centers, col = 1:5, pch = 8, cex = 2)
>
Can anyone provide answers to these questions? From reading the description of this algorithm (http://en.wikipedia.org/wiki/K-means_clustering) I fail to see how R is computing these values.
1. What do the cluster sizes mean?
You provided 16 records and told kmeans to find 3 clusters. It clustered those 16 records into 3 groups of A: 3 records, B: 10 records and C: 3 records.
2. What are the cluster means?
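These numbers signify the location in N-dimensional space of the centroid (the "mean") of each cluster. You have three clusters, so you have three means. You have three dimensions ("google", "stackoverflow", "tester") so you get a value in each dimension. Reading the numbers across the row gives the location of a single centroid. Each value is simply the average of that column over the records in the cluster: for example, cluster 2 has 0.5 for "google" because half of its 10 records have google = 1.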
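As a rough check, the centroids can be recomputed by hand from the data and the cluster labels. This is only a sketch: the clustering vector is copied from the output printed in the question instead of rerunning kmeans (which starts from random centers and may number the clusters differently), and the names `cluster` and `centers` are just illustrative.

```r
# The 16 records from the question, one row per record.
x <- matrix(c(1,0,1, 1,1,1, 1,1,0, 0,1,1,
              0,0,1, 0,0,0, 1,1,1, 1,1,1,
              1,1,0, 1,0,0, 0,0,1, 0,0,0,
              0,0,1, 0,1,1, 1,0,1, 0,1,0),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("google", "stackoverflow", "tester")))

# Clustering vector copied from the printed output (assumed, not recomputed).
cluster <- c(2, 2, 1, 2, 2, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2, 1)

# Sum each column within each cluster, then divide by the cluster size.
# Each row of the result is one centroid.
centers <- rowsum(x, cluster) / as.vector(table(cluster))
print(centers)
```

The rows should agree with the "Cluster means" table in the question.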
3. What is the Clustering vector?
This is the cluster label the algorithm assigned to each record you passed in. Remember how earlier I said there were 3 clusters of size 3, 10, and 3? These clusters are labeled 1, 2 and 3, and the algorithm stores the cluster label for each record in this vector, in the same order as the rows of x. The labels are arbitrary identifiers, not sums of the values in each row: the second element is 2 because the second record was placed in cluster 2. Here, you can see that there are 3 "1"s, 10 "2"s, and 3 "3"s. Does that make sense?
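To see the link between the clustering vector and the sizes, you can tabulate the labels yourself. A small sketch, again using the vector copied from the question's output (with a real fit this would be `cl$cluster`); `sizes` and `members` are illustrative names.

```r
# Clustering vector copied from the printed output in the question.
cluster <- c(2, 2, 1, 2, 2, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2, 1)

# Count how many records carry each label; these counts are the cluster sizes.
sizes <- table(cluster)
print(sizes)

# Row numbers of the records that were assigned to cluster 1.
members <- which(cluster == 1)
print(members)
```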
4. What are between_SS & total_SS?
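This is notation generally used in ANOVA. total_SS is the sum of squared distances from every record to the overall mean of the data. The "within cluster sum of squares" is the same quantity computed inside each cluster, measured from that cluster's centroid. between_SS is total_SS minus the sum of the within-cluster values, so between_SS / total_SS is the share of the total variation that is accounted for by the split into clusters (46.1 % here). You might find this helpful: http://www-ist.massey.ac.nz/dstirlin/CAST/CAST/HrandBlock/randBlock7.html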
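These sums of squares can also be worked out by hand. The sketch below assumes the clustering vector printed in the question and uses a small helper, `sq_dist_to_mean`, which is a hypothetical name and not part of kmeans; with a real fit the same quantities are available as `cl$totss`, `cl$withinss` and `cl$betweenss`.

```r
# The 16 records from the question, one row per record.
x <- matrix(c(1,0,1, 1,1,1, 1,1,0, 0,1,1,
              0,0,1, 0,0,0, 1,1,1, 1,1,1,
              1,1,0, 1,0,0, 0,0,1, 0,0,0,
              0,0,1, 0,1,1, 1,0,1, 0,1,0),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("google", "stackoverflow", "tester")))

# Clustering vector copied from the printed output (assumed, not recomputed).
cluster <- c(2, 2, 1, 2, 2, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2, 1)

# Hypothetical helper: sum of squared distances of the rows of m
# from the column means of m.
sq_dist_to_mean <- function(m) sum(scale(m, center = TRUE, scale = FALSE)^2)

# Total variation: every record measured against the overall mean.
total_ss <- sq_dist_to_mean(x)

# Within-cluster variation: each cluster measured against its own centroid.
within_ss <- sapply(split(as.data.frame(x), cluster),
                    function(d) sq_dist_to_mean(as.matrix(d)))

# Whatever is not within-cluster variation is between-cluster variation.
between_ss <- total_ss - sum(within_ss)

print(within_ss)
print(round(100 * between_ss / total_ss, 1))
```

The last line should match the percentage reported in the kmeans output.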