How to draw the plot of within-cluster sum-of-squa

Posted 2019-02-03 20:09

I have a cluster plot made in R, and I want to apply the "elbow criterion" to choose the number of clusters using a WSS (within-cluster sum of squares) plot. However, I do not know how to draw a WSS plot for a given clustering. Can anyone help?

Here is my data:

Friendly<-c(0.467,0.175,0.004,0.025,0.083,0.004,0.042,0.038,0,0.008,0.008,0.05,0.096)
Polite<-c(0.117,0.55,0,0,0.054,0.017,0.017,0.017,0,0.017,0.008,0.104,0.1)
Praising<-c(0.079,0.046,0.563,0.029,0.092,0.025,0.004,0.004,0.129,0,0,0,0.029)
Joking<-c(0.125,0.017,0.054,0.383,0.108,0.054,0.013,0.008,0.092,0.013,0.05,0.017,0.067)
Sincere<-c(0.092,0.088,0.025,0.008,0.383,0.133,0.017,0.004,0,0.063,0,0,0.188)
Serious<-c(0.033,0.021,0.054,0.013,0.2,0.358,0.017,0.004,0.025,0.004,0.142,0.021,0.108)
Hostile<-c(0.029,0.004,0,0,0.013,0.033,0.371,0.363,0.075,0.038,0.025,0.004,0.046)
Rude<-c(0,0.008,0,0.008,0.017,0.075,0.325,0.313,0.004,0.092,0.063,0.008,0.088)
Blaming<-c(0.013,0,0.088,0.038,0.046,0.046,0.029,0.038,0.646,0.029,0.004,0,0.025)
Insincere<-c(0.075,0.063,0,0.013,0.096,0.017,0.021,0,0.008,0.604,0.004,0,0.1)
Commanding<-c(0,0,0,0,0,0.233,0.046,0.029,0.004,0.004,0.538,0,0.146)
Suggesting<-c(0.038,0.15,0,0,0.083,0.058,0,0,0,0.017,0.079,0.133,0.442)
Neutral<-c(0.021,0.075,0.017,0,0.033,0.042,0.017,0,0.033,0.017,0.021,0.008,0.717)

data <- data.frame(Friendly,Polite,Praising,Joking,Sincere,Serious,Hostile,Rude,Blaming,Insincere,Commanding,Suggesting,Neutral)

And here is my code of clustering:

cor <- cor (data)
dist<-dist(cor)
hclust<-hclust(dist)
plot(hclust)

Running the code above gives me a dendrogram. How can I draw a plot like this one?

[image: example plot]

1 answer
冷血范
Reply #2 · 2019-02-03 20:43

If I follow what you want, we first need a function to compute the WSS of a single cluster:

wss <- function(d) {
  sum(scale(d, scale = FALSE)^2)
}
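
As a quick check of what wss() computes, here is a minimal sketch on a toy matrix. The names m, toy_wss and by_hand are made up for illustration and are not part of the original answer.

```r
# Same definition as in the answer: centre each column, then sum the squares.
wss <- function(d) {
  sum(scale(d, scale = FALSE)^2)
}

# Toy data (hypothetical): two columns with means 2 and 4.
m <- cbind(a = c(1, 2, 3), b = c(2, 4, 6))

# Deviations from the column means are (-1, 0, 1) and (-2, 0, 2),
# so the sum of squares is 2 + 8 = 10.
toy_wss <- wss(m)

# The same quantity computed column by column, as a cross-check.
by_hand <- sum(apply(m, 2, function(col) sum((col - mean(col))^2)))
```

Both toy_wss and by_hand should come out as 10, which is the sum of squared deviations of every value from its column mean.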

and a wrapper for this wss() function:

wrap <- function(i, hc, x) {
  cl <- cutree(hc, i)      # cluster membership when cutting into i clusters
  spl <- split(x, cl)      # one data frame per cluster
  sum(sapply(spl, wss))    # total WSS over all clusters
}

This wrapper takes the following arguments:

  • i the number of clusters to cut the data into
  • hc the hierarchical cluster analysis object
  • x the original data (a data frame, so that split() divides it by rows)

wrap then cuts the dendrogram into i clusters, splits the original data into the cluster membership given by cl and computes the WSS for each cluster. These WSS values are summed to give the WSS for that clustering.
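
The steps inside the wrapper can be followed one at a time. The sketch below does this for a single value, k = 3, on the built-in iris data. The intermediate names (membership, pieces, per_cluster, total_wss) are chosen here for readability and are not from the original answer.

```r
# Data and clustering, using the built-in iris measurements.
iris2 <- iris[, 1:4]
hc <- hclust(dist(iris2), method = "ward.D")

# WSS of one cluster: centre each column, then sum the squares.
wss <- function(d) sum(scale(d, scale = FALSE)^2)

k <- 3
membership <- cutree(hc, k)          # one integer cluster label per row
pieces <- split(iris2, membership)   # list of k data frames, one per cluster
per_cluster <- sapply(pieces, wss)   # WSS of each cluster
total_wss <- sum(per_cluster)        # WSS of the whole 3-cluster solution
```

Here total_wss is the single value that the wrapper would return for i = 3; repeating this for each i gives the curve that is plotted.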

We run all of this using sapply() over the number of clusters 1, 2, ..., nrow(data), where cl is the fitted hclust object and data is the data frame that was clustered:

res <- sapply(seq.int(1, nrow(data)), wrap, hc = cl, x = data)

A screeplot can be drawn using

plot(seq_along(res), res, type = "b", pch = 19)
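
For the data in the question, the objects being clustered are the 13 rows of the correlation matrix, since hclust() was run on dist(cor(data)). One possible reading, sketched below, is therefore to pass that matrix as x, converted to a data frame so that split() works by rows. This is an assumption about what the asker wants, and the names cor_mat and hc are introduced here to avoid masking the functions cor(), dist() and hclust().

```r
# Data copied from the question.
Friendly<-c(0.467,0.175,0.004,0.025,0.083,0.004,0.042,0.038,0,0.008,0.008,0.05,0.096)
Polite<-c(0.117,0.55,0,0,0.054,0.017,0.017,0.017,0,0.017,0.008,0.104,0.1)
Praising<-c(0.079,0.046,0.563,0.029,0.092,0.025,0.004,0.004,0.129,0,0,0,0.029)
Joking<-c(0.125,0.017,0.054,0.383,0.108,0.054,0.013,0.008,0.092,0.013,0.05,0.017,0.067)
Sincere<-c(0.092,0.088,0.025,0.008,0.383,0.133,0.017,0.004,0,0.063,0,0,0.188)
Serious<-c(0.033,0.021,0.054,0.013,0.2,0.358,0.017,0.004,0.025,0.004,0.142,0.021,0.108)
Hostile<-c(0.029,0.004,0,0,0.013,0.033,0.371,0.363,0.075,0.038,0.025,0.004,0.046)
Rude<-c(0,0.008,0,0.008,0.017,0.075,0.325,0.313,0.004,0.092,0.063,0.008,0.088)
Blaming<-c(0.013,0,0.088,0.038,0.046,0.046,0.029,0.038,0.646,0.029,0.004,0,0.025)
Insincere<-c(0.075,0.063,0,0.013,0.096,0.017,0.021,0,0.008,0.604,0.004,0,0.1)
Commanding<-c(0,0,0,0,0,0.233,0.046,0.029,0.004,0.004,0.538,0,0.146)
Suggesting<-c(0.038,0.15,0,0,0.083,0.058,0,0,0,0.017,0.079,0.133,0.442)
Neutral<-c(0.021,0.075,0.017,0,0.033,0.042,0.017,0,0.033,0.017,0.021,0.008,0.717)
data <- data.frame(Friendly,Polite,Praising,Joking,Sincere,Serious,Hostile,
                   Rude,Blaming,Insincere,Commanding,Suggesting,Neutral)

# Same clustering as in the question, with non-masking names.
cor_mat <- cor(data)
hc <- hclust(dist(cor_mat))

# WSS helpers as defined in the answer.
wss <- function(d) sum(scale(d, scale = FALSE)^2)
wrap <- function(i, hc, x) {
  cl <- cutree(hc, i)
  sum(sapply(split(x, cl), wss))
}

# Assumption: the rows of the correlation matrix are the clustered objects.
x <- as.data.frame(cor_mat)
res <- sapply(seq_len(nrow(x)), wrap, hc = hc, x = x)

plot(seq_along(res), res, type = "b", pch = 19,
     xlab = "Number of clusters", ylab = "Within-cluster sum of squares")
```

The curve ends at zero for 13 clusters, because every item is then its own cluster; the "elbow" is judged by eye from where the decrease levels off.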

Here is an example using the well-known Edgar Anderson Iris data set:

iris2 <- iris[, 1:4]  # drop Species column
cl <- hclust(dist(iris2), method = "ward.D")

## Takes a little while as we evaluate all implied clusterings up to 150 groups
res <- sapply(seq.int(1, nrow(iris2)), wrap, hc = cl, x = iris2)
plot(seq_along(res), res, type = "b", pch = 19)

This gives:

[image: WSS plotted against the number of clusters, 1 to 150, for the iris example]

We can zoom in by showing just the first 50 clusterings:

plot(seq_along(res[1:50]), res[1:50], type = "o", pch = 19)

which gives

[image: the same plot restricted to the first 50 clusterings]

You can speed up the main computation step either by running the sapply() via an appropriate parallelised alternative, or by doing the computation for fewer than nrow(data) clusters, e.g.

res <- sapply(seq.int(1, 50), wrap, hc = cl, x = iris2) ## 1st 50 groups
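
As one possible parallelised alternative, here is a minimal sketch using mclapply() from the parallel package that ships with R. The choice of mclapply(), the two worker processes and the names n_cores, res_par and res_seq are assumptions made for illustration; the original answer does not name a specific tool. mclapply() relies on forking, which is not available on Windows, so the sketch falls back to a single core there.

```r
library(parallel)

# WSS helpers as defined in the answer.
wss <- function(d) sum(scale(d, scale = FALSE)^2)
wrap <- function(i, hc, x) {
  cl <- cutree(hc, i)
  sum(sapply(split(x, cl), wss))
}

iris2 <- iris[, 1:4]
cl <- hclust(dist(iris2), method = "ward.D")

# Forking is unavailable on Windows, so use one core there (assumed setup).
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Parallel version: mclapply() returns a list, so flatten it to a vector.
res_par <- unlist(mclapply(seq_len(50), wrap, hc = cl, x = iris2,
                           mc.cores = n_cores))

# Sequential version for comparison.
res_seq <- sapply(seq_len(50), wrap, hc = cl, x = iris2)
```

Both vectors should hold the same 50 values. For a data set as small as iris the overhead of starting worker processes may outweigh any gain, so this is mainly worth trying on larger data.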