Extract labels membership / classification from a

2019-03-11 08:52发布

问题:

I'm trying to extract a classification from a dendrogram in R that I've cut at a certain height. This is easy to do with cutree on an hclustobject, but I can't figure out how to do it on a dendrogram object.

Further, I can't just use my clusters from the original hclust, becuase (frustratingly), the numbering of the classes from cutree is different from the numbering of classes with cut.

hc <- hclust(dist(USArrests), "ave")

classification<-cutree(hc,h=70)

dend1 <- as.dendrogram(hc)
dend2 <- cut(dend1, h = 70)


str(dend2$lower[[1]]) #group 1 here is not the same as
classification[classification==1] #group 1 here

Is there a way to either get the classifications to map to each other, or alternatively to extract lower branch memberships from the dendrogram object (perhaps with some clever use of dendrapply?) in a format more like what cutree gives?

回答1:

I would propose for you to use the cutree function from the dendextend package. It includes a dendrogram method (i.e.: dendextend:::cutree.dendrogram).

You can learn more about the package from its introductory vignette.

I should add that while your function (classify) is good, there are several advantage for using cutree from dendextend:

  1. It also allows you to use a specific k (number of clusters), and not just h (a specific height).

  2. It is consistent with the result you would get from cutree on hclust (classify will not be).

  3. It will often be faster.

Here are examples for using the code:

# Toy data:
hc <- hclust(dist(USArrests), "ave")
dend1 <- as.dendrogram(hc)

# Get the package:
install.packages("dendextend")
library(dendextend)

# Get the package:
cutree(dend1,h=70) # it now works on a dendrogram
# It is like using:
dendextend:::cutree.dendrogram(dend1,h=70)

By the way, on the basis of this function, dendextend allows the user to do more cool things, like color branches/labels based on cutting the dendrogram:

dend1 <- color_branches(dend1, k = 4)
dend1 <- color_labels(dend1, k = 5)
plot(dend1)

Lastly, here is some more code for demonstrating my other points:

# This would also work with k:
cutree(dend1,k=4)

# and would give identical result as cutree on hclust:
identical(cutree(hc,h=70)  , cutree(dend1,h=70)  )
   # TRUE

# But this is not the case for classify:
identical(classify(dend1,70)   , cutree(dend1,h=70)  )
   # FALSE


install.packages("microbenchmark")
require(microbenchmark)
microbenchmark(classify = classify(dend1,70),
               cutree = cutree(dend1,h=70)  )
#    Unit: milliseconds
#        expr      min       lq   median       uq       max neval
#    classify  9.70135  9.94604 10.25400 10.87552  80.82032   100
#      cutree 37.24264 37.97642 39.23095 43.21233 141.13880   100
# 4 times faster for this tree (it will be more for larger trees)

# Although (if to be exact about it) if I force cutree.dendrogram to not go through hclust (which can happen for "weird" trees), the speed will remain similar:
microbenchmark(classify = classify(dend1,70),
               cutree = cutree(dend1,h=70, try_cutree_hclust = FALSE)  )
# Unit: milliseconds
#        expr       min        lq    median       uq      max neval
#    classify  9.683433  9.819776  9.972077 10.48497 29.73285   100
#      cutree 10.275839 10.419181 10.540126 10.66863 16.54034   100

If you are thinking of ways to improve this function, please patch it through here:

https://github.com/talgalili/dendextend/blob/master/R/cutree.dendrogram.R

I hope you, or others, will find this answer helpful.



回答2:

I ended up creating a function to do it using dendrapply. It's not elegant, but it works

classify <- function(dendrogram,height){

#mini-function to use with dendrapply to return tip labels
 members <- function(n) {
    labels<-c()
    if (is.leaf(n)) {
        a <- attributes(n)
        labels<-c(labels,a$label)
    }
    labels
 }

 dend2 <- cut(dendrogram,height) #the cut dendrogram object
 branchesvector<-c()
 membersvector<-c()

 for(i in 1:length(dend2$lower)){                             #for each lower tree resulting from the cut
  memlist <- unlist(dendrapply(dend2$lower[[i]],members))     #get the tip lables
  branchesvector <- c(branchesvector,rep(i,length(memlist)))  #add the lower tree identifier to a vector
  membersvector <- c(membersvector,memlist)                   #add the tip labels to a vector
 }
out<-as.integer(branchesvector)                               #make the output a list of named integers, to match cut() output
names(out)<-membersvector
out
}

Using the function makes it clear that the problem is that cut assigns category names alphabetically while cutree assigns branch names left to right.

hc <- hclust(dist(USArrests), "ave")
dend1 <- as.dendrogram(hc)

classify(dend1,70) #Florida 1, North Carolina 1, etc.
cutree(hc,h=70)    #Alabama 1, Arizona 1, Arkansas 1, etc.