Cutting dendrogram into n trees with minimum clust

Posted 2020-06-18 06:38

Question:

I'm trying to use hierarchical clustering (specifically hclust) to cluster a data set into 10 groups with sizes of 100 members or fewer, and with no group having more than 40% of the total population. The only method I currently know is to repeatedly use cut() and select progressively lower values of h until I'm happy with the dispersion of the cuts. However, this forces me to then go back and re-cluster the groups I pruned to aggregate them into 100-member groups, which can be very time-consuming.

I've experimented with the dynamicTreeCut package, but can't figure out how to enter these (relatively simple) limitations. I'm using deepSplit as the way to designate the number of groupings, but following the documentation, this limits the maximum number to 4. For the exercise below, all I'm looking to do is to get the clusters into 5 groups of 3 or more individuals (I can deal with the maximum size limitation on my own, but if you want to try to tackle this too, it would be helpful!).

Here's my example, using the Orange dataset.

library(dynamicTreeCut)
library(reshape2)

##creating 14 individuals from Orange's original 5
Orange1<-Orange
Orange1$Tree<-as.numeric(as.character(Orange1$Tree))
Orange2<-Orange1
Orange3<-Orange1
Orange2$Tree=Orange2$Tree+6
Orange3$Tree=Orange3$Tree+11
combOr<-rbind(Orange1, Orange2[1:28,], Orange3)


####casting the data to make a correlation matrix, and then running 
#### a hierarchical cluster
castOrange<-dcast(combOr, age~Tree, mean, fill=0)
castOrange[,16]<-c(1,34,5,35,34,35,21)
castOrange[,17]<-c(1,34,5,35,34,35,21)
orangeCorr<-cor(castOrange[, -1])
orangeClust<-hclust(dist(orangeCorr))

###running the dynamic tree cut
dynamicCut<-cutreeDynamic(orangeClust, minClusterSize=3, method="tree", deepSplit=4)

dynamicCut
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0

As you can see, it only designates two clusters. For my exercise, I want to shy away from using an explicit height term to cut the trees, as I want a k number of trees instead.

Answer 1:

1- Figure out the most appropriate dissimilarity measure (e.g., "euclidean", "maximum", "manhattan", "canberra", "binary", or "minkowski") and linkage method (e.g., "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", or "centroid"; older versions of R called the Ward option simply "ward") based on the nature of your data and the objective(s) of clustering. See ?dist and ?hclust for more details.

2- Plot the dendrogram before starting the cutting step. See ?hclust for more details.

3- Use the hybrid adaptive tree cut method in dynamicTreeCut package, and tune the shape parameters (maxCoreScatter and minGap / maxAbsCoreScatter and minAbsGap). See Langfelder et al. 2009 (http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/BranchCutting/Supplement.pdf).


For your example,

1- Change "euclidean" and/or "complete" methods as appropriate,

orangeClust <- hclust(dist(orangeCorr, method="euclidean"), method="complete")

2- Plot the dendrogram,

plot(orangeClust)

3- Use the hybrid tree cut method and tune shape parameters,

dynamicCut <- cutreeDynamic(orangeClust, minClusterSize=3, method="hybrid", distM=as.matrix(dist(orangeCorr, method="euclidean")), deepSplit=4, maxCoreScatter=NULL, minGap=NULL, maxAbsCoreScatter=NULL, minAbsGap=NULL)
dynamicCut
 ..cutHeight not given, setting it to 1.8  ===>  99% of the (truncated) height range in dendro.
 ..done.
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0

As a guide for tuning the shape parameters, the default values are

deepSplit=0: maxCoreScatter = 0.64 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=1: maxCoreScatter = 0.73 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=2: maxCoreScatter = 0.82 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=3: maxCoreScatter = 0.91 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=4: maxCoreScatter = 0.95 & minGap = (1 - maxCoreScatter) * 3/4

As you can see, both maxCoreScatter and minGap should be between 0 and 1. Increasing maxCoreScatter (or decreasing minGap) yields more clusters, each of smaller size. The meaning of these parameters is described in Langfelder et al. 2009.

For example, to get more, smaller clusters:
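
The default pairs listed above can be tabulated with a few lines of base R, which makes it easier to see how quickly minGap shrinks as maxCoreScatter approaches 1. This is only a sketch that reproduces the arithmetic of the table; the variable names are illustrative and nothing here calls dynamicTreeCut itself.

```r
# Default shape parameters for method = "hybrid", copied from the table above
deepSplit      <- 0:4
maxCoreScatter <- c(0.64, 0.73, 0.82, 0.91, 0.95)

# minGap follows from maxCoreScatter by the rule quoted in the table
minGap <- (1 - maxCoreScatter) * 3/4

defaults <- data.frame(deepSplit, maxCoreScatter, minGap)
print(defaults)

# A more aggressive setting than any default, using the same rule
customScatter <- 0.99
customGap     <- (1 - customScatter) * 3/4
print(customGap)
```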

maxCoreScatter <- 0.99
minGap <- (1 - maxCoreScatter) * 3/4
dynamicCut <- cutreeDynamic(orangeClust, minClusterSize=3, method="hybrid", distM=as.matrix(dist(orangeCorr, method="euclidean")), deepSplit=4, maxCoreScatter=maxCoreScatter, minGap=minGap, maxAbsCoreScatter=NULL, minAbsGap=NULL)
dynamicCut
 ..cutHeight not given, setting it to 1.8  ===>  99% of the (truncated) height range in dendro.
 ..done.
 2 3 2 2 2 3 3 2 2 3 3 2 2 2 1 2 1 1 1 2 2 1 1 2 2 1 1 1 0 0
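
To judge whether a cut like this meets the size limits from the question, it helps to tabulate the labels. The sketch below copies the label vector from the printed output above and assumes, as dynamicTreeCut documents, that label 0 marks unassigned objects. The thresholds (at least 3 members, at most 40% of all objects) come from the question; the variable names are illustrative.

```r
# Labels copied from the printed output above; 0 marks unassigned objects
labels <- c(2, 3, 2, 2, 2, 3, 3, 2, 2, 3, 3, 2, 2, 2, 1,
            2, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 0, 0)

# Sizes of the assigned clusters only
sizes <- table(labels[labels != 0])
print(sizes)

# Constraint 1: every cluster has at least 3 members
min_size_ok <- all(sizes >= 3)

# Constraint 2: no cluster holds more than 40% of the total population
max_share <- max(sizes) / length(labels)
share_ok  <- max_share <= 0.40

print(c(min_size_ok = min_size_ok, share_ok = share_ok))
```

On these labels the minimum-size check passes, but the largest cluster still holds 14 of the 30 objects, so the 40% limit is not yet met and further tuning would be needed.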

Finally, your clustering constraints (size, height, number, etc.) should be reasonable and interpretable, and the generated clusters should agree with the data. This leads to the important step of cluster validation and interpretation.
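
If a fixed number of groups matters more than the shape criteria, one alternative (not the dynamic tree cut method described above) is base R's cutree(), which accepts k directly instead of a height. The sketch below uses the built-in USArrests data as a stand-in for the real data set, and min_size is a hypothetical threshold; it only reports which groups would violate the size limit and does not repair them.

```r
# Built-in example data, standing in for the real data set (assumption)
hc <- hclust(dist(scale(USArrests)), method = "complete")

# Ask for a fixed number of groups rather than a cut height
k      <- 5
groups <- cutree(hc, k = k)

# Group sizes, to compare against the size constraints
sizes <- table(groups)
print(sizes)

# Flag groups below a hypothetical minimum size
min_size  <- 3
too_small <- names(sizes)[sizes < min_size]
print(too_small)
```

Whether the resulting groups are sensible still has to be checked against the data, as noted above.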


Good Luck!