Is there an easy way to calculate the lowest value of h in cut that produces groupings of a given minimum size?
In this example, if I wanted clusters with at least ten members each, I should go with h = 3.80:
# using iris data simply for reproducible example
data(iris)
d <- data.frame(scale(iris[,1:4]))
hc <- hclust(dist(d))
plot(hc)
cut(as.dendrogram(hc), h=3.79) # produces 5 groups; group 4 has 7 members
cut(as.dendrogram(hc), h=3.80) # produces 4 groups; no group has <10 members
Since the heights of the splits are given in hc$height, I could create a set of candidate values using hc$height + 0.00001 and then loop through cuts at each of them. However, I don't see how to parse the cluster sizes (members) out of the dendrogram class. For example, cut(as.dendrogram(hc), h=3.80)$lower[[1]]$members returns NULL, not 66 as desired.
Please note that this is a simpler question than "Cutting dendrogram into n trees with minimum cluster size in R", which uses the package dynamicTreeCut; here I am not specifying the number of trees, just the minimum cluster size. TYVM.
This feature is available in the dendextend package with the heights_per_k.dendrogram function (which also has a faster C++ implementation when the dendextendRcpp package is loaded).
As a side note, the dendextend package has a cutree.dendrogram S3 method for dendrograms (which works very similarly to cutree for hclust objects).
This doesn't answer the question, but might be useful for members extraction if you decide to loop through the h values.
Stealing and modifying some code from here
Output:
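For the members extraction mentioned above, here is a minimal sketch that assumes the sizes can be read from the "members" attribute that R stores on each dendrogram node. The names branches and sizes are illustrative only and do not come from the original answer.

```r
# Same reproducible setup as in the question
data(iris)
d  <- data.frame(scale(iris[, 1:4]))
hc <- hclust(dist(d))

# cut() on a dendrogram returns a list with $upper and $lower;
# $lower is a list of sub-dendrograms, one per branch below the cut.
branches <- cut(as.dendrogram(hc), h = 3.80)$lower

# Node metadata is stored as attributes rather than list elements,
# so the leaf count is read with attr(node, "members").
sizes <- sapply(branches, attr, "members")
sizes
```

Because every observation ends up in exactly one lower branch, the sizes should sum to the number of rows in the data.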
Thanks to @Vlo and @lukeA I'm able to implement a loop. However, I am just posting this for a starting point and certainly open to a more elegant solution.
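A minimal sketch of such a search follows. It is not the poster's original code: it swaps cut on a dendrogram for cutree on the hclust object, and min_cut_height is a hypothetical helper name. It relies on the fact that merging clusters can never shrink the smallest cluster, so the lowest valid cut corresponds to the largest number of clusters k that still meets the size requirement.

```r
# Same reproducible setup as in the question
data(iris)
d  <- data.frame(scale(iris[, 1:4]))
hc <- hclust(dist(d))

# Hypothetical helper: lowest merge height at which every cluster
# has at least `min_size` members.
min_cut_height <- function(hc, min_size) {
  n <- length(hc$order)                 # number of observations
  stopifnot(min_size >= 1, min_size <= n)
  # Column j holds the cluster labels obtained with k = j clusters.
  memb <- cutree(hc, k = seq_len(n))
  # Size of the smallest cluster for each k.
  smallest <- apply(memb, 2, function(x) min(tabulate(x)))
  # Largest k (that is, the lowest cut) that still satisfies min_size.
  k <- max(which(smallest >= min_size))
  # k clusters exist once n - k merges have happened, so the height of
  # merge n - k is the lowest usable cut (0 if no merge is needed).
  h <- if (k < n) hc$height[n - k] else 0
  list(h = h, k = k, sizes = table(memb[, k]))
}

res <- min_cut_height(hc, min_size = 10)
res$h      # lowest cut height meeting the size requirement
res$sizes  # cluster sizes at that cut
```

Calling cutree(hc, h = res$h) should then give a grouping with no cluster smaller than min_size, and the returned height should be consistent with the 3.79/3.80 boundary reported in the question.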