Is there an easy way to calculate lowest value of h
in cut
that produces groupings of a given minimum size?
In this example, if I wanted clusters with at least ten members each, I should go with h = 3.80
:
# using iris data simply for reproducible example
data(iris)
d <- data.frame(scale(iris[,1:4]))
hc <- hclust(dist(d))
plot(hc)
cut(as.dendrogram(hc), h=3.79) # produces 5 groups; group 4 has 7 members
cut(as.dendrogram(hc), h=3.80) # produces 4 groups; no group has <10 members
Since the heights of the splits are given in hc$height
, I could create a set of candidate values using hc$height + 0.00001
and then loop through cuts at each of them. However, I don't see how to parse the cluster size members
out of the dendrogram
class. For example, cut(as.dendrogram(hc), h=3.80)$lower[[1]]$members
returns NULL
, not 66 as desired.
Please note that this is a simpler question than Cutting dendrogram into n trees with minimum cluster size in R which uses the package dynamicTreeCut
; here I am not specifying number of trees, just minimum cluster size. TYVM.
Thanks to @Vlo and @lukeA I'm able to implement a loop. However, I am just posting this for a starting point and certainly open to a more elegant solution.
unnest <- function(x) { # from Vlo's answer
if(is.null(names(x))) x
else c(list(all=unname(unlist(x))), do.call(c, lapply(x, unnest)))
}
cuts <- hc$height + 1e-9
min_size <- 10
smallest <- 0
i <- 0
while(smallest < min_size & i <= length(cuts)){
h_i <- cuts[i <- i+1]
if(i > length(cuts)){
warning("Couldn't find a cluster big enough.")
}
else smallest <-
Reduce(min,
lapply(X = unnest(cut(as.dendrogram(hc), h=h_i)$lower),
FUN = attr, which = "members") ) # from lukeA's comment
}
h_i # returns desired output: [1] 3.79211
This feature is available in the dendextend package with the heights_per_k.dendrogram
function (which also has a faster C++ implementation when loading the dendextendRcpp function).
## Not run:
hc <- hclust(dist(USArrests[1:4,]), "ave")
dend <- as.dendrogram(hc)
heights_per_k.dendrogram(dend)
## 1 2 3 4
##86.47086 68.84745 45.98871 28.36531
As a sidenote, the dendextend package has a cutree.dendrogram
S3 method for dendrograms (which works very similarly to cutree for hclust objects).
This doesn't answer the question, but might be useful for members
extraction if you decide to loop through the h
.
Stealing and modifying some code from here
# Unnest the list/dendogram structure
unnest <- function(x) {
if(is.null(names(x))) {
x
}
else {
c(list(all=unname(unlist(x))), do.call(c, lapply(x, unnest)))
}
}
# Extract the `members` attribute from each dendogram
lapply(X = unnest(cut(as.dendrogram(hc), h=3.8)), FUN = attr, which = "members")
Output:
# Please don't ask me why there are 2 dendograms stored
# in the `$upper` list while `print` displays one
$upper1
[1] 2
$upper2
[1] 2
$lower1
[1] 66
$lower2
[1] 11
$lower3
[1] 24
$lower4
[1] 49