How to compute cluster assignments from linkage/di

Posted 2020-02-10 02:18

Question:

If you have this hierarchical clustering call in SciPy (Python):

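from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
# dist_matrix is the full (square, N x N) distance matrix;
# squareform converts it to the condensed form that linkage expects
linkage_matrix = linkage(squareform(dist_matrix), linkage_method)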

Then what is an efficient way to go from this to cluster assignments for individual points? That is, a vector of length N, where N is the number of points and entry i is the cluster number of point i, for the clusters produced by applying a given threshold thresh to the resulting clustering.

To clarify: the cluster number of a point is the cluster it falls in after applying a threshold to the tree, so each leaf node gets exactly one cluster. In other words, each point belongs to one "most specific cluster", which is determined by the threshold at which you cut the dendrogram.

I know that scipy.cluster.hierarchy.fclusterdata gives you this cluster assignment as its return value, but I am starting from a custom-made distance matrix and distance metric, so I cannot use fclusterdata. The question boils down to: how can I compute what fclusterdata computes, namely the cluster assignments?

Answer 1:

If I understand you right, that is what fcluster does:

scipy.cluster.hierarchy.fcluster(Z, t, criterion='inconsistent', depth=2, R=None, monocrit=None)

Forms flat clusters from the hierarchical clustering defined by the linkage matrix Z.

...

Returns: An array of length n. T[i] is the flat cluster number to which original observation i belongs.

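So just call fcluster(linkage_matrix, t), where t is your threshold. Note that with the default criterion='inconsistent', t is compared against inconsistency coefficients; if you want to cut the dendrogram at a given distance, pass criterion='distance'.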
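As a minimal sketch of the whole path from a custom distance matrix to per-point labels: the toy points, the absolute-difference metric, the "average" method and the threshold value below are all made up for illustration and are not from the question. It assumes you want to cut the tree at a distance, hence criterion="distance".

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical toy data: two well-separated groups of points on a line.
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])

# Stand-in for a custom square (N x N) distance matrix:
# here simply the absolute difference between every pair of points.
dist_matrix = np.abs(points[:, None] - points[None, :])

# linkage expects the condensed form, which squareform produces.
linkage_matrix = linkage(squareform(dist_matrix), method="average")

# Cut the dendrogram at distance 1.0 (an illustrative threshold).
thresh = 1.0
labels = fcluster(linkage_matrix, t=thresh, criterion="distance")

# labels[i] is the flat cluster number of point i.
print(labels)
```

The result has one entry per original point, in the same order as the rows of dist_matrix, so it can be used directly to index or group the points.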



Answer 2:

If you'd like to see the members at every cluster level, and the order in which they are agglomerated, see https://stackoverflow.com/a/43170608/5728789
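One possible way to list the members of each cluster in merge order, not necessarily the approach taken in the linked answer, is to walk the tree returned by scipy.cluster.hierarchy.to_tree. The toy points and the "single" method below are assumptions chosen only to keep the example small.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import squareform

# Hypothetical toy data with clearly distinct gaps between neighbours.
points = np.array([0.0, 0.1, 0.3, 5.0, 5.4])
dist_matrix = np.abs(points[:, None] - points[None, :])
linkage_matrix = linkage(squareform(dist_matrix), method="single")

n = len(points)

# With rd=True, to_tree also returns a list in which nodelist[i]
# is the tree node whose id is i. Ids 0..n-1 are the original points;
# id n + k is the cluster created by row k of the linkage matrix.
root, nodelist = to_tree(linkage_matrix, rd=True)

merges = []
for step, row in enumerate(linkage_matrix):
    node = nodelist[n + step]
    # pre_order collects a value for every leaf under this node.
    members = sorted(node.pre_order(lambda leaf: leaf.id))
    merges.append((float(row[2]), members))
    print(f"step {step}: distance {row[2]:.3f}, members {members}")
```

Each entry of merges pairs the distance at which a cluster was formed with the indices of the original points it contains, in the order the merges happened.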