How do you visualize a ward tree from sklearn.clus

2019-03-26 04:02发布

问题:

In sklearn there is one agglomerative clustering algorithm implemented, the ward method minimizing variance. Usually sklearn is documented with lots of nice usage examples, but I couldn't find examples of how to use this function.

Basically my problem is to draw a dendrogram according to the clustering of my data, but I don't understand the output from the function. The documentation says that it returns the children, the number of components, the number of leaves and the parents of each node.

Yet for my data samples, the results don't give any meaning. For a (32,542) matrix that has been clustered with a connectivity matrix this is the output:

>>> wt = ward_tree(mymat, connectivity=connectivity, n_clusters=2)

>>> mymat.shape
(32, 542)
>>> wt
(array([[16,  0],
       [17,  1],
       [18,  2],
       [19,  3],
       [20,  4],
       [21,  5],
       [22,  6],
       [23,  7],
       [24,  8],
       [25,  9],
       [26, 10],
       [27, 11],
       [28, 12],
       [29, 13],
       [30, 14],
       [31, 15],
       [34, 33],
       [47, 46],
       [41, 40],
       [36, 35],
       [45, 44],
       [48, 32],
       [50, 42],
       [38, 37],
       [52, 43],
       [54, 39],
       [53, 51],
       [58, 55],
       [56, 49],
       [60, 57]]), 1, 32, array([32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,     45, 46, 47, 32,
       33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 53, 48,
   48, 51, 51, 55, 55, 57, 50, 50, 54, 56, 52, 52, 49, 49, 53, 60, 54,
   58, 56, 58, 57, 59, 60, 61, 59, 59, 61, 61]))

In this case I asked for two clusters, with 32 vectors containing features. But how are the two clusters visible in the data? Where are they? And what do the children really mean here, how can the children be higher numbers than the total number of samples?

回答1:

About the first argument of output, the documentation says

The children of each non-leaf node. Values less than n_samples refer to leaves of the tree. A greater value i indicates a node with children children[i - n_samples].

I had some trouble figuring what this means, but then this code helped. We generate normally distributed data with two "clusters", one with 3 data points with mean 0, and one with 2 data points with mean 100. So we expect that the 3 first data point will end up in one branch of the output tree and the the other 2 in another.

from sklearn.cluster import ward_tree
import numpy as np
import itertools
X = np.concatenate([np.random.randn(3, 10), np.random.randn(2, 10) + 100])
w = ward_tree(X)
ii = itertools.count(w[2])
[{'node_id': next(ii), 'left': x[0], 'right':x[1]} for x in w[0]]

Which produces the tree:

[{'node_id': 5, 'right': 2, 'left': 1},
 {'node_id': 6, 'right': 4, 'left': 3}, 
 {'node_id': 7, 'right': 5, 'left': 0}, 
 {'node_id': 8, 'right': 7, 'left': 6}]

where the numbers are node id's. If node_id < 5 (the number of samples) then it's an index to a data point (or leaf node). If node_id >= 5 then it's an internal node. We see that the data clusters as expected:

         8
     /       \
    7         \  
   / \         \
  5   \         6
 / \   \       / \
1   2   0     3   4