In sklearn there is one agglomerative clustering algorithm implemented: the Ward method, which minimizes variance. Usually sklearn is documented with lots of nice usage examples, but I couldn't find any examples of how to use this function.
Basically my problem is to draw a dendrogram according to the clustering of my data, but I don't understand the output from the function. The documentation says that it returns the children, the number of components, the number of leaves and the parents of each node.
Yet for my data samples, the results don't make any sense to me. For a (32, 542) matrix that has been clustered with a connectivity matrix, this is the output:
>>> wt = ward_tree(mymat, connectivity=connectivity, n_clusters=2)
>>> mymat.shape
(32, 542)
>>> wt
(array([[16, 0],
[17, 1],
[18, 2],
[19, 3],
[20, 4],
[21, 5],
[22, 6],
[23, 7],
[24, 8],
[25, 9],
[26, 10],
[27, 11],
[28, 12],
[29, 13],
[30, 14],
[31, 15],
[34, 33],
[47, 46],
[41, 40],
[36, 35],
[45, 44],
[48, 32],
[50, 42],
[38, 37],
[52, 43],
[54, 39],
[53, 51],
[58, 55],
[56, 49],
[60, 57]]), 1, 32, array([32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 32,
33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 53, 48,
48, 51, 51, 55, 55, 57, 50, 50, 54, 56, 52, 52, 49, 49, 53, 60, 54,
58, 56, 58, 57, 59, 60, 61, 59, 59, 61, 61]))
In this case I asked for two clusters from 32 feature vectors. But how are the two clusters visible in the output? Where are they? And what do the children really mean here? How can the child indices be higher than the total number of samples?
Regarding the first element of the output, the documentation says
I had some trouble figuring out what this means, but then this code helped. We generate normally distributed data with two "clusters": one with 3 data points with mean 0, and one with 2 data points with mean 100. So we expect that the first 3 data points will end up in one branch of the output tree and the other 2 in another.
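A minimal sketch of such an experiment, assuming `ward_tree` from `sklearn.cluster` called without a connectivity matrix. The fixed seed and the `tree` list of dicts are illustrative choices, not part of the library:

```python
import itertools

import numpy as np
from sklearn.cluster import ward_tree

# Seeded generator so the run is repeatable (the seed is an arbitrary choice).
rng = np.random.default_rng(0)

# 3 points around mean 0 and 2 points around mean 100, 10 features each.
X = np.concatenate([
    rng.normal(0, 1, size=(3, 10)),
    rng.normal(100, 1, size=(2, 10)),
])

# Without a connectivity matrix, `parents` comes back as None.
children, n_components, n_leaves, parents = ward_tree(X)

# Row i of `children` describes the internal node with id n_leaves + i.
node_ids = itertools.count(n_leaves)
tree = [
    {"node_id": next(node_ids), "left": int(left), "right": int(right)}
    for left, right in children
]

for node in tree:
    print(node)
```

Each printed row is one merge: `left` and `right` are the ids of the two nodes that were joined, and `node_id` is the id given to the newly created node.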
Which produces the tree:
where the numbers are node ids. If node_id < 5 (the number of samples), it is an index to a data point (a leaf node). If node_id >= 5, it is an internal node. We see that the data clusters as expected:
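One way to check this programmatically is to walk the tree from the root and collect the leaves under each of its two branches. This is a sketch under the same assumptions as above; `leaves_under` is a hypothetical helper written for illustration, not a scikit-learn function:

```python
import numpy as np
from sklearn.cluster import ward_tree

rng = np.random.default_rng(0)
X = np.concatenate([
    rng.normal(0, 1, size=(3, 10)),    # samples 0, 1, 2
    rng.normal(100, 1, size=(2, 10)),  # samples 3, 4
])

children, _, n_leaves, _ = ward_tree(X)


def leaves_under(node_id, children, n_leaves):
    """Return the sorted sample indices below `node_id` (hypothetical helper)."""
    if node_id < n_leaves:
        # Ids below n_leaves are the original samples themselves.
        return [int(node_id)]
    # Internal node: its two children are stored in row node_id - n_leaves.
    left, right = children[node_id - n_leaves]
    return sorted(
        leaves_under(left, children, n_leaves)
        + leaves_under(right, children, n_leaves)
    )


# The last row of `children` is the final merge, i.e. the root of the tree.
root_left, root_right = children[-1]
branch_a = leaves_under(root_left, children, n_leaves)
branch_b = leaves_under(root_right, children, n_leaves)
print(branch_a, branch_b)
```

The two branches below the root correspond to the two-cluster solution: one contains samples 0, 1 and 2, the other samples 3 and 4. Which branch is listed first depends on the order of the final merge.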