Why not use just Canopy clustering instead of comb

Posted 2019-08-03 02:49

Question:

The question is in the title - if Canopy can be used for clustering, as well as for determining centroids, why not use it for clustering, instead of using it just to generate centroids as input for KMeans clustering?

I'm considering an implementation using Mahout, but I think this is more of a conceptual question than one tied to a particular system.

Thanks

Answer 1:

Canopy has been deprecated in Mahout, so I wouldn't use it at all.

It is fast, so the idea was to produce a quick, better-than-random estimate of the starting centroids so that k-means would converge faster.
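To make the "quick estimate" concrete, here is a minimal sketch of a canopy pass in plain Python. It is not Mahout's implementation: the function name `canopy_centers`, the thresholds `t1` (loose) and `t2` (tight), and the choice to report each canopy's mean as its centroid are all assumptions made for illustration. The point is only that every canopy is fixed the moment it is created and nothing is revisited afterwards.

```python
import math


def canopy_centers(points, t1, t2):
    """Illustrative canopy pass; returns one centroid per canopy.

    t1 is the loose threshold (canopy membership), t2 the tight one
    (points this close to a seed can no longer seed a canopy themselves).
    """
    if not (t1 > t2 > 0):
        raise ValueError("expected t1 > t2 > 0")
    remaining = list(points)
    centroids = []
    while remaining:
        seed = remaining[0]
        # Everything within the loose threshold belongs to this canopy.
        members = [p for p in remaining if math.dist(seed, p) < t1]
        # Report the mean of the members as the canopy's centroid.
        centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        # Drop points within the tight threshold; the seed itself always goes,
        # so the loop terminates.
        remaining = [p for p in remaining if math.dist(seed, p) >= t2]
    return centroids


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5),
              (10.0, 10.0), (10.5, 10.0), (10.0, 10.5)]
    print(canopy_centers(sample, t1=3.0, t2=1.5))
```

Note that the result depends on the order of the points and on the two thresholds, and there is no step that goes back and improves a centroid once it has been emitted.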

Canopy has no convergence criterion, so its first guess is all you get. K-means iterates with a descent procedure (Lloyd's algorithm, which alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its points) to find a local minimum of the defined error function. So it converges toward better guesses, but you generally start from random centroids and hope they were placed well. Canopy was an attempt to place the starting centroids better, but it did not work much better than random, if at all.

So you could just take Canopy's guess and compute clusters by going through all the vectors and finding which canopy centroid each one is closest to, but those clusters would not have the benefit of iteration and would generally score worse on cross-validation tests.
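A small sketch of that trade-off, again in plain Python rather than Mahout. The helper names (`nearest`, `sse`, `lloyd_step`, `kmeans`) and the starting centers are hypothetical; the centers simply stand in for whatever Canopy would have produced. It scores the one-pass assignment with the sum of squared distances, then lets k-means refine the same centers so the two can be compared.

```python
import math


def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(math.dist(p, centers[nearest(p, centers)]) ** 2 for p in points)


def lloyd_step(points, centers):
    """One k-means iteration: assign points, then recompute each centroid."""
    groups = [[] for _ in centers]
    for p in points:
        groups[nearest(p, centers)].append(p)
    updated = []
    for old, group in zip(centers, groups):
        if group:
            updated.append(tuple(sum(c) / len(group) for c in zip(*group)))
        else:
            updated.append(old)  # keep an empty cluster's center where it was
    return updated


def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Repeat lloyd_step until the centers stop moving (or max_iter is hit)."""
    for _ in range(max_iter):
        updated = lloyd_step(points, centers)
        shift = max(math.dist(a, b) for a, b in zip(centers, updated))
        centers = updated
        if shift < tol:
            break
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
            (9.0, 9.0), (10.0, 9.0), (9.0, 10.0)]
    start = [(0.0, 0.0), (9.0, 9.0)]  # stand-in for canopy centers
    print("one-pass SSE:", sse(data, start))
    print("refined SSE: ", sse(data, kmeans(data, start)))
```

Each Lloyd step can only lower or keep the error, so on the training data the refined centers are never worse than the one-pass assignment. How much that matters on held-out data depends on the dataset.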