Why not use just Canopy clustering instead of comb

Posted 2019-08-03 03:05

The question is in the title: if Canopy can be used for clustering as well as for determining centroids, why not use it for the clustering itself, instead of using it only to generate initial centroids as input for KMeans clustering?

I'm considering an implementation using Mahout, but I think this is more of a conceptual question than one tied to a particular system.

Thanks

Tags: machine-learning mahout

1 answer

虎瘦雄心在

#2 · 2019-08-03 03:31

Canopy is deprecated in Mahout, so I wouldn't use it at all.

It is fast, so the idea was to make a quick, better-than-random estimate of the starting centroids so that KMeans would converge more quickly.

Canopy has no convergence criterion, so its first guess is all you get. KMeans iterates, in a way similar to gradient descent, to find a local minimum of the defined error function. So it converges towards better guesses, but generally you start from random centroids, hoping they were placed well. Canopy was an attempt to place the starting centroids better, but it did not work much better than random, if at all.
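To make the "first guess is all you get" point concrete, here is a minimal sketch of a single canopy pass. It is a simplified textbook version, not Mahout's implementation; the function name `canopy_centers` and the thresholds `t1` (loose) and `t2` (tight) are names chosen for this illustration.

```python
import math

def canopy_centers(points, t1, t2):
    """Single-pass canopy selection (simplified sketch, not Mahout's code).

    t1 is the loose threshold, t2 the tight one, with t1 > t2.
    Returns a list of (center, members) pairs.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        # Take the next unclaimed point as a canopy center.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                # Loosely close: the point joins this canopy.
                members.append(p)
            if d >= t2:
                # Not tightly close: it may still seed or join another canopy.
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    # There is no second pass: the centers are never revised.
    return canopies

demo = canopy_centers([(0, 0), (0.5, 0), (10, 10), (10.5, 10)], t1=3.0, t2=1.0)
print([center for center, _ in demo])
```

Note that the result depends on the order of the points and on the two thresholds, and nothing in the loop measures or reduces an error function.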

So you could just take Canopy's guess and compute clusters by going through all the vectors and finding which canopy centroid each one is closest to. But those clusters would not have the benefit of iteration and would score worse on cross-validation tests.
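As a rough sketch of that comparison, the snippet below assigns each vector to its nearest starting centre once, then lets KMeans-style iterations refine the centres. The toy data, the `initial` centres (standing in for canopy output) and all function names are assumptions made for this example, and it measures within-cluster squared error rather than a cross-validation score.

```python
import math

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]

def sse(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[k]) ** 2 for p, k in zip(points, labels))

def update(points, centers, labels):
    """Move each center to the mean of its assigned points."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
        else:
            new_centers.append(old)  # keep an empty cluster's center in place
    return new_centers

# Toy data and hypothetical canopy centres (illustrative values only).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
initial = [(0.0, 0.0), (10.0, 0.0)]

# Canopy-only clustering: one nearest-centre assignment, no refinement.
labels = assign(points, initial)
one_shot_error = sse(points, initial, labels)

# KMeans-style refinement starting from the same centres.
centers = initial
for _ in range(10):
    centers = update(points, centers, labels)
    labels = assign(points, centers)
refined_error = sse(points, centers, labels)

print(one_shot_error, refined_error)
```

On this toy data the one-shot error is 10 and the refined error is 4, because the iterations move each centre from the first point of its group to the group mean. Each update and reassignment step can only keep the error the same or lower it, which is the benefit the single canopy pass lacks.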
