I am having trouble fully understanding the k-means++ algorithm. I am interested in exactly how the first k centroids are picked (the rest works like the original k-means).
- Is the probability function used based on distance or Gaussian?
- At the same time, is the point most distant from the other centroids picked as a new centroid?
I would appreciate a step-by-step explanation and an example. The one in Wikipedia is not clear enough. A very well commented source implementation would also help. If you are using 6 arrays, then please tell us which one is for what.
Interesting question. Thank you for bringing this paper to my attention. PDF link here to the original paper.
In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.

Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1 - x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).

Suppose c2 = 4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
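The arithmetic in those two steps can be checked directly with a few lines of Python (a minimal sketch; the variable names points, centers, and d2 are my own):

```python
# Observations and the first center from the example above.
points = [0, 1, 2, 3, 4]
centers = [0]  # c1 = 0

# Squared distance from each point to its nearest chosen center.
d2 = [min((p - c) ** 2 for c in centers) for p in points]
probs = [d / sum(d2) for d in d2]
print(probs)  # P(c2 = x), proportional to [0, 1, 4, 9, 16]

# Suppose c2 = 4; recompute distances with both centers.
centers.append(4)
d2 = [min((p - c) ** 2 for c in centers) for p in points]
probs2 = [d / sum(d2) for d in d2]
print(probs2)  # P(c3 = x), proportional to [0, 1, 4, 1, 0]
```

Note that the point already chosen as a center has squared distance 0, so it can never be picked again.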
I've coded the initialization procedure in Python; I don't know if this helps you.
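The snippet itself did not survive here, but a sketch consistent with the clarification below (a cumulative sum of the probabilities, a uniform draw r, and a for loop with a break) could look like this; the function name initialize and the deterministic choice of the first center are my assumptions:

```python
import numpy as np

def initialize(X, K):
    """k-means++ seeding: X is an (n, d) array of observations,
    K is the number of centers to choose."""
    C = [X[0]]  # first center; could just as well be a random point
    for _ in range(1, K):
        # Squared distance from each point to its nearest chosen center.
        D2 = np.array([min(np.inner(c - x, c - x) for c in C) for x in X])
        probs = D2 / D2.sum()
        cumprobs = probs.cumsum()
        # r falls into exactly one interval of [0, 1] partitioned by
        # cumprobs; the matching point becomes the next center.
        r = np.random.rand()
        for j, p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C
```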
EDIT with clarification: The output of cumsum gives us boundaries to partition the interval [0, 1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So then, since r is uniformly chosen from [0, 1], it will fall into exactly one of these intervals (because of the break). The for loop checks to see which partition r is in.
One liner.
Say we need to select 2 cluster centers. Instead of selecting them all randomly (as we do in simple k-means), we select the first one randomly, then find the points that are farthest from the first center (these points most probably do not belong to the first cluster, as they are far from it) and assign the second cluster center near those far points.
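As a rough illustration of that idea in Python (the names X, rng, c1, and c2 are mine; the bias toward far points is the usual squared-distance weighting rather than a hard farthest-point choice):

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
rng = np.random.default_rng(42)

c1 = rng.choice(X)                   # first center: uniform random
d2 = (X - c1) ** 2                   # squared distance to c1
c2 = rng.choice(X, p=d2 / d2.sum())  # second center: far points favored
```

Because c1's own squared distance is 0, its selection probability for the second draw is 0, so c2 is guaranteed to differ from c1.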
I have prepared a full source implementation of k-means++ based on the book "Programming Collective Intelligence" by Toby Segaran and the k-means++ initialization provided here.
Indeed, there are two distance functions here. For the initial centroids a standard Euclidean one based on numpy.inner is used, and then for fixing the centroids the Pearson one is used. Maybe the Pearson distance could also be used for the initial centroids; they say it is better.
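For reference, the Pearson-based distance of the kind the book uses is 1 minus the Pearson correlation coefficient; a minimal sketch:

```python
import numpy as np

def pearson_distance(v1, v2):
    """1 - Pearson correlation: 0 for perfectly correlated vectors,
    up to 2 for perfectly anti-correlated ones."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    return 1.0 - np.corrcoef(v1, v2)[0, 1]
```

Unlike the Euclidean distance, this treats vectors that differ only by scale or offset as identical, which is why it can behave better on data where overall magnitude is uninformative.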
data.txt: