Optimizing K-means algorithm

Posted 2019-05-21 10:16

Question:

I am trying to follow a paper called An Optimized Version of K-Means Algorithm. I have an idea of how the K-means algorithm works: it groups the tuples/points into clusters and updates the centroids.

I am trying to implement the method mentioned in the above paper. Their proposed algorithm is this (the listing appears as an image in the original post):

So my doubt is about the second step. I didn't understand what is being done there! The paper says that we group our data into wider intervals based on the value of e, so that later we can avoid iterating through the entire dataset. So how do we actually store the data in those I (intervals)? Are we supposed to define a multidimensional array? It's not making sense to me (probably I'm just not getting the idea).

The next thing I have a doubt about is step 5. There, Cw is said to be the closest centroid to the point. But how do we figure that out? At first we randomly assign points as the centroids. So are we supposed to loop through the points and find Cw (the closest centroid) for each one before calculating e?

The next doubt is about step 6, which I guess I will understand once my previous question about step 2 is answered.

And the final doubt is about step 7. What is meant by CjCj'? The distance between the previous centroid position and the updated centroid position?

I have been brainstorming about this the whole day. Any clues would be highly appreciated. Thank you.

Answer 1:

This algorithm revolves around the idea that points which are close to one cluster center and further away from all other cluster centers will stick to that cluster center. Thus, in subsequent iterations these points don't have to be compared to all cluster centers.

Imagine a point P whose distance to its assigned cluster is 3 (i.e. that is the closest distance) and all other clusters are at least 8 away. Now compute the new cluster centers. Let's say the most any cluster center moved is 2 (that is the value of D in the algorithm, line 7).

You can now be sure that your point still belongs to the same cluster. This is easy to see if you think about the worst case: the assigned cluster could have moved away from the point, so its new distance is at worst 3 + 2 = 5. The closest other cluster could have moved towards the point, so its distance is now at least 8 - 2 = 6. Since 5 < 6, the point cannot have switched clusters, and you don't need to update it.
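Here is a minimal sketch of that bound check in Python (the names `d_assigned`, `d_other`, and `D` are just the numbers from the example, not the paper's notation):

```python
# Worst-case bound check from the example above.
d_assigned = 3.0  # distance from P to its assigned centroid
d_other = 8.0     # distance from P to the nearest other centroid
D = 2.0           # largest distance any centroid moved this iteration

# After the update, the assigned centroid is at most d_assigned + D away,
# and every other centroid is at least d_other - D away.
if d_assigned + D < d_other - D:  # equivalently: d_other - d_assigned > 2 * D
    print("P is guaranteed to keep its cluster; skip the full distance check")
else:
    print("P might have switched clusters; recompute its distances")
```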


Your specific questions:

Step 2: Create intervals into which you will later put the points. In step 5 you'll compute the e values. If e is 5, you'd put that point into the interval [4, 6) (if you have such an interval). There's a rough sketch of this bookkeeping after these step explanations.

Step 5: Compute the distance of the point to all cluster centers. The closest one is Cw. If the next closest is C3, then e = C3 - Cw (the difference of the two distances). I'd call e the safety margin: it gives you the maximal distance a cluster center can move before assignments might switch.

Step 6: explained with step 2.

Step 7: CjCj' is the distance that Cj moved when it was updated.

Step 7 & 8: These are done as part of the for loop. (In Step 9 they say "go back to 4").

Step 9: Continue the loop with all points that might have changed clusters.
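Putting steps 4-9 together, here is a rough Python sketch of one pass, under my reading of the paper (the bucketing `width`, the `tags` bookkeeping, and the 2*D decrement are my assumptions, not the paper's exact notation):

```python
import numpy as np

def optimized_kmeans_pass(points, centroids, width=1.0):
    """One illustrative pass over an (n, d) point array: assign points,
    bucket the safety margins e into intervals, then shrink the interval
    tags by the centroid drift."""
    k = len(centroids)
    intervals = {}  # bucket index -> list of point indices (the I's)
    tags = {}       # bucket index -> remaining safety margin (lower edge)
    assignment = np.empty(len(points), dtype=int)

    # Steps 4-6: for each point, find Cw and the runner-up, compute e,
    # and drop the point into the interval [b*width, (b+1)*width).
    for i, p in enumerate(points):
        d = np.linalg.norm(centroids - p, axis=1)
        order = np.argsort(d)
        assignment[i] = order[0]             # Cw, the closest centroid
        e = d[order[1]] - d[order[0]]        # safety margin
        b = int(e // width)
        intervals.setdefault(b, []).append(i)
        tags[b] = b * width                  # conservative lower edge

    # Step 7: recompute centroids and measure how far each one moved.
    new_centroids = np.array([
        points[assignment == j].mean(axis=0) if np.any(assignment == j)
        else centroids[j]                    # keep an empty cluster in place
        for j in range(k)
    ])
    D = np.linalg.norm(new_centroids - centroids, axis=1).max()

    # Step 8: each margin can shrink by at most 2*D (the closest centroid
    # moves away by D while the runner-up moves closer by D).
    for b in tags:
        tags[b] -= 2 * D

    # Step 9: only points in intervals whose tag dropped below 0 need
    # their distances recomputed on the next pass.
    dirty = [i for b, idx in intervals.items() if tags[b] < 0 for i in idx]
    return new_centroids, assignment, dirty
```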

More generally

This algorithm assumes that you know k-means. I believe this is a very fair assumption. The paper is not very explicit about when to terminate and how to initialise the algorithm; I'd consider both common knowledge.

  • Initialise the cluster centers by assigning one random point to each. Other initialisation routines exist; the paper is not concerned with this part.
  • Terminate the algorithm when there are no more intervals whose tag is less than 0. This is equivalent to the normal termination criterion in k-means: stop when no point is assigned to a different cluster. Both conventions are sketched below.
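A minimal sketch of those two conventions (plain NumPy; `tags` is the dict from the sketch above, and these helper names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_centroids(points, k):
    # Pick k distinct random points as the initial centers.
    idx = rng.choice(len(points), size=k, replace=False)
    return points[idx].copy()

def should_terminate(tags):
    # Stop once no interval's tag has dropped below 0,
    # i.e. no point can have switched clusters.
    return all(t >= 0 for t in tags.values())
```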


Answer 2:

Second step: The intervals are on the distance metric, not on the coordinates of the data.

Fifth step: This is essentially distance to the second closest cluster.

Let me know if you still don't understand the rest.



Answer 3:

It may be beside the point if you insist on exactly this algorithm, but the state of the art in k-means is the algorithm described by Ravi Kannan in

https://www.youtube.com/watch?v=n3LQqUQb870

I just wanted to say this in case other people trying to implement k-means clustering come across this page. The difficulty in k-means is the initialization. The paper you are talking about describes a running-time optimization, but that optimization does not improve the clustering itself. That algorithm also only works in 2D.

The state of the art is to run an SVD first and do the clustering in the principal subspace of dimension k. The SVD also gives a nice initialization strategy; there's a sketch below.
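A rough sketch of that idea (my own illustration, not taken from the lecture; it uses scikit-learn for the k-means step): project onto the top-k right singular vectors, then run ordinary k-means in that subspace.

```python
import numpy as np
from sklearn.cluster import KMeans

def svd_then_kmeans(X, k, seed=0):
    # Center the data and take the top-k principal subspace via SVD...
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T  # coordinates in the k-dim principal subspace
    # ...then cluster in that subspace.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Z)
    return km.labels_, km.cluster_centers_
```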

There are other tricks in k-means, like thresholding the vectors and so on. Ravi Kannan's lectures on YouTube should give you a hint.

Another simple trick which often works is the so-called furthest-first traversal, sketched below.
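A small sketch of furthest-first traversal for picking initial centers (again my own illustration): start from one random point, then repeatedly add the point furthest from all centers chosen so far.

```python
import numpy as np

def furthest_first(points, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # Distance of every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers],
                   axis=0)
        # Greedily add the point that is furthest from all chosen centers.
        centers.append(points[int(np.argmax(d))])
    return np.array(centers)
```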