I am using Spark MLlib for k-means clustering. I have a set of vectors from which I want to determine the most likely cluster center. So I will run k-means training on this set and select the cluster with the highest number of vectors assigned to it.
Therefore I need to know the number of vectors assigned to each cluster after training (i.e. after KMeans.run(...)). But I cannot find a way to retrieve this information from the KMeansModel result. I probably need to run predict on all training vectors and count the label that appears most often.
Is there another way to do this?
Thank you
You are right, this info is not provided by the model, and you have to run predict. Here is an example of doing so in a parallelized way (Spark v. 1.5.1), where cluster_ind is an RDD of the same cardinality as our initial data, showing which cluster each datapoint belongs to. So, here we have two clusters: one with 3 datapoints (cluster 0) and one with 2 datapoints (cluster 1). Notice that we have run the prediction method in a parallel fashion (i.e. on an RDD); collect() is used here only for demonstration purposes, and it is not needed in a 'real' situation. From cluster_ind, we can then get the cluster sizes with countByValue().
From this, we can get the index and size of the biggest cluster:
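A sketch of one way to do it, assuming the cluster sizes have been gathered into a plain dict named cluster_sizes (here filled with the example's values: 3 points in cluster 0, 2 in cluster 1):

```python
from operator import itemgetter

# Illustrative sizes matching the example in the text
cluster_sizes = {0: 3, 1: 2}

# Pick the (index, size) pair with the largest size
biggest_cluster, biggest_size = max(cluster_sizes.items(), key=itemgetter(1))
print(biggest_cluster, biggest_size)  # 0 3
```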
i.e. our biggest cluster is cluster 0, with a size of 3 datapoints, which can be easily verified by inspecting the output of cluster_ind.collect() above.