I'm using Spark ML to run KMeans. I have a bunch of data and three existing centers, for example: [1.0,1.0,1.0], [5.0,5.0,5.0], [9.0,9.0,9.0].
How can I specify that the KMeans centers are the three vectors above?
I saw that the KMeans object has a seed parameter, but it is of type long, not an array. So how can I tell Spark KMeans to use only the existing centers for clustering?
Put differently, I don't understand what seed means in Spark KMeans; I assumed the seeds would be an array of vectors representing the specified centers before clustering runs.
Indeed, seed does not mean what you think: it is not used for 'seeding' (initializing) the cluster centers, but simply for setting the random seed. You can confirm this in the documentation for the Scala and Python APIs.
To the best of my knowledge, there is currently (Spark 2.1) no way to supply initial cluster centers for k-means in Spark ML (see this answer for Spark MLlib).
The initMode parameter, according to the documentation: