How to set Spark KMeans initial centers

Posted 2019-08-02 16:20

I'm using Spark ML to run KMeans. I have a bunch of data and three existing centers, for example [1.0,1.0,1.0], [5.0,5.0,5.0], and [9.0,9.0,9.0]. How can I tell KMeans to use these three vectors as its centers? I saw that the KMeans object has a seed parameter, but it is a long, not an array. So how can I tell Spark KMeans to use only the existing centers for clustering?

Put another way, I don't understand what seed means in Spark KMeans. I assumed the seeds would be an array of vectors giving the specified centers before clustering runs.

1 answer
The star
#2 · 2019-08-02 17:03

Indeed, seed does not mean what you think, i.e. it is not used for 'seeding' (initializing) the cluster centers, but simply for setting the random seed - you can confirm this in the documentation for the Scala and Python APIs.
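To make the distinction concrete, here is a minimal plain-Python sketch (not Spark code, and not how Spark implements its initialization) of what a random seed does: it only fixes which random draws are made, so the same seed reproduces the same starting points. The helper name pick_random_centers and the sample points are made up for illustration.

```python
import random


def pick_random_centers(points, k, seed):
    """Pick k distinct points as starting centers (hypothetical helper).

    The seed only controls the random number generator; the caller
    has no way to say which points should become centers.
    """
    rng = random.Random(seed)
    return rng.sample(points, k)


# Made-up sample data for illustration.
points = [[1.0, 1.0, 1.0], [5.0, 5.0, 5.0], [9.0, 9.0, 9.0], [2.0, 2.0, 2.0]]

same_a = pick_random_centers(points, 3, seed=42)
same_b = pick_random_centers(points, 3, seed=42)
print(same_a == same_b)  # True: same seed, same draws
```

So a long is the natural type for seed: it makes a run reproducible, but it carries no information about where the centers should start.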

To the best of my knowledge, there is currently (Spark 2.1) no way to supply initial cluster centers for k-means in Spark ML (see this answer for Spark MLlib). The initMode parameter, according to the documentation:

can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++
