Extract kmeans cluster information using Apache Spark

Published 2020-07-23 08:32


I've implemented the Apache Spark example at


Here is the source:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)

Using this dataset:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

I can extract the cluster centers using :


which returns


But there are some things I'm not sure of, which do not seem to be supported by the API:

How can I extract which points have been assigned to each of the two clusters?

How can I attach a label to each data point so that, when viewing the points in each cluster, I can also see each point's label? Do I need to modify the Spark KMeans implementation to achieve this?


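If you are using Java:

JavaRDD<Integer> cluster_indices = clusters.predict(parsedData);

This works because predict is overloaded and also accepts a JavaRDD<Vector> (so parsedData must be a JavaRDD<Vector> here).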
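
For comparison, the same overloading can be seen from Scala, the language used in the question. The following is a minimal sketch, assuming a local Spark master and the question's six points built in memory rather than read from a file; the object name `PredictOverloads` is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

object PredictOverloads {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master, used here for illustration only.
    val sc = SparkContext.getOrCreate(
      new SparkConf().setAppName("PredictOverloads").setMaster("local[*]"))

    // The six points from the question's dataset.
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0, 0.0),
      Vectors.dense(0.1, 0.1, 0.1),
      Vectors.dense(0.2, 0.2, 0.2),
      Vectors.dense(9.0, 9.0, 9.0),
      Vectors.dense(9.1, 9.1, 9.1),
      Vectors.dense(9.2, 9.2, 9.2)
    )).cache()

    // Same parameters as in the question: 2 clusters, 20 iterations.
    val model = KMeans.train(points, 2, 20)

    // Overload 1: a single Vector in, a single cluster index out.
    val one: Int = model.predict(Vectors.dense(9.0, 9.0, 9.0))

    // Overload 2: an RDD[Vector] in, an RDD[Int] of cluster indices out.
    val all: RDD[Int] = model.predict(points)

    println(s"single point -> cluster $one")
    println(s"all points   -> ${all.collect().mkString(", ")}")

    sc.stop()
  }
}
```

Which index (0 or 1) a given group of points receives depends on the initialization, so the printed indices may differ between runs.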


The method you are looking for is predict(), but it does not belong to KMeans (KMeans.scala). It is part of the KMeansModel class (KMeansModel.scala), which is the return type of KMeans.train(...).

The use would be:
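
A minimal sketch of such usage follows. It is one possible way to answer both questions with the existing API, not necessarily the exact snippet the answerer had in mind. It assumes a local Spark master; the labels `a` to `f` and the object name `ClusterMembership` are hypothetical and only serve to show that labels can travel alongside the points without modifying the KMeans implementation.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object ClusterMembership {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master, used here for illustration only.
    val sc = SparkContext.getOrCreate(
      new SparkConf().setAppName("ClusterMembership").setMaster("local[*]"))

    // Hypothetical labels attached to the question's six points.
    val labeled = sc.parallelize(Seq(
      ("a", Vectors.dense(0.0, 0.0, 0.0)),
      ("b", Vectors.dense(0.1, 0.1, 0.1)),
      ("c", Vectors.dense(0.2, 0.2, 0.2)),
      ("d", Vectors.dense(9.0, 9.0, 9.0)),
      ("e", Vectors.dense(9.1, 9.1, 9.1)),
      ("f", Vectors.dense(9.2, 9.2, 9.2))
    ))

    // Train on the vectors only; the labels play no part in clustering.
    val vectors = labeled.map(_._2).cache()
    val model = KMeans.train(vectors, 2, 20)

    // Predict each point's cluster index and keep its label next to it.
    val assigned = labeled.map { case (label, point) =>
      (model.predict(point), (label, point))
    }

    // Gather the members of each cluster on the driver (small data only).
    // On Spark versions before 1.3, groupByKey also needs
    // `import org.apache.spark.SparkContext._` for the pair-RDD implicits.
    val byCluster = assigned.groupByKey().collect().sortBy(_._1)

    byCluster.foreach { case (cluster, members) =>
      println(s"cluster $cluster, center ${model.clusterCenters(cluster)}:")
      members.foreach { case (label, point) => println(s"  $label -> $point") }
    }

    sc.stop()
  }
}
```

If no labels are needed, `model.predict(vectors)` returns the cluster index of every point in a single call, and `vectors.zip(model.predict(vectors))` pairs each point with its index. For large datasets, collecting every cluster member on the driver is not advisable; writing `assigned` out with `saveAsTextFile` or aggregating per cluster would be the safer route.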
