I'm using K-means for clustering, following this tutorial and API.
But I want to use a custom formula to calculate distances. So how can I pass a custom distance function to k-means in PySpark?
In general, using a different distance measure doesn't make sense, because the k-means algorithm (unlike k-medoids) is well defined only for Euclidean distances.
See Why does k-means clustering algorithm use only Euclidean distance metric? for an explanation.
Moreover, MLlib algorithms are implemented in Scala, and PySpark provides only the wrappers required to execute that Scala code. Therefore providing a custom metric as a Python function wouldn't be technically possible without significant changes to the API.
Please note that since Spark 2.4 there are two built-in distance measures, "euclidean" (the default) and "cosine", which can be used with pyspark.ml.clustering.KMeans and pyspark.ml.clustering.BisectingKMeans (see the DistanceMeasure Param). Use at your own risk.