I'm using K-means for clustering, following this tutorial and API.
But I want to use a custom formula to calculate distances. So how can I pass a custom distance function to k-means in PySpark?
In general, using a different distance measure doesn't make sense, because the k-means algorithm (unlike k-medoids) is well defined only for Euclidean distances.
See Why does k-means clustering algorithm use only Euclidean distance metric? for an explanation.
Moreover, MLlib algorithms are implemented in Scala, and PySpark provides only the wrappers required to execute that Scala code. Therefore providing a custom metric as a Python function wouldn't be technically possible without significant changes to the API.
Please note that since Spark 2.4 there are two built-in distance measures, "euclidean" (the default) and "cosine", which can be used with pyspark.ml.clustering.KMeans and pyspark.ml.clustering.BisectingKMeans (see the DistanceMeasure Param). Use at your own risk.