Spark get top N highest score results for each (it

Published 2019-06-12 08:09


在下西门庆


Question:

I have a DataFrame of the following format:

item_id1: Long, item_id2: Long, similarity_score: Double

What I'm trying to do is to get top N highest similarity_score records for each item_id1. So, for example:

With top 2 similar items would give:

I vaguely guess that it can be done by first grouping records by item_id1, then sorting in reverse by score and then limiting the results. But I'm stuck with how to implement it in Spark Scala.

Thank you.

Answer 1:

I would suggest using window functions for this:

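 import org.apache.spark.sql.expressions.Window
 import org.apache.spark.sql.functions.row_number
 // Rank rows within each item_id1 by descending score and keep the top 2 (the $ syntax needs spark.implicits._)
 df
  .withColumn("rnk", row_number().over(Window.partitionBy($"item_id1").orderBy($"similarity_score".desc)))
  .where($"rnk" <= 2)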

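Alternatively, you could use dense_rank or rank instead of row_number, depending on how you want to handle ties in similarity_score. row_number always keeps exactly N rows per item_id1 and breaks ties arbitrarily, while rank and dense_rank give tied rows the same value and can therefore keep more than N rows.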
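To make the tie-handling difference concrete, here is a minimal sketch, assuming a local SparkSession and made-up sample rows. The object name TopNSketch, the helper topN, and the column name rnk are illustrative choices, not part of the original question or answer.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank, rank, row_number}

object TopNSketch {
  // Local session for experimentation only (assumption: Spark is on the classpath).
  lazy val spark: SparkSession =
    SparkSession.builder().master("local[*]").appName("top-n-sketch").getOrCreate()

  // Made-up rows: item 1 has a three-way tie at the highest score.
  def sample: DataFrame = {
    import spark.implicits._
    Seq(
      (1L, 10L, 0.9), (1L, 11L, 0.9), (1L, 12L, 0.9), (1L, 13L, 0.5),
      (2L, 20L, 0.7), (2L, 21L, 0.3), (2L, 22L, 0.1)
    ).toDF("item_id1", "item_id2", "similarity_score")
  }

  // Keep rows whose ranking value within each item_id1 is at most n.
  // rankFn is one of row_number(), rank(), or dense_rank().
  def topN(df: DataFrame, n: Int, rankFn: Column): DataFrame = {
    val byScoreDesc =
      Window.partitionBy(col("item_id1")).orderBy(col("similarity_score").desc)
    df.withColumn("rnk", rankFn.over(byScoreDesc)).where(col("rnk") <= n)
  }

  def main(args: Array[String]): Unit = {
    // Exactly 2 rows per item; which of the tied rows survive is arbitrary.
    topN(sample, 2, row_number()).show()
    // Tied rows share a rank: item 1 keeps all three 0.9 rows.
    topN(sample, 2, rank()).show()
    // No gaps after ties: item 1 keeps all four of its rows.
    topN(sample, 2, dense_rank()).show()
    spark.stop()
  }
}
```

If exactly N rows per item are required, row_number is the safer choice; adding a second sort key such as item_id2 to the orderBy would make the tie-breaking deterministic.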

Tags: scala apache-spark spark-dataframe rdd
