Drop duplicates for each partition

Posted 2019-03-06 07:24

Question:

Original data (---- marks a partition boundary):

cls, id  
----
a, 1
a, 1
----
b, 3
b, 3
b, 4

Expected output:

cls, id  
----
a, 1
----
b, 3
b, 4

id can be duplicated only within the same cls; in other words, the same id never appears across different cls values.

In that case,

df.dropDuplicates("id")

shuffles data across all partitions to check for duplicates over every cls, and the result is repartitioned into 200 partitions (the default value).
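For reference, that shuffle width is controlled by spark.sql.shuffle.partitions; a minimal sketch, assuming a SparkSession named spark:

// Number of partitions produced by wide operations such as dropDuplicates
spark.conf.get("spark.sql.shuffle.partitions")        // "200" by default
spark.conf.set("spark.sql.shuffle.partitions", "50")  // tune it if needed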

Now, how can I run dropDuplicates for each partition separately to reduce the computing cost?

Something like:

df.foreachPartition(_.dropDuplicates())

Answer 1:

You're probably after something like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import scala.collection.mutable

val distinct = df.mapPartitions { it =>
  // deduplicate within this partition only; no shuffle is triggered
  val set = mutable.Set.empty[Row]
  while (it.hasNext) {
    set += it.next()
  }
  set.iterator
}(RowEncoder(df.schema))
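Note that per-partition deduplication is only globally correct because each cls lives in a single partition. A small usage sketch (the sample data, the spark session name, and the repartition call are illustrative assumptions; the repartition just reproduces the layout described in the question, and in a real job the DataFrame would be prepared like this before the mapPartitions step):

import spark.implicits._

// Build the question's sample and co-locate each cls in one partition
val df = Seq(("a", 1), ("a", 1), ("b", 3), ("b", 3), ("b", 4))
  .toDF("cls", "id")
  .repartition($"cls")

With df laid out like this, the mapPartitions snippet above yields (a, 1), (b, 3), (b, 4), matching the expected output, without the full shuffle that dropDuplicates would perform.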