Question:

In the Spark SQL 1.6 API (Scala), DataFrame has functions for intersect and except, but not one for symmetric difference. Obviously, a combination of union and except can be used to generate it:

df1.except(df2).unionAll(df2.except(df1))

But this seems a bit awkward. In my experience, if something seems awkward, there's a better way to do it, especially in Scala.

Answer 1:

You can always rewrite it as:

df1.unionAll(df2).except(df1.intersect(df2))

Seriously though, UNION, INTERSECT and EXCEPT / MINUS are pretty much the standard set of SQL combining operators. I am not aware of any system that provides an XOR-like operation out of the box, most likely because it is trivial to implement with the other three and there is not much to optimize there.
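The one-liner above can be wrapped in a small helper. The following is a minimal sketch, assuming Spark 2.0 or later (where union replaces unionAll) and a spark-shell or script context; the name symmetricDifference and the sample data are made up for illustration and are not part of the Spark API.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Local session for experimentation; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("symmetric-difference")
  .getOrCreate()
import spark.implicits._

// Hypothetical helper (not a Spark API): rows that occur in exactly one of the two inputs.
// Both inputs need the same number of columns, with compatible types, in the same order.
// On Spark 1.6, replace union with unionAll.
def symmetricDifference(left: DataFrame, right: DataFrame): DataFrame =
  left.union(right).except(left.intersect(right))

val df1 = Seq(1, 2, 3).toDF("id")
val df2 = Seq(2, 3, 4).toDF("id")

// Collect to the driver and sort so the printed order is deterministic.
val result = symmetricDifference(df1, df2).as[Int].collect().sorted
println(result.mkString(", "))  // prints "1, 4"
```

Because except has set semantics, the result contains each surviving row only once, even if it was duplicated in the input.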

Answer 2:

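Why not the following? (Note that this returns only the rows of df1 that are absent from df2, not the symmetric difference.)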

df1.except(df2)

Answer 3:

Notice that EXCEPT (or MINUS, which is just an alias for EXCEPT) de-duplicates its results. So if you expect the "except" set (the diff you mentioned) plus the "intersect" set to equal the original DataFrame, consider this feature request for a variant that keeps duplicates:

https://issues.apache.org/jira/browse/SPARK-21274

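As I wrote there, "EXCEPT ALL" can be approximated in Spark SQL with a left outer join that keeps only the unmatched rows. Unlike a true EXCEPT ALL, this drops every copy of a tab1 row that has a match in tab2: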

SELECT t1.a, t1.b, t1.c
FROM tab1 t1
     LEFT OUTER JOIN
     tab2 t2
     ON (
        (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
     )
WHERE
    COALESCE(t2.a, t2.b, t2.c) IS NULL
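The three behaviours differ only when duplicates are present. The following is a rough sketch comparing them on a tiny example, assuming Spark 2.4 or later (where Dataset.exceptAll is available) and a spark-shell or script context. The column name and data are invented, and the left_anti join stands in for the outer-join-plus-filter query above.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("except-variants")
  .getOrCreate()
import spark.implicits._

// Illustrative data: "x" appears three times on the left and once on the right.
val t1 = Seq("x", "x", "x", "y", "y").toDF("a")
val t2 = Seq("x").toDF("a")

// EXCEPT: set semantics, the result is de-duplicated.
val distinctRows = t1.except(t2).as[String].collect().sorted
// EXCEPT ALL: multiset semantics, one copy of "x" is removed per copy in t2.
val multisetRows = t1.exceptAll(t2).as[String].collect().sorted
// Anti join: keeps duplicates, but drops every left row that has any match.
val antiRows = t1.join(t2, Seq("a"), "left_anti").as[String].collect().sorted

println(distinctRows.mkString(", "))  // prints "y"
println(multisetRows.mkString(", "))  // prints "x, x, y, y"
println(antiRows.mkString(", "))      // prints "y, y"
```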

Answer 4:

If you are looking for a PySpark solution, use subtract(), the PySpark counterpart of except.

Also, unionAll is deprecated as of Spark 2.0; use union() instead.

df1.union(df2).subtract(df1.intersect(df2))

Answer 5:

I think it could be more efficient to use a left join and then keep only the rows where a column that exists only in df2 is null, i.e. the rows of df1 with no match in df2.

import org.apache.spark.sql.functions.col

df1.join(df2, Seq("some_join_key", "some_other_join_key"), "left")
  .where(col("column_just_present_in_df2").isNull)
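The join above yields only one side of the difference. A possible way to extend it to both sides is sketched below, assuming Spark 2.0 or later and a spark-shell or script context. It uses the left_anti join type instead of a left join plus null filter. The key column id and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("anti-join-difference")
  .getOrCreate()
import spark.implicits._

// Illustrative data; id = 3 exists on both sides with different values.
val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val df2 = Seq((2, "b"), (3, "zzz"), (4, "d")).toDF("id", "value")

// Hypothetical key columns used to decide whether two rows "match".
val keys = Seq("id")

// A left_anti join returns only the left-hand columns of unmatched rows,
// so the two halves share a schema and can be unioned directly.
val onlyInDf1 = df1.join(df2, keys, "left_anti")
val onlyInDf2 = df2.join(df1, keys, "left_anti")
val symDiff = onlyInDf1.union(onlyInDf2)

val ids = symDiff.select("id").as[Int].collect().sorted
println(ids.mkString(", "))  // prints "1, 4"
```

This is not equivalent to the except-based versions: rows are compared on the key columns only (so id = 3 is not reported even though its value differs), duplicates are kept, and rows with a null key never match, whereas except treats nulls as equal.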

How to obtain the symmetric difference between two
