Spark: subtract two DataFrames

Posted 2019-01-07 13:17

Question:

In Spark version 1.2.0 one could use subtract with two SchemaRDDs to end up with only the content from the first one that is not in the second:

val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD)

onlyNewData contains the rows in todaySchemaRDD that do not exist in yesterdaySchemaRDD.

How can this be achieved with DataFrames in Spark version 1.3.0?

Answer 1:

According to the API docs, calling:

dataFrame1.except(dataFrame2)

will return a new DataFrame containing the rows in dataFrame1 that are not in dataFrame2.
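
A minimal sketch of how this could be used, assuming the modern SparkSession API (Spark 2.x+; in 1.3 you would build DataFrames through SQLContext instead). The DataFrame names and sample rows are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("except-example").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data: yesterday's rows plus one new row today
val yesterdayDF = Seq(("a", 1), ("b", 2)).toDF("id", "value")
val todayDF     = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("id", "value")

// except keeps only the rows of todayDF that are not present in yesterdayDF
// (it behaves like SQL EXCEPT DISTINCT, so duplicate rows are also removed)
val onlyNewData = todayDF.except(yesterdayDF)

onlyNewData.show()  // expected: the single row ("c", 3)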



Answer 2:

In the PySpark docs the equivalent method is subtract:

df1.subtract(df2)


Answer 3:

I tried subtract, but the result was not consistent. If I run df1.subtract(df2), not all rows of df1 appear in the result DataFrame, probably because of the distinct mentioned in the docs: subtract is equivalent to EXCEPT DISTINCT in SQL, so duplicate rows are dropped.

This solved my problem: df1.exceptAll(df2), which keeps duplicate rows (available since Spark 2.4).
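
A minimal sketch of the difference, assuming Spark 2.4+ (where exceptAll was introduced) and hypothetical DataFrames that contain a duplicate row:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("exceptAll-example").master("local[*]").getOrCreate()
import spark.implicits._

// df1 contains the row ("a", 1) twice; df2 contains it once
val df1 = Seq(("a", 1), ("a", 1), ("b", 2)).toDF("id", "value")
val df2 = Seq(("a", 1)).toDF("id", "value")

// except behaves like SQL EXCEPT DISTINCT: both copies of ("a", 1) are removed
df1.except(df2).show()     // only ("b", 2)

// exceptAll behaves like SQL EXCEPT ALL: rows are removed one-for-one,
// so one copy of ("a", 1) survives
df1.exceptAll(df2).show()  // ("a", 1) and ("b", 2)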