In Spark version 1.2.0 one could use subtract with two SchemaRDDs to end up with only the content that differs from the first one:

val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD)

onlyNewData contains the rows in todaySchemaRDD that do not exist in yesterdaySchemaRDD.

How can this be achieved with DataFrames in Spark version 1.3.0?
According to the API docs:

dataFrame1.except(dataFrame2)

returns a new DataFrame containing the rows of dataFrame1 that are not in dataFrame2.
In the PySpark docs the equivalent is subtract:

df1.subtract(df2)

I tried subtract, but the result was not consistent: when I run df1.subtract(df2), not all rows of df1 show up in the result DataFrame, probably because of the distinct mentioned in the docs.
This solved my problem:
df1.exceptAll(df2)
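The difference between the two comes down to set versus multiset semantics: subtract (and except) de-duplicate the left side, while exceptAll keeps repeated rows. A minimal plain-Python sketch of that distinction (not Spark itself; the helper names and sample data are made up for illustration):

```python
from collections import Counter

def except_distinct(left, right):
    """Set difference, like DataFrame.subtract / except:
    duplicates on the left collapse to a single row."""
    return sorted(set(left) - set(right))

def except_all(left, right):
    """Multiset difference, like DataFrame.exceptAll:
    duplicates on the left are preserved."""
    remaining = Counter(left) - Counter(right)
    return sorted(remaining.elements())

rows_today = ["a", "a", "b", "c"]
rows_yesterday = ["b"]

print(except_distinct(rows_today, rows_yesterday))  # ['a', 'c'] -- duplicate 'a' lost
print(except_all(rows_today, rows_yesterday))       # ['a', 'a', 'c'] -- duplicate kept
```

So if df1 contains duplicate rows that you want to keep in the output, exceptAll is the behavior you want; subtract silently drops them.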