I have two dataframes.
df1
+--------+-------------------
|id | amount | fee |
|1 | 10.00 | 5.0 |
|2 | 20.0 | 3.0 |
|3 | 90 | 130.0 |
|4 | 120.0 | 35.0 |
df2
+--------+--------------------
|exId | exAmount | exFee|
|1 | 10.00 | 5.0 |
|2 | 20.0 | 3.0 |
|3 | 20.0 | 3.0 |
|4 | 120.0 | 3.0 |
I need to perform the following operations
- Find common rows in which all three columns match for e.g. id 1,2 in the above example.
- Find common rows in which (id, exId) match but others don't i.e. for e.g. 3 & 4 in the above example. It would be useful if we identify which of the columns didn't match.
So the output would look like this
exact Match
+--------+---------------------------------------------
|id | amount | fee | exId | exAmount | exFee |
|1 | 10.00 | 5.0 | 1 | 10.00 | 5.0 |
|2 | 20.0 | 3.0 | 2 | 20.00 | 3.0 |
+--------+---------------------------------------------
non-exact match
+--------+------------------------------------------------------------
|id | amount | fee | exId | exAmount | exFee | mismatchFields|
|3 | 90.00 | 130.0 | 3 | 20.00 | 3.0 | [fee, amount]|
|4 | 120.0 | 35.0 | 4 | 120.00 | 3.0 | [fee] |
+--------+------------------------------------------------------------
Any thoughts?
Find common rows in which all three columns match for e.g. id 1,2 in the above example.
this is quite easy, you just have to check all the columns for equality while joining
df1.join(df2, df1("id") === df2("exId") && df1("amount") === df2("exAmount") && df1("fee") === df2("exFee")).show(false)
which should give you
+---+------+---+----+--------+-----+
|id |amount|fee|exId|exAmount|exFee|
+---+------+---+----+--------+-----+
|1 |10.00 |5.0|1 |10.00 |5.0 |
|2 |20.0 |3.0|2 |20.0 |3.0 |
+---+------+---+----+--------+-----+
Find common rows in which (id, exId) match but others don't i.e. for e.g. 3 & 4 in the above example. It would be useful if we identify which of the columns didn't match.
for this you have to check for equality for the first column but en-equality for the rest two columns and do some when condition to get the last column
import org.apache.spark.sql.functions._
df1.join(df2, df1("id") === df2("exId") && (df1("amount") =!= df2("exAmount") || df1("fee") =!= df2("exFee")))
.withColumn("mismatchFields", when(col("amount") =!= col("exAmount") && col("fee") =!= col("exFee"), array(lit("amount"), lit("fee"))).otherwise(
when(col("amount") === col("exAmount") && col("fee") =!= col("exFee"), array(lit("fee"))).otherwise(array(lit("amount")))
)).show(false)
which should give you
+---+------+-----+----+--------+-----+--------------+
|id |amount|fee |exId|exAmount|exFee|mismatchFields|
+---+------+-----+----+--------+-----+--------------+
|3 |90 |130.0|3 |20.0 |3.0 |[amount, fee] |
|4 |120.0 |35.0 |4 |120.0 |3.0 |[fee] |
+---+------+-----+----+--------+-----+--------------+
I hope the answer is helpful
- First do Inner join and get matching data from both dataframe based on ID.
- Create a new column match based on amount and fee
- Filter based on match column
val joinedDF = df1.join(df2,df1.col("id")===df2.col("exId"))
.withColumn("match",when(col("fee")===col("exFee") &&
col("amount")===col("exAmount"),lit(1))
.otherwise(lit(0)))
Matched Data:
val matchedDF = joinedDF.filter("match=1")
Non Matched Data:
val notMatchedDF = joinedDF.filter("match=0")
.withColumn("mismatchedFields",when(col("fee")!=col("exFee") &&
col("amount")!=col("exAmount"),array("fee","amount"))
.otherwise(when(col("fee")!=col("exFee") ,array("fee"))))