可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I have two dataframes.

df1

+--------+-------------------
|id  | amount       | fee   | 
|1   | 10.00        | 5.0   |
|2   | 20.0         | 3.0   |
|3   | 90           | 130.0 |
|4   | 120.0        | 35.0  |

df2

+--------+--------------------
|exId  | exAmount     | exFee| 
|1     | 10.00        | 5.0  |
|2     | 20.0         | 3.0  |
|3     | 20.0         | 3.0  |
|4     | 120.0        | 3.0  |

I need to perform the following operations

Find common rows in which all three columns match for e.g. id 1,2 in the above example.
Find common rows in which (id, exId) match but others don't i.e. for e.g. 3 & 4 in the above example. It would be useful if we identify which of the columns didn't match.

So the output would look like this

exact Match

+--------+---------------------------------------------
|id  | amount       | fee   | exId | exAmount | exFee | 
|1   | 10.00        | 5.0   | 1    |  10.00   | 5.0   |  
|2   | 20.0         | 3.0   | 2    |  20.00   | 3.0   |
+--------+---------------------------------------------

non-exact match

+--------+------------------------------------------------------------
|id  | amount       | fee   | exId | exAmount | exFee | mismatchFields|
|3   | 90.00        | 130.0 | 3    |  20.00   | 3.0   |  [fee, amount]|
|4   | 120.0        | 35.0  | 4    |  120.00  | 3.0   |  [fee]        |
+--------+------------------------------------------------------------

Any thoughts?

回答1:

Find common rows in which all three columns match for e.g. id 1,2 in the above example.

this is quite easy, you just have to check all the columns for equality while joining

df1.join(df2, df1("id") === df2("exId") && df1("amount") === df2("exAmount") && df1("fee") === df2("exFee")).show(false)

which should give you

+---+------+---+----+--------+-----+
|id |amount|fee|exId|exAmount|exFee|
+---+------+---+----+--------+-----+
|1  |10.00 |5.0|1   |10.00   |5.0  |
|2  |20.0  |3.0|2   |20.0    |3.0  |
+---+------+---+----+--------+-----+

Find common rows in which (id, exId) match but others don't i.e. for e.g. 3 & 4 in the above example. It would be useful if we identify which of the columns didn't match.

for this you have to check for equality for the first column but en-equality for the rest two columns and do some when condition to get the last column

import org.apache.spark.sql.functions._
df1.join(df2, df1("id") === df2("exId") && (df1("amount") =!= df2("exAmount") || df1("fee") =!= df2("exFee")))
.withColumn("mismatchFields", when(col("amount") =!= col("exAmount") && col("fee") =!= col("exFee"), array(lit("amount"), lit("fee"))).otherwise(
  when(col("amount") === col("exAmount") && col("fee") =!= col("exFee"), array(lit("fee"))).otherwise(array(lit("amount")))
)).show(false)

which should give you

+---+------+-----+----+--------+-----+--------------+
|id |amount|fee  |exId|exAmount|exFee|mismatchFields|
+---+------+-----+----+--------+-----+--------------+
|3  |90    |130.0|3   |20.0    |3.0  |[amount, fee] |
|4  |120.0 |35.0 |4   |120.0   |3.0  |[fee]         |
+---+------+-----+----+--------+-----+--------------+

I hope the answer is helpful

回答2:

First do Inner join and get matching data from both dataframe based on ID.
Create a new column match based on amount and fee
Filter based on match column

val joinedDF = df1.join(df2,df1.col("id")===df2.col("exId")) .withColumn("match",when(col("fee")===col("exFee") && col("amount")===col("exAmount"),lit(1)) .otherwise(lit(0)))

Matched Data:

val matchedDF = joinedDF.filter("match=1")

Non Matched Data:

val notMatchedDF = joinedDF.filter("match=0")
                        .withColumn("mismatchedFields",when(col("fee")!=col("exFee") &&
                                       col("amount")!=col("exAmount"),array("fee","amount"))
                                      .otherwise(when(col("fee")!=col("exFee") ,array("fee"))))