how to check differences in rows belonging to two

Posted 2019-09-20 12:59

Question:

I have two data frames that represent two different points in time for the same people. For each row, I'd like to know whether any of the 5 (fixed) var columns changed between the two data frames.

Before:

+--+------+------+------+------+------+------+
|id| sport|  var1|  var2|  var3|  var4|  var5|
+--+------+------+------+------+------+------+
| 1|soccer|330234|      |      |      |      |
| 2|soccer|  null|  null|  null|  null|  null|
| 3|soccer|330101|      |      |      |      |
| 4|soccer|  null|  null|  null|  null|  null|
| 5|soccer|  null|  null|  null|  null|  null|
| 6|soccer|  null|  null|  null|  null|  null|
| 7|soccer|  null|  null|  null|  null|  null|
| 8|soccer|330024|330401|      |      |      |
| 9|soccer|330055|330106|      |      |      |
|10|soccer|  null|  null|  null|  null|  null|
|11|soccer|390027|      |      |      |      |
|12|soccer|  null|  null|  null|  null|  null|
|13|soccer|330101|      |      |      |      |
|14|soccer|330059|      |      |      |      |
|15|soccer|  null|  null|  null|  null|  null|
|16|soccer|140242|140281|      |      |      |
|17|soccer|330214|      |      |      |      |
|18|soccer|      |      |      |      |      |
|19|soccer|330055|330196|      |      |      |
|20|soccer|210022|      |      |      |      |
+--+------+------+------+------+------+------+

After:

+--+------+------+------+------+------+------+
|id| sport|  var1|  var2|  var3|  var4|  var5|
+--+------+------+------+------+------+------+
| 1|soccer|330234|      |      |      |      |
| 2|soccer|  null|  null|  null|  null|  null|
| 3|soccer|330101|      |      |      |      |
| 4|soccer|  null|  null|  null|  null|  null|
| 5|soccer|  null|  null|  null|  null|  null|
| 6|soccer|  null|  null|  null|  null|  null|
| 7|soccer|  null|  null|  null|  null|  null|
| 8|soccer|  null|  null|  null|  null|  null|
| 9|soccer|330106|      |      |      |      |
|10|soccer|  null|  null|  null|  null|  null|
|11|soccer|390027|      |      |      |      |
|12|soccer|  null|  null|  null|  null|  null|
|13|soccer|  null|  null|  null|  null|  null|
|14|soccer|330128|330331|330106|330059|      |
|15|soccer|  null|  null|  null|  null|  null|
|16|soccer|140242|140281|140010|      |      |
|17|soccer|330214|      |      |      |      |
|18|soccer|  null|  null|  null|  null|  null|
|19|soccer|330196|      |      |      |      |
|20|soccer|210022|      |      |      |      |
+--+------+------+------+------+------+------+

I know how to scan for differences between columns within a single row, but I am pretty clueless about how to compare rows across two different data frames.

An ideal output would be:

+--+------+------+
|id| sport|  diff|
+--+------+------+
| 1|soccer|     0|
| 2|soccer|     0|
| 3|soccer|     0|
| 4|soccer|     0|
| 5|soccer|     0|
| 6|soccer|     0|
| 7|soccer|     0|
| 8|soccer|     1|
| 9|soccer|     1|
|10|soccer|     0|
|11|soccer|     0|
|12|soccer|     0|
|13|soccer|     1|
|14|soccer|     1|
|15|soccer|     0|
|16|soccer|     1|
|17|soccer|     0|
|18|soccer|     0|
|19|soccer|     1|
|20|soccer|     0|
+--+------+------+

Answer 1:

Do you mean something like this? Let's start with some example data:

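// toDF and the $"..." syntax need the session implicits (spark-shell imports them already)
import spark.implicits._

val before = Seq(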
  (1, "soccer", Some(1), Some(2), Some(3), Some(4), None),
  (2, "soccer", None,    Some(0), None,    None,    Some(0)),
  (3, "soccer", None,    None,    None,    None,    None)
).toDF("id", "sport", "var1", "var2", "var3", "var4", "var5")

val after = Seq(
  (1, "soccer", Some(1), Some(2), Some(3), Some(4), None), // Zero diffs
  (2, "soccer", Some(1), Some(0), None,    None,    Some(0)), // One diff
  (3, "soccer", Some(1), Some(1), Some(1), Some(1), Some(1)) // Five diffs
).toDF("id", "sport", "var1", "var2", "var3", "var4", "var5")

Generate an expression which counts differences:

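import org.apache.spark.sql.functions.{col, lit, not}

// Extract the var columns (everything after id and sport)
val varCols = before.columns.drop(2)

// For each var column, generate an expression of the form
// CAST(NOT(before.varN <=> after.varN) AS INT): 1 if the values differ, 0 otherwise
val equalsExprs = varCols.map(
  c => not(col(s"before.$c") <=> col(s"after.$c")).cast("int").alias(s"${c}_ne"))

// Sum the per-column indicators into a single diff count
val diff = equalsExprs.foldLeft(lit(0))(_ + _).alias("diff")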

The null-safe equality operator <=> treats:

  • two NULLs as equal
  • any value and NULL as not equal
  • two non-NULL values according to standard equality for their type
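To make these three rules concrete, here is a minimal plain-Scala sketch that models them without Spark. It assumes `None` stands in for NULL. The names `nullSafeEq` and `countDiffs` are hypothetical helpers made up for this illustration and are not part of any Spark API.

```scala
// Plain-Scala model of null-safe equality, with None standing in for NULL.
// nullSafeEq and countDiffs are hypothetical helpers for illustration only.
def nullSafeEq[A](a: Option[A], b: Option[A]): Boolean = (a, b) match {
  case (None, None)       => true   // two NULLs are equal
  case (Some(x), Some(y)) => x == y // two non-NULL values: ordinary equality
  case _                  => false  // a value and NULL are not equal
}

// Count the positions at which two rows of nullable values differ.
def countDiffs[A](before: Seq[Option[A]], after: Seq[Option[A]]): Int =
  before.zip(after).count { case (b, a) => !nullSafeEq(b, a) }

// The var1..var5 values of id 2 from the example data above.
val row2Before = Seq(None, Some(0), None, None, Some(0))
val row2After  = Seq(Some(1), Some(0), None, None, Some(0))

// The var1..var5 values of id 3 from the example data above.
val row3Before = Seq.fill(5)(Option.empty[Int])
val row3After  = Seq.fill(5)(Option(1))

println(countDiffs(row2Before, row2After)) // only var1 changed
println(countDiffs(row3Before, row3After)) // every column changed
```

Pasted into a Scala REPL, this should report one difference for id 2 and five for id 3, matching the comments in the example data.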

Join and select the expression:

val diffs = before.as("before").join(after.as("after"), Seq("id", "sport"))
  .select($"id", $"sport", diff)

diffs.show

// +---+------+----+
// | id| sport|diff|
// +---+------+----+
// |  1|soccer|   0|
// |  2|soccer|   1|
// |  3|soccer|   5|
// +---+------+----+
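The expected output in the question differs from this result in two ways. It uses a 0/1 flag instead of a count, and it reports no change for id 18, where the columns go from empty strings to null. If that is the intent, one possible variation is to normalize empty strings to NULL before comparing and then collapse the count into a flag. The sketch below assumes string-typed var columns and a local SparkSession. The helper `norm` and the reduced two-column sample data are hypothetical, chosen only to mirror ids 1, 9 and 18 from the question.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit, not, trim, when}

// Assumption: a local session; in spark-shell the existing session is reused.
val spark = SparkSession.builder().master("local[*]").appName("row-diff-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample mirroring ids 1, 9 and 18 from the question (two var columns only).
val before = Seq[(Int, String, Option[String], Option[String])](
  (1,  "soccer", Some("330234"), Some("")),
  (9,  "soccer", Some("330055"), Some("330106")),
  (18, "soccer", Some(""),       Some(""))
).toDF("id", "sport", "var1", "var2")

val after = Seq[(Int, String, Option[String], Option[String])](
  (1,  "soccer", Some("330234"), Some("")),
  (9,  "soccer", Some("330106"), Some("")),
  (18, "soccer", None,           None)
).toDF("id", "sport", "var1", "var2")

// Hypothetical helper: map empty or whitespace-only strings to NULL,
// so that "" and NULL compare as equal under <=>.
def norm(c: Column): Column = when(trim(c) === "", lit(null)).otherwise(c)

val varCols = before.columns.drop(2)

// 1 if the normalized values differ, 0 otherwise, for each var column
val neExprs = varCols.map(
  c => not(norm(col(s"before.$c")) <=> norm(col(s"after.$c"))).cast("int"))

// Sum the indicators, then collapse the count into a 0/1 flag
val diffCount = neExprs.foldLeft(lit(0))(_ + _)
val diffFlag = (diffCount > 0).cast("int").alias("diff")

val flags = before.as("before").join(after.as("after"), Seq("id", "sport"))
  .select($"id", $"sport", diffFlag)
  .orderBy("id")

flags.show()
```

With this sample, ids 1 and 18 should come out with diff 0 and id 9 with diff 1. Note that the inner join drops any id present in only one of the two data frames. Whether such rows should count as changed is a separate decision the question does not address.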