Subtract two columns with null in Spark DataFrame

Posted 2020-03-30 01:41

Question:

I am new to Spark. I have a DataFrame df:

+----------+------------+-----------+
| Column1  | Column2    | Sub       |                          
+----------+------------+-----------+
| 1        | 2          | 1         |                                         
+----------+------------+-----------+
| 4        | null       | null      |                          
+----------+------------+-----------+
| 5        | null       | null      |                          
+----------+------------+-----------+
| 6        | 8          | 2         |                          
+----------+------------+-----------+

When I subtract the two columns, a null in either column makes the result null as well.

df.withColumn("Sub", col("Column1") - col("Column2"))

Expected output should be:

+----------+------------+-----------+
| Column1  | Column2    | Sub       |
+----------+------------+-----------+
| 1        | 2          | 1         |                                           
+----------+------------+-----------+
| 4        | null       | 4         |                          
+----------+------------+-----------+
| 5        | null       | 5         |                          
+----------+------------+-----------+
| 6        | 8          | 2         |                          
+----------+------------+-----------+

I don't want to replace the nulls in Column2 with 0; Column2 itself should stay null. Can someone help me with this?

Answer 1:

You can use the when function to substitute 0 for null inside the expression only:

import org.apache.spark.sql.functions._
df.withColumn("Sub",
  when(col("Column1").isNull, lit(0)).otherwise(col("Column1")) -
  when(col("Column2").isNull, lit(0)).otherwise(col("Column2")))

You should get the following result:

+-------+-------+----+
|Column1|Column2| Sub|
+-------+-------+----+
|      1|      2|-1.0|
|      4|   null| 4.0|
|      5|   null| 5.0|
|      6|      8|-2.0|
+-------+-------+----+
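
The result above keeps the sign of the difference (-1 and -2), while the question expects 1 and 2. The following is a minimal sketch, not a definitive implementation, that wraps the same when expression in abs. It assumes integer columns and a throwaway local SparkSession; the app name and the helper values c1, c2 and result are illustrative names that do not come from the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{abs, col, lit, when}

// Assumption: a local session for illustration; in spark-shell, reuse the existing `spark`.
val spark = SparkSession.builder().master("local[1]").appName("when-sub-sketch").getOrCreate()
import spark.implicits._

// Sample data mirroring the question; Option[Int] models a nullable integer column.
val df = Seq[(Option[Int], Option[Int])](
  (Some(1), Some(2)),
  (Some(4), None),
  (Some(5), None),
  (Some(6), Some(8))
).toDF("Column1", "Column2")

// Substitute 0 for null only inside the expression; the source columns are not modified.
val c1 = when(col("Column1").isNull, lit(0)).otherwise(col("Column1"))
val c2 = when(col("Column2").isNull, lit(0)).otherwise(col("Column2"))

// abs drops the sign of the difference.
val result = df.withColumn("Sub", abs(c1 - c2))
result.show()
```

Because the substitution happens inside the Sub expression, Column2 still shows null in the rows where it was null to begin with.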


Answer 2:

You can coalesce nulls to zero on both columns and then do the subtraction:

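// None (rather than a bare null) keeps each field typed as Option[Int]
val df = Seq[(Option[Int], Option[Int])](
  (Some(1), Some(2)),
  (Some(4), None),
  (Some(5), None),
  (Some(6), Some(8))
).toDF("A", "B")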

df.withColumn("Sub", abs(coalesce($"A", lit(0)) - coalesce($"B", lit(0)))).show
+---+----+---+
|  A|   B|Sub|
+---+----+---+
|  1|   2|  1|
|  4|null|  4|
|  5|null|  5|
|  6|   8|  2|
+---+----+---+
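
The question asks that the second column keep its nulls. As a hedged sketch, the check below reruns the coalesce approach and counts the rows where B is still null afterwards. It assumes a throwaway local SparkSession; the app name and the values result and nullsInB are illustrative names that are not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{abs, coalesce, col, lit}

// Assumption: a local session for illustration; in spark-shell, reuse the existing `spark`.
val spark = SparkSession.builder().master("local[1]").appName("coalesce-sub-sketch").getOrCreate()
import spark.implicits._

// Same sample data as in the answer, with nullable integers modelled as Option[Int].
val df = Seq[(Option[Int], Option[Int])](
  (Some(1), Some(2)),
  (Some(4), None),
  (Some(5), None),
  (Some(6), Some(8))
).toDF("A", "B")

// coalesce returns its first non-null argument, so 0 is used only where A or B is null.
val result = df.withColumn("Sub", abs(coalesce(col("A"), lit(0)) - coalesce(col("B"), lit(0))))

// withColumn only adds Sub; the nulls in B are left in place.
val nullsInB = result.filter(col("B").isNull).count()
println(s"rows where B is still null: $nullsInB")
```

If B should be displayed with a default value somewhere else, that can be done in a separate projection without touching this DataFrame.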