The issues:
1) Spark doesn't call UDF if input is column of primitive type that contains null
:
inputDF.show()
+-----+
| x |
+-----+
| null|
| 1.0|
+-----+
inputDF
.withColumn("y",
udf { (x: Double) => 2.0 }.apply($"x") // will not be invoked if $"x" == null
)
.show()
+-----+-----+
| x | y |
+-----+-----+
| null| null|
| 1.0| 2.0|
+-----+-----+
2) Can't produce null
from UDF as a column of primitive type:
udf { (x: String) => null: Double } // compile error
Accordingly to the docs:
Note that if you use primitive parameters, you are not able to check
if it is null or not, and the UDF will return null for you if the
primitive input is null. Use boxed type or [[Option]] if you wanna do
the null-handling yourself.
So, the easiest solution is just to use
boxed types if your UDF input is nullable column of primitive type
OR/AND you need to output null from UDF as a column of primitive type:
inputDF
.withColumn("y",
udf { (x: java.lang.Double) =>
(if (x == null) 1 else null): java.lang.Integer
}.apply($"x")
)
.show()
+-----+-----+
| x | y |
+-----+-----+
| null| null|
| 1.0| 2.0|
+-----+-----+
I would also use Artur's solution, but there is also another way without using javas wrapper classes by using struct
:
import org.apache.spark.sql.functions.struct
import org.apache.spark.sql.Row
inputDF
.withColumn("y",
udf { (r: Row) =>
if (r.isNullAt(0)) Some(1) else None
}.apply(struct($"x"))
)
.show()
Based on the solution provided at SparkSQL: How to deal with null values in user defined function? by @zero323, an alternative way to achieve the requested result is:
import scala.util.Try
val udfHandlingNulls udf((x: Double) => Try(2.0).toOption)
inputDF.withColumn("y", udfHandlingNulls($"x")).show()