The issues:
1) Spark doesn't call a UDF if its input is a column of primitive type that contains null:
    inputDF.show()
    +-----+
    |    x|
    +-----+
    | null|
    |  1.0|
    +-----+

    inputDF
      .withColumn("y",
        udf { (x: Double) => 2.0 }.apply($"x") // will not be invoked if $"x" == null
      )
      .show()
    +-----+-----+
    |    x|    y|
    +-----+-----+
    | null| null|
    |  1.0|  2.0|
    +-----+-----+
2) Can't produce null from a UDF as a column of primitive type:

    udf { (x: String) => null: Double } // compile error
Based on the solution provided by @zero323 at "SparkSQL: How to deal with null values in user defined function?", an alternative way to achieve the requested result is:
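A minimal sketch of one such alternative, assuming the approach meant here is returning an `Option` from the UDF so that `None` becomes null in the output column. The names `OptionUdfSketch` and `doubleOrNull` are made up for illustration, and the input is boxed so that the function is also called for null rows:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object OptionUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("option-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    // Nullable double column with one null row and one non-null row.
    val inputDF = Seq[Option[Double]](None, Some(1.0)).map(Tuple1(_)).toDF("x")

    // Boxed input (java.lang.Double) so null reaches the function;
    // Option output so None is written as null in the result column.
    val doubleOrNull = udf { (x: java.lang.Double) =>
      Option(x).map(_.doubleValue * 2.0)
    }

    inputDF.withColumn("y", doubleOrNull($"x")).show()
    spark.stop()
  }
}
```

`Option(x)` turns a null reference into `None`, so the null check and the null result are both expressed without writing `null` explicitly.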
According to the docs:
So, the easiest solution is to use boxed types if your UDF input is a nullable column of primitive type and/or you need to output null from the UDF as a column of primitive type:
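A rough sketch of the boxed-type approach, reusing the null and 1.0 rows from the example above. The names `BoxedUdfSketch`, `twoOrDefault` and `alwaysNull` are hypothetical, and the `-1.0` default is only there to make it visible that the function ran on the null row:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object BoxedUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("boxed-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    // Nullable double column with one null row and one non-null row.
    val inputDF = Seq[Option[Double]](None, Some(1.0)).map(Tuple1(_)).toDF("x")

    // Boxed input: the function is called even when x is null.
    val twoOrDefault = udf { (x: java.lang.Double) =>
      if (x == null) -1.0 else 2.0
    }

    // Boxed output: null is a legal return value, unlike with scala.Double.
    val alwaysNull = udf { (x: java.lang.Double) => null: java.lang.Double }

    inputDF
      .withColumn("y", twoOrDefault($"x"))
      .withColumn("z", alwaysNull($"x"))
      .show()
    spark.stop()
  }
}
```

The only change from the failing versions is the parameter or return type: `java.lang.Double` is a reference type, so it can hold null where `scala.Double` cannot.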
I would also use Artur's solution, but there is another way that avoids Java's wrapper classes, by using struct:
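A sketch of how the struct variant might look, assuming the column is wrapped with `struct` and the UDF takes a `Row`. The names `StructUdfSketch` and `viaStruct` are invented for illustration:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{struct, udf}

object StructUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("struct-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    // Nullable double column with one null row and one non-null row.
    val inputDF = Seq[Option[Double]](None, Some(1.0)).map(Tuple1(_)).toDF("x")

    // The UDF receives a Row, which is never a primitive, so it is
    // called for every row and can inspect nulls with isNullAt.
    val viaStruct = udf { (r: Row) =>
      if (r.isNullAt(0)) Option.empty[Double] // null in, null out (our choice)
      else Some(r.getDouble(0) * 2.0)
    }

    inputDF.withColumn("y", viaStruct(struct($"x"))).show()
    spark.stop()
  }
}
```

Because the struct itself is non-null even when the field inside it is null, the null check moves into the function body, where any default or `None` can be returned.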