How to deal with Spark UDF input/output of primitive type

Posted 2020-03-05 02:41

The issues:

1) Spark does not invoke a UDF when its input is a column of primitive type that contains null:

inputDF.show()

+-----+
|  x  |
+-----+
| null|
|  1.0|
+-----+

inputDF
  .withColumn("y",
     udf { (x: Double) => 2.0 }.apply($"x") // will not be invoked if $"x" == null
  )
  .show()

+-----+-----+
|  x  |  y  |
+-----+-----+
| null| null|
|  1.0|  2.0|
+-----+-----+

2) A UDF cannot produce null for a result column of primitive type:

udf { (x: String) => null: Double } // compile error
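The type-level reason can be sketched in plain Scala, without Spark: `Double` is a primitive and has no null value, while `Option[Double]` can represent absence; per the Spark docs, an `Option`-returning function wrapped with `udf(...)` writes `None` out as a SQL null (the names below are illustrative):

```scala
// null is not a valid value of the primitive type Double:
// val bad: String => Double = _ => null   // does not compile

// An Option-returning function compiles; wrapped with Spark's udf(...),
// None becomes a SQL null in the result column.
val nullableResult: String => Option[Double] = _ => None

println(nullableResult("anything"))  // None
```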

3 Answers
Answer from 爷的心禁止访问 · 2020-03-05 03:15

Based on the solution provided at SparkSQL: How to deal with null values in user defined function? by @zero323, an alternative way to achieve the requested result is:

import scala.util.Try
val udfHandlingNulls = udf((x: Double) => Try(2.0).toOption)
inputDF.withColumn("y", udfHandlingNulls($"x")).show()
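The core of this answer, isolated from Spark so it can be checked directly (the function name here is illustrative):

```scala
import scala.util.Try

// Try(...).toOption yields Some(value) on success and None on an
// exception; Spark renders None as a SQL null in the result column.
val handled: Double => Option[Double] = _ => Try(2.0).toOption

println(handled(1.0))  // Some(2.0)
```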
Answer from 我只想做你的唯一 · 2020-03-05 03:18

According to the docs:

Note that if you use primitive parameters, you are not able to check if it is null or not, and the UDF will return null for you if the primitive input is null. Use boxed type or [[Option]] if you wanna do the null-handling yourself.


So the easiest solution is to use boxed types whenever your UDF input is a nullable column of primitive type and/or you need to output null from the UDF as a column of primitive type:

inputDF
  .withColumn("y",
     udf { (x: java.lang.Double) => 
       (if (x == null) 1 else null): java.lang.Integer
     }.apply($"x")
  )
  .show()

+-----+-----+
|  x  |  y  |
+-----+-----+
| null| null|
|  1.0|  2.0|
+-----+-----+
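The boxed-type lambda from the snippet above can be exercised as plain Scala, with no SparkSession (a minimal sketch):

```scala
// java.lang.Double / java.lang.Integer are reference types, so they can
// both receive and produce null, unlike the primitives Double and Int.
val boxed: java.lang.Double => java.lang.Integer =
  x => if (x == null) java.lang.Integer.valueOf(1) else null

println(boxed(null))                          // 1
println(boxed(java.lang.Double.valueOf(1.0))) // null
```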
Answer from 我欲成王，谁敢阻挡 · 2020-03-05 03:33

I would also use Artur's solution, but there is another way, avoiding Java's wrapper classes, by using struct (note that, like the answer above, this example returns 1 when x is null and null otherwise):

import org.apache.spark.sql.functions.struct
import org.apache.spark.sql.Row

inputDF
  .withColumn("y",
     udf { (r: Row) => 
       if (r.isNullAt(0)) Some(1) else None
     }.apply(struct($"x"))
  )
  .show()