How to filter nullable Array-Elements in Spark 1.6

Posted 2019-03-03 18:40

Question:

Consider a DataFrame with the following schema:

root
 |-- values: array (nullable = true)
 |    |-- element: double (containsNull = true)

and the following content:

+-----------+
|     values|
+-----------+
|[1.0, null]|
+-----------+

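For reference, a DataFrame like this can be built as follows (a minimal sketch for the Spark 1.6 shell, where sc and sqlContext are predefined):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, DoubleType, StructField, StructType}
import sqlContext.implicits._

val schema = StructType(Seq(
  StructField("values", ArrayType(DoubleType, containsNull = true))))

val df = sqlContext.createDataFrame(
  sc.parallelize(Seq(Row(Seq(1.0, null)))),
  schema)
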
Now I want to pass the values column to a UDF:

import org.apache.spark.sql.functions.udf

val inspect = udf((data: Seq[Double]) => {
  data.foreach(println)                  // element passed as Any
  println()
  data.foreach(d => println(d))          // element bound as Double
  println()
  data.foreach(d => println(d == null))
  ""
})

df.withColumn("dummy",inspect($"values"))

I'm really confused by the output of the above println statements:

1.0
null

1.0
0.0

false
false

My questions:

  1. Why does foreach(println) not give the same output as foreach(d => println(d))?
  2. How can the Double be null in the first println statement? I thought Scala's Double cannot be null.
  3. How can I filter null values in my Seq other than filtering out 0.0, which isn't really safe? Should I use Seq[java.lang.Double] as the input type in the UDF and then filter nulls? (This works, but I'm unsure if that is the way to go.)

Note that I'm aware of this Question, but my question is specific to array-type columns.

Answer 1:

Why does foreach(println) not give the same output as foreach(d => println(d))?

In a context where Any is expected, the cast to Double is skipped completely, so the raw value (including null) is printed. This is explained in detail in If an Int can't be null, what does null.asInstanceOf[Int] mean?
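
A minimal plain-Scala sketch (no Spark required) reproduces the same three outputs; the erasure cast is only there to smuggle a null into a Seq[Double] for demonstration:

val data = Seq[Any](1.0, null).asInstanceOf[Seq[Double]]

data.foreach(println)                  // println expects Any, no cast: 1.0, null
data.foreach(d => println(d))          // d is bound as Double, null unboxes to 0.0: 1.0, 0.0
data.foreach(d => println(d == null))  // a Double is never null: false, false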

How can the Double be null in the first println statement? I thought Scala's Double cannot be null.

The internal binary representation doesn't use Scala types at all. Once array data is decoded, it is represented as an Array[Any], and the elements are coerced to the declared type with a simple asInstanceOf; for a null element, that coercion unboxes to 0.0.
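
Roughly speaking, the coercion amounts to something like this (a sketch of the observable behaviour, not the actual Spark internals):

val decoded: Array[Any] = Array(1.0, null)          // decoded array data may hold nulls
val coerced = decoded.map(_.asInstanceOf[Double])   // null.asInstanceOf[Double] unboxes to 0.0
println(coerced.mkString(", "))                     // 1.0, 0.0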

Should I use Seq[java.lang.Double] as the input type in the UDF and then filter nulls?

In general, if the values are nullable, you should use an external type that is nullable as well (such as java.lang.Double) or Option. Unfortunately, only the former works for UDF input parameters.
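
A filtering UDF along those lines could look like this (a sketch; filterNulls and values_clean are just illustrative names, and the values column itself is assumed to be non-null):

import org.apache.spark.sql.functions.udf

// Declare the input as Seq[java.lang.Double] so null elements survive the conversion,
// then drop them and unbox back to Scala Doubles.
val filterNulls = udf((xs: Seq[java.lang.Double]) =>
  xs.filter(_ != null).map(_.doubleValue))

df.withColumn("values_clean", filterNulls($"values")).show(false)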