I have a Spark data frame where one column is an array of integers. The column is nullable because it is coming from a left outer join. I want to convert all null values to an empty array so I don't have to deal with nulls later.
I thought I could do it like so:
val myCol = df("myCol")
df.withColumn( "myCol", when(myCol.isNull, Array[Int]()).otherwise(myCol) )
However, this results in the following exception:
java.lang.RuntimeException: Unsupported literal type class [I [I@5ed25612
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:49)
at org.apache.spark.sql.functions$.lit(functions.scala:89)
at org.apache.spark.sql.functions$.when(functions.scala:778)
Apparently array types are not supported by the when
function. Is there some other easy way to convert the null values?
In case it is relevant, here is the schema for this column:
|-- myCol: array (nullable = true)
| |-- element: integer (containsNull = false)
You can use an UDF:
import org.apache.spark.sql.functions.udf
val array_ = udf(() => Array.empty[Int])
combined with WHEN
or COALESCE
:
df.withColumn("myCol", when(myCol.isNull, array_()).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array_())).show
In the recent versions you can use array
function:
import org.apache.spark.sql.functions.{array, lit}
df.withColumn("myCol", when(myCol.isNull, array().cast("array<integer>")).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array().cast("array<integer>"))).show
Please note that it will work only if conversion from string
to the desired type is allowed.
The same thing can be of course done in PySpark as well. For the legacy solutions you can define udf
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType
def empty_array(t):
return udf(lambda: [], ArrayType(t()))()
coalesce(myCol, empty_array(IntegerType()))
and in the recent versions just use array
:
from pyspark.sql.functions import array
coalesce(myCol, array().cast("array<integer>"))
With a slight modification to zero323's approach, I was able to do this without using a udf in Spark 2.3.1.
val df = Seq("a" -> Array(1,2,3), "b" -> null, "c" -> Array(7,8,9)).toDF("id","numbers")
df.show
+---+---------+
| id| numbers|
+---+---------+
| a|[1, 2, 3]|
| b| null|
| c|[7, 8, 9]|
+---+---------+
val df2 = df.withColumn("numbers", coalesce($"numbers", array()))
df2.show
+---+---------+
| id| numbers|
+---+---------+
| a|[1, 2, 3]|
| b| []|
| c|[7, 8, 9]|
+---+---------+
An UDF-free alternative to use when the data type you want your array elements in can not be cast from StringType
is the following:
import pyspark.sql.types as T
import pyspark.sql.functions as F
df.withColumn(
"myCol",
F.coalesce(
F.col("myCol"),
F.from_json(F.lit("[]"), T.ArrayType(T.IntegerType()))
)
)
You can replace IntegerType()
with whichever data type, also complex ones.