I have a Spark DataFrame where one column is an array of integers. The column is nullable because it comes from a left outer join. I want to convert all null values to an empty array so I don't have to deal with nulls later.
I thought I could do it like so:
val myCol = df("myCol")
df.withColumn( "myCol", when(myCol.isNull, Array[Int]()).otherwise(myCol) )
However, this results in the following exception:
java.lang.RuntimeException: Unsupported literal type class [I [I@5ed25612
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:49)
at org.apache.spark.sql.functions$.lit(functions.scala:89)
at org.apache.spark.sql.functions$.when(functions.scala:778)
Apparently array types are not supported by the when function. Is there some other easy way to convert the null values?
In case it is relevant, here is the schema for this column:
|-- myCol: array (nullable = true)
| |-- element: integer (containsNull = false)
You can use a UDF:
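For example, a zero-argument udf that returns an empty integer array (the emptyIntArray name below is just illustrative):

import org.apache.spark.sql.functions.udf

// returns an empty Array[Int]; Spark infers the array<int> return type
val emptyIntArray = udf(() => Array.empty[Int])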
combined with WHEN or COALESCE:
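For instance, using the emptyIntArray udf sketched above and the myCol value from the question:

import org.apache.spark.sql.functions.{coalesce, when}

// replace nulls via when/otherwise
df.withColumn("myCol", when(myCol.isNull, emptyIntArray()).otherwise(myCol))

// or, more compactly, via coalesce
df.withColumn("myCol", coalesce(myCol, emptyIntArray()))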
In the recent versions you can use the array function:
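Something along these lines (a sketch; array() builds an empty array literal that still needs a cast to the desired element type):

import org.apache.spark.sql.functions.{array, coalesce, when}

// cast the empty array literal to array<integer>
df.withColumn("myCol", when(myCol.isNull, array().cast("array<integer>")).otherwise(myCol))

// or
df.withColumn("myCol", coalesce(myCol, array().cast("array<integer>")))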
Please note that it will work only if conversion from string to the desired type is allowed.

The same thing can of course be done in PySpark as well. For the legacy solutions you can define a udf, and in the recent versions just use array:
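For example (the empty_int_array name is just illustrative):

from pyspark.sql.functions import array, coalesce, udf
from pyspark.sql.types import ArrayType, IntegerType

# legacy: a udf that returns an empty array of the desired element type
empty_int_array = udf(lambda: [], ArrayType(IntegerType()))
df.withColumn("myCol", coalesce(df["myCol"], empty_int_array()))

# recent versions: an empty array literal cast to the desired type
df.withColumn("myCol", coalesce(df["myCol"], array().cast("array<integer>")))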
With a slight modification to zero323's approach, I was able to do this without using a udf in Spark 2.3.1.
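Something like the following (a sketch of that approach, not necessarily the exact snippet):

import org.apache.spark.sql.functions.{array, when}

// no udf: cast an empty array literal to the column's element type
df.withColumn("myCol", when(df("myCol").isNull, array().cast("array<integer>")).otherwise(df("myCol")))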
A UDF-free alternative to use when the data type you want your array elements in cannot be cast from StringType is the following:
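For instance, from_json can produce an empty array directly in the target type (a sketch of one such approach):

from pyspark.sql.functions import coalesce, from_json, lit
from pyspark.sql.types import ArrayType, IntegerType

# parse an empty JSON array straight into array<int>, so no cast from string is involved
df.withColumn("myCol", coalesce(df["myCol"], from_json(lit("[]"), ArrayType(IntegerType()))))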
You can replace IntegerType() with whichever data type, also complex ones.