I have a DataFrame that I am trying to flatten. As part of the process, I want to explode it, so if I have a column of arrays, each value of the array will be used to create a separate row. For instance,
id | name | likes
_______________________________
1 | Luke | [baseball, soccer]
should become
id | name | likes
_______________________________
1 | Luke | baseball
1 | Luke | soccer
This is my code:

private DataFrame explodeDataFrame(DataFrame df) {
    DataFrame resultDf = df;
    for (StructField field : df.schema().fields()) {
        if (field.dataType() instanceof ArrayType) {
            resultDf = resultDf.withColumn(field.name(), org.apache.spark.sql.functions.explode(resultDf.col(field.name())));
            resultDf.show();
        }
    }
    return resultDf;
}
The problem is that in my data, some of the array columns have nulls. In that case, the entire row is deleted. So this DataFrame:
id | name | likes
_______________________________
1 | Luke | [baseball, soccer]
2 | Lucy | null
becomes
id | name | likes
_______________________________
1 | Luke | baseball
1 | Luke | soccer
instead of
id | name | likes
_______________________________
1 | Luke | baseball
1 | Luke | soccer
2 | Lucy | null
How can I explode my arrays so that I don't lose the null rows?
I am using Spark 1.5.2 and Java 8.
Following up on the accepted answer: when the array elements are a complex type, it can be difficult to define the schema by hand (e.g. with large structs).
To do it automatically I wrote the following helper method:
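(The original helper isn't reproduced here; the following is a minimal sketch of the approach in Scala. The explodeOuter name, the columnsToExplode parameter, and the use of when/size/otherwise are assumptions, not the answerer's exact code.)

import org.apache.spark.sql.{DataFrame, functions => F}
import org.apache.spark.sql.types.ArrayType

// Explode the given array columns without dropping rows whose arrays are
// null or empty, by substituting a single-element array holding a typed null.
def explodeOuter(df: DataFrame, columnsToExplode: List[String]): DataFrame = {
  // Look up each array column's element type from the schema automatically.
  val arrayFields = df.schema.fields
    .collect { case f if f.dataType.isInstanceOf[ArrayType] =>
      f.name -> f.dataType.asInstanceOf[ArrayType]
    }
    .toMap

  columnsToExplode.foldLeft(df) { (acc, colName) =>
    val elementType = arrayFields(colName).elementType
    acc.withColumn(colName, F.explode(
      F.when(F.size(F.col(colName)) > 0, F.col(colName))
        // null or empty array -> array(null) cast to the element type
        .otherwise(F.array(F.lit(null).cast(elementType)))))
  }
}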
Edit: it seems that Spark 2.2 and newer have this built in.
Spark 2.2+
You can use the explode_outer function:
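For example (a sketch using the question's DataFrame; the result is shown as comments):

import org.apache.spark.sql.functions.{col, explode_outer}

// explode_outer keeps rows whose array is null and emits a null value for them
df.withColumn("likes", explode_outer(col("likes"))).show()

// +---+----+--------+
// | id|name|   likes|
// +---+----+--------+
// |  1|Luke|baseball|
// |  1|Luke|  soccer|
// |  2|Lucy|    null|
// +---+----+--------+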
Spark <= 2.1
In Scala, but the Java equivalent should be almost identical (to import individual functions use import static).
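A sketch of the trick with the question's schema (the DataFrame construction and column names are just for illustration):

import org.apache.spark.sql.functions.{array, col, explode, lit, when}
import spark.implicits._  // assumes an existing SparkSession named spark

val df = Seq(
  (1, "Luke", Some(Array("baseball", "soccer"))),
  (2, "Lucy", None)
).toDF("id", "name", "likes")

df.withColumn("likes", explode(
  when(col("likes").isNotNull, col("likes"))
    // a null array becomes a single-element array holding a typed null
    .otherwise(array(lit(null).cast("string"))))).show()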
The idea here is basically to replace NULL with an array(NULL) of a desired type. For complex types (a.k.a. structs) you have to provide the full schema:
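For instance (a sketch; dfStruct, the y column and the struct fields are illustrative names):

import org.apache.spark.sql.functions.{array, col, explode, lit, when}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// spark.implicits._ assumed in scope for toDF
val dfStruct = Seq((1L, Some(Array((1, "a")))), (2L, None)).toDF("x", "y")

// full schema of the struct elements, so the null can be cast to it
val st = StructType(Seq(
  StructField("_1", IntegerType, false),
  StructField("_2", StringType, true)))

dfStruct.withColumn("y", explode(
  when(col("y").isNotNull, col("y"))
    .otherwise(array(lit(null).cast(st)))))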
or
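equivalently, with the type written as a string (a sketch, same assumed names):

dfStruct.withColumn("y", explode(
  when(col("y").isNotNull, col("y"))
    // same substitution, with the struct type given as a type string
    .otherwise(array(lit(null).cast("struct<_1:int,_2:string>")))))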
Note: If the array Column has been created with containsNull set to false, you should change this first (tested with Spark 2.1):
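For example (a sketch; array_column and the string element type are placeholders):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, StringType}

// recreate the column with containsNull = true so null elements are allowed
df.withColumn("array_column", col("array_column").cast(ArrayType(StringType, true)))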