There was a question regarding this issue here:
Explode (transpose?) multiple columns in Spark SQL table
Suppose that we have extra columns as below:
userId someString varA     varB       varC    varD
1      "example1" [0,2,5]  [1,2,9]    [a,b,c] [red,green,yellow]
2      "example2" [1,20,5] [9,null,6] [d,e,f] [white,black,cyan]
We want to produce an output like the one below:
userId someString varA varB varC varD
1      "example1" 0    1    a    red
1      "example1" 2    2    b    green
1      "example1" 5    9    c    yellow
2      "example2" 1    9    d    white
2      "example2" 20   null e    black
2      "example2" 5    6    f    cyan
The answer given there was to define a udf:
val zip = udf((xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))
and then use it with withColumn:
df.withColumn("vars", explode(zip($"varA", $"varB"))).select(
  $"userId", $"someString",
  $"vars._1".alias("varA"), $"vars._2".alias("varB")).show
If we need to extend the above answer to more columns, what is the easiest way to amend the code? Any help is appreciated.
The approach with the zip udf seems OK, but you need to extend it for more collections. Unfortunately there is no really nice way to zip four Seqs, but something like the sketch below should work. I am assuming that the sizes of varA, varB, varC and varD stay the same, as in your example.
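A minimal sketch, assuming varA and varB are arrays of longs and varC and varD are arrays of strings (zip4 is just an illustrative name):

import org.apache.spark.sql.functions.{explode, udf}

// Zip four arrays positionally into one array of 4-tuples.
// varB is taken as boxed java.lang.Long because your example data
// contains a null element, which a primitive Long cannot hold.
val zip4 = udf((xs: Seq[Long], ys: Seq[java.lang.Long],
                zs: Seq[String], ws: Seq[String]) =>
  xs.indices.map(i => (xs(i), ys(i), zs(i), ws(i))))

df.withColumn("vars", explode(zip4($"varA", $"varB", $"varC", $"varD")))
  .select(
    $"userId", $"someString",
    $"vars._1".alias("varA"), $"vars._2".alias("varB"),
    $"vars._3".alias("varC"), $"vars._4".alias("varD"))
  .show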
If the sizes of the columns varA, varB, varC or varD differ, then those scenarios need to be handled.
You could iterate up to the maximum size and output null for positions that are not present in a shorter array (either by bounds-checking or by handling the exception), as sketched below.
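A minimal sketch of that idea, again with boxed java.lang.Long so that missing positions can be represented as null (zip4Padded is just an illustrative name):

import org.apache.spark.sql.functions.{explode, udf}

// Zip up to the length of the longest array; lift(i).orNull yields null
// for positions that fall outside a shorter array instead of throwing.
val zip4Padded = udf((xs: Seq[java.lang.Long], ys: Seq[java.lang.Long],
                      zs: Seq[String], ws: Seq[String]) => {
  val n = Seq(xs.size, ys.size, zs.size, ws.size).max
  (0 until n).map(i =>
    (xs.lift(i).orNull, ys.lift(i).orNull, zs.lift(i).orNull, ws.lift(i).orNull))
})

Note that if you are on Spark 2.4 or later, the built-in arrays_zip function does this null-padding for you and needs no udf at all:

import org.apache.spark.sql.functions.{arrays_zip, explode}

df.withColumn("vars", explode(arrays_zip($"varA", $"varB", $"varC", $"varD")))
  .select($"userId", $"someString",
          $"vars.varA", $"vars.varB", $"vars.varC", $"vars.varD")
  .show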