I am using Spark with Scala.
Imagine an input DataFrame with a few integer columns, like the apple, orange and kiwi example below.
I would like to know how to add a column, accumulator, of type Array[String], that lists for each row the names of the columns whose value is non-zero.
My real DataFrame has far more than 3 columns: it has several thousand.
How can I proceed in order to get my desired output?
You can use the array function and map over the sequence of column names:
import org.apache.spark.sql.functions.{array, col, udf, when}
val tmp = array(df.columns.map(c => when(col(c) =!= 0, c)):_*)
where
when(col(c) =!= 0, c)
yields the column name if the column value is non-zero, and null otherwise.
Then use a UDF to filter out the nulls:
val dropNulls = udf((xs: Seq[String]) => xs.flatMap(Option(_)))
df.withColumn("accumulator", dropNulls(tmp))
So with example data:
val df = Seq((1, 0, 1), (0, 1, 1), (1, 0, 0)).toDF("apple", "orange", "kiwi")
you first get:
+-----+------+----+--------------------+
|apple|orange|kiwi|                 tmp|
+-----+------+----+--------------------+
|    1|     0|   1| [apple, null, kiwi]|
|    0|     1|   1|[null, orange, kiwi]|
|    1|     0|   0| [apple, null, null]|
+-----+------+----+--------------------+
and finally:
+-----+------+----+--------------+
|apple|orange|kiwi|   accumulator|
+-----+------+----+--------------+
|    1|     0|   1| [apple, kiwi]|
|    0|     1|   1|[orange, kiwi]|
|    1|     0|   0|       [apple]|
+-----+------+----+--------------+
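For reference, here is a minimal end-to-end sketch of the steps above. It is not the only way to do it. Two things in it are assumptions of mine and not part of the answer as written: the helper name withAccumulator is hypothetical, and the UDF is swapped for the built-in SQL higher-order function filter, which needs Spark 2.4 or later. On older versions, keep the dropNulls UDF.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, expr, when}

object AccumulatorSketch {

  // Hypothetical helper: adds an "accumulator" column holding the names
  // of all columns whose value in that row is non-zero.
  def withAccumulator(df: DataFrame): DataFrame = {
    // Capture the names first so the temporary column is not scanned itself.
    val names = df.columns

    // One entry per column: the column name if the value is non-zero, null otherwise.
    val tmp = array(names.map(c => when(col(c) =!= 0, c)): _*)

    df.withColumn("tmp", tmp)
      // Assumption: Spark 2.4+, where filter(array, lambda) is available in SQL.
      .withColumn("accumulator", expr("filter(tmp, x -> x IS NOT NULL)"))
      .drop("tmp")
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; reuse your own session in practice.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("accumulator-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, 0, 1), (0, 1, 1), (1, 0, 0)).toDF("apple", "orange", "kiwi")
    withAccumulator(df).show()

    spark.stop()
  }
}
```

On the example data this should give the same accumulator values as the table above. Because the expression is built from df.columns, the same code applies unchanged to a DataFrame with thousands of columns, although I have not measured how it performs at that width.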