I am facing an issue of how to split a multi-value column, i.e. List[String], into separate rows.
The initial dataset has the following types: Dataset[(Integer, String, Double, scala.List[String])]
+---+--------------------+-------+--------------------+
| id|                text|  value|          properties|
+---+--------------------+-------+--------------------+
|  0|Lorem ipsum dolor...|    1.0|[prp1, prp2, prp3..]|
|  1|Lorem ipsum dolor...|    2.0|[prp4, prp5, prp6..]|
|  2|Lorem ipsum dolor...|    3.0|[prp7, prp8, prp9..]|
+---+--------------------+-------+--------------------+
The resulting dataset should have the following types: Dataset[(Integer, String, Double, String)], and the properties should be split such that:
+---+--------------------+-------+--------------------+
| id|                text|  value|            property|
+---+--------------------+-------+--------------------+
|  0|Lorem ipsum dolor...|    1.0|                prp1|
|  0|Lorem ipsum dolor...|    1.0|                prp2|
|  0|Lorem ipsum dolor...|    1.0|                prp3|
|  1|Lorem ipsum dolor...|    2.0|                prp4|
|  1|Lorem ipsum dolor...|    2.0|                prp5|
|  1|Lorem ipsum dolor...|    2.0|                prp6|
+---+--------------------+-------+--------------------+
Here's one way to do it:
You can use explode. Example:
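A minimal sketch of the explode approach, assuming a local SparkSession and sample rows shaped like the ones in the question. The object name ExplodeSketch, the helper splitProperties, and the use of Int instead of Integer are illustrative assumptions, not part of the original post:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.explode

// Hypothetical wrapper object, introduced only for this illustration.
object ExplodeSketch {

  // Turns each element of the list column into its own row.
  def splitProperties(
      spark: SparkSession,
      ds: Dataset[(Int, String, Double, List[String])]
  ): Dataset[(Int, String, Double, String)] = {
    import spark.implicits._
    ds.toDF("id", "text", "value", "properties")
      // explode emits one output row per element of the array column
      .select($"id", $"text", $"value", explode($"properties").as("property"))
      // for a tuple target type, columns are mapped by position
      .as[(Int, String, Double, String)]
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just to make the sketch runnable.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(
      (0, "Lorem ipsum dolor", 1.0, List("prp1", "prp2", "prp3")),
      (1, "Lorem ipsum dolor", 2.0, List("prp4", "prp5", "prp6"))
    ).toDS()

    splitProperties(spark, ds).show()
    spark.stop()
  }
}
```

Note that the intermediate result is an untyped DataFrame; the final as[...] call is what brings it back to a typed Dataset.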
explode is often suggested, but it's from the untyped DataFrame API, and given that you use Dataset, I think the flatMap operator might be a better fit (see org.apache.spark.sql.Dataset). You could use it as follows:
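A minimal sketch of the flatMap approach, assuming a local SparkSession and sample rows shaped like the ones in the question. The object name FlatMapSketch, the helper splitProperties, and the use of Int instead of Integer are illustrative assumptions:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical wrapper object, introduced only for this illustration.
object FlatMapSketch {

  // Stays in the typed Dataset API: each input tuple yields one
  // output tuple per element of its properties list.
  def splitProperties(
      spark: SparkSession,
      ds: Dataset[(Int, String, Double, List[String])]
  ): Dataset[(Int, String, Double, String)] = {
    import spark.implicits._ // provides the encoder for the result tuple
    ds.flatMap { case (id, text, value, properties) =>
      properties.map(property => (id, text, value, property))
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just to make the sketch runnable.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatmap-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(
      (0, "Lorem ipsum dolor", 1.0, List("prp1", "prp2", "prp3")),
      (1, "Lorem ipsum dolor", 2.0, List("prp4", "prp5", "prp6"))
    ).toDS()

    splitProperties(spark, ds).show()
    spark.stop()
  }
}
```

With this version the compiler checks the element types, but rows whose list is empty produce no output rows at all, which matches the default behaviour of explode.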