I have a dataframe that contains the following:
movieId / movieName / genre
1 example1 action|thriller|romance
2 example2 fantastic|action
I would like to obtain a second dataframe (from the first one), that contains the following:
movieId / movieName / genre
1 example1 action
1 example1 thriller
1 example1 romance
2 example2 fantastic
2 example2 action
How I can do that?
I'd use split
standard function.
scala> movies.show(truncate = false)
+-------+---------+-----------------------+
|movieId|movieName|genre |
+-------+---------+-----------------------+
|1 |example1 |action|thriller|romance|
|2 |example2 |fantastic|action |
+-------+---------+-----------------------+
scala> movies.withColumn("genre", explode(split($"genre", "[|]"))).show
+-------+---------+---------+
|movieId|movieName| genre|
+-------+---------+---------+
| 1| example1| action|
| 1| example1| thriller|
| 1| example1| romance|
| 2| example2|fantastic|
| 2| example2| action|
+-------+---------+---------+
// You can use \\| for split instead
scala> movies.withColumn("genre", explode(split($"genre", "\\|"))).show
+-------+---------+---------+
|movieId|movieName| genre|
+-------+---------+---------+
| 1| example1| action|
| 1| example1| thriller|
| 1| example1| romance|
| 2| example2|fantastic|
| 2| example2| action|
+-------+---------+---------+
p.s. You could use Dataset.flatMap
to achieve the same result which is something Scala devs would enjoy more I'm sure.
Using RDD
val df = Seq((1,"example1","action|thriller|romance"),(2,"example2","fantastic|action")).toDF("Id","name","genre")
df.rdd.flatMap( x=>{ val p = x.getAs[String]("genre"); for { a <- p.split("[|]") } yield (x(0),x(1),a)} ).foreach(println)
Results:
(1,example1,action)
(2,example2,fantastic)
(1,example1,thriller)
(2,example2,action)
(1,example1,romance)
To converting back to DF
val rdd1 = df.rdd.flatMap( x=>{ val p = x.getAs[String]("genre"); for { a <- p.split("[|]") } yield Row(x(0),x(1),a)} )
spark.createDataFrame(rdd1,df.schema.copy(Array(StructField("Id",IntegerType),StructField("name",StringType))).add(StructField("genre2",StringType))).show(false)