I have this data structure in Spark:
val df = Seq(
  ("Package 1", Seq("address1", "address2", "address3")),
  ("Package 2", Seq("address3", "address4", "address5", "address6")),
  ("Package 3", Seq("address7", "address8")),
  ("Package 4", Seq("address9")),
  ("Package 5", Seq("address9", "address1")),
  ("Package 6", Seq("address10")),
  ("Package 7", Seq("address8"))
).toDF("Package", "Destinations")
df.show(20, false)
I need to find all the addresses that were seen together across different packages, grouped transitively: address3 links Package 1 to Package 2, and address1 links Package 5 (and through it address9, hence Package 4) to that same group, so all of those addresses end up in one set. I can't find an efficient way to do this. I've tried grouping, mapping, etc. Ideally, the result for the given df would be:
+----+------------------------------------------------------------------------+
| Id | Addresses |
+----+------------------------------------------------------------------------+
| 1 | [address1, address2, address3, address4, address5, address6, address9] |
| 2 | [address7, address8] |
| 3 | [address10] |
+----+------------------------------------------------------------------------+
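For context, this is essentially a connected-components problem on a graph whose vertices are the addresses and whose edges are "appeared in the same package". Below is a minimal sketch of that approach using the built-in GraphX API; the chain edges and the collected idByAddr lookup are illustrative shortcuts that are fine for small data, not a tuned solution:

import org.apache.spark.graphx._
import org.apache.spark.sql.functions._
import spark.implicits._

// Vertices: one numeric id per distinct address.
val addrIds = df
  .select(explode($"Destinations").as("addr"))
  .distinct()
  .rdd.map(_.getString(0))
  .zipWithIndex()                                  // RDD[(String, Long)]
val idByAddr = addrIds.collectAsMap()              // assumes the address set fits on the driver

// Edges: chaining consecutive addresses within a package is enough for connectivity.
val edges = df.select($"Destinations").rdd.flatMap { row =>
  val addrs = row.getSeq[String](0)
  addrs.zip(addrs.tail).map { case (a, b) => Edge(idByAddr(a), idByAddr(b), 1) }
}

val graph = Graph(addrIds.map { case (a, id) => (id, a) }, edges)

// connectedComponents labels each vertex with the smallest vertex id in its component.
val groups = graph.connectedComponents().vertices  // RDD[(VertexId, componentId)]
  .join(addrIds.map { case (a, id) => (id, a) })   // attach address names back
  .map { case (_, (component, addr)) => (component, addr) }
  .groupByKey()
  .values
  .map(_.toSeq.sorted)

groups.collect().foreach(println)

On larger data you would broadcast the id map (or build the edges with joins instead of collecting at all); GraphFrames also exposes connectedComponents directly over DataFrames if adding that package is an option.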