How to extract sub-elements from a DataFrame column

Published 2019-09-21 06:44

Question:

This question already has an answer here:

  • How to transform DataFrame before joining operation? 1 answer

Given the DataFrame like this:

df_products =

+----------+--------------------+
|product_PK|            products|
+----------+--------------------+
|       111|[[222,66],[333,55...|
|       222|[[333,24],[444,77...|
...
+----------+--------------------+

how can I transform it into the following DataFrame:

df_products =

+----------+--------------------+------+
|product_PK|      rec_product_PK|  rank|
+----------+--------------------+------+
|       111|                 222|    66|
|       111|                 333|    55|
|       222|                 333|    24|
|       222|                 444|    77|
...
+----------+--------------------+------+

Answer 1:

You basically have two steps here: first explode the array (using the explode function) to get one row per array element, then extract the fields of each element.

You did not include the schema, so the internal structure of each array element is not clear; however, I would assume it is a struct with two fields.

This means you would do something like this:

import org.apache.spark.sql.functions.explode

// Explode the array so each element becomes its own row,
// then expand the struct's fields into separate columns.
val df1 = df.withColumn("array_elem", explode(df("products")))
val df2 = df1.select("product_PK", "array_elem.*")

Now all that remains is to rename the columns to the names you need.
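Putting the steps together, here is a minimal end-to-end sketch. Since the original post does not show the schema, this assumes each element of `products` is a two-field struct; building it from Scala tuples gives fields named `_1` and `_2`, which are then renamed to match the desired output (`rec_product_PK`, `rank`). The `SparkSession` setup is illustrative, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder
  .appName("explode-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical input mirroring the question: each element of
// `products` is a (recommended product, rank) pair.
val df = Seq(
  (111, Seq((222, 66), (333, 55))),
  (222, Seq((333, 24), (444, 77)))
).toDF("product_PK", "products")

val result = df
  .withColumn("array_elem", explode(col("products")))  // one row per pair
  .select(
    col("product_PK"),
    col("array_elem._1").as("rec_product_PK"),  // rename struct fields
    col("array_elem._2").as("rank")
  )

result.show()
```

If your real schema already has meaningful field names inside the struct, `select("product_PK", "array_elem.*")` followed by `withColumnRenamed` works just as well.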