The data looks like this:

+---+-----+-----------------------------+
| id|point|                         data|
+---+-----+-----------------------------+
|abc|    6|{"key1":"124", "key2": "345"}|
|dfl|    7|{"key1":"777", "key2": "888"}|
|4bd|    6|{"key1":"111", "key2": "788"}|
+---+-----+-----------------------------+
I am trying to break it into the following format.
+---+-----+----+----+
| id|point|key1|key2|
+---+-----+----+----+
|abc|    6| 124| 345|
|dfl|    7| 777| 888|
|4bd|    6| 111| 788|
+---+-----+----+----+
The explode function explodes the dataframe into multiple rows, but that is not the desired solution here.
Note: the related question PySpark "explode" dict in column does not answer my question.
As long as you are using Spark version 2.1 or higher, pyspark.sql.functions.from_json should get you your desired result, but you would need to first define the required schema, which should give you the second table shown above.
As suggested by @pault, the data field is a string field. Since the keys are the same (i.e. 'key1', 'key2') in the JSON string across rows, you might also use json_tuple() (this function is new in version 1.6, according to the documentation).
Below is my original post, which is most likely WRONG if the original table is the output of df.show(truncate=False), in which case the data field is NOT a Python data structure. Since you had exploded the data into rows, I had supposed the column data was a Python data structure instead of a string: