The current PySpark DataFrame has this structure (a list of WrappedArrays for col2):
+---+---------------------------------------------------------------------+
|id |col2 |
+---+---------------------------------------------------------------------+
|a |[WrappedArray(code2), WrappedArray(code1, code3)] |
+---+---------------------------------------------------------------------+
|b |[WrappedArray(code5), WrappedArray(code6, code8)] |
+---+---------------------------------------------------------------------+
This is the structure I would like to have (a flattened list for col2):
+---+---------------------------------------------------------------------+
|id |col2 |
+---+---------------------------------------------------------------------+
|a |[code2, code1, code3] |
+---+---------------------------------------------------------------------+
|b |[code5, code6, code8] |
+---+---------------------------------------------------------------------+
but I'm not sure how to do that transformation. I tried using flatMap, but that didn't seem to work. Any suggestions?
Apply a UDF that takes the list of lists as input and returns a single list with all the elements. I will post an example if it's not clear. Please tell me if that solves your problem.
You can do this in two ways: with a UDF or through the RDD API. Here is an example of each:-
RDD:-
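The original code for this approach is not shown, so what follows is only a minimal sketch of how it might look. It assumes the DataFrame is called df, that id is a string column, and that col2 is an array of string arrays. The names flatten_one_level and flatten_col2_with_rdd are hypothetical helpers, not part of PySpark.

```python
from itertools import chain


def flatten_one_level(nested):
    """Merge a list of lists into one flat list; pass None through unchanged."""
    if nested is None:
        return None
    return list(chain.from_iterable(nested))


def flatten_col2_with_rdd(df):
    """Return a new DataFrame whose col2 is flattened, going through the RDD API."""
    # Imported here so the helper above can be tried without Spark installed.
    from pyspark.sql.types import ArrayType, StringType, StructField, StructType

    # Assumed schema: id is a string and col2 ends up as array<string>.
    schema = StructType([
        StructField("id", StringType()),
        StructField("col2", ArrayType(StringType())),
    ])
    # Each element of df.rdd is a Row; rebuild it as an (id, flat list) tuple.
    flat_rdd = df.rdd.map(lambda row: (row["id"], flatten_one_level(row["col2"])))
    return flat_rdd.toDF(schema)
```

Calling flatten_col2_with_rdd(df).show(truncate=False) should then display col2 as a single list per row. Passing an explicit schema avoids relying on type inference, which can fail when a row holds an empty list.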
UDF:-
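The original code for this approach is not shown either, so this is a sketch under the same assumptions: a DataFrame named df whose col2 is an array of string arrays. The names flatten_lists and flatten_col2_with_udf are hypothetical helpers chosen for illustration.

```python
from itertools import chain


def flatten_lists(nested):
    """Merge a list of lists into one flat list; pass None through unchanged."""
    if nested is None:
        return None
    return list(chain.from_iterable(nested))


def flatten_col2_with_udf(df):
    """Return a new DataFrame whose col2 is flattened by a Python UDF."""
    # Imported here so the helper above can be tried without Spark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, StringType

    # The declared return type must match what the function produces:
    # here, an array of strings (an assumption about the data).
    flatten_udf = udf(flatten_lists, ArrayType(StringType()))
    # Overwrite col2 in place; all other columns are kept as they are.
    return df.withColumn("col2", flatten_udf(col("col2")))
```

The UDF version keeps the work inside the DataFrame API, so the other columns and their types carry over untouched. On Spark 2.4 or later, the built-in pyspark.sql.functions.flatten does the same one-level flattening without a Python UDF and is likely to be faster.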