I have a PySpark DataFrame in which I have grouped the data into lists with collect_list.
from pyspark.sql.functions import udf, collect_list
from itertools import combinations, chain
#Create Dataframe
df = spark.createDataFrame( [(1,'a'), (1,'b'), (2,'c')] , ["id", "colA"])
df.show()
>>>
+---+----+
| id|colA|
+---+----+
| 1| a|
| 1| b|
| 2| c|
+---+----+
#Group by and collect to list
df = df.groupBy(df.id).agg(collect_list("colA").alias("colAList"))
df.show()
>>>
+---+--------+
| id|colAList|
+---+--------+
| 1| [a, b]|
| 2| [c]|
+---+--------+
Then I use a function to collect all combinations of the list elements into a new list:
allsubsets = lambda l: list(chain(*[combinations(l , n) for n in range(1,len(l)+1)]))
df = df.withColumn('colAsubsets',udf(allsubsets)(df['colAList']))
So I would expect something like:
+---+--------------------+
| id| colAsubsets |
+---+--------------------+
| 1| [[a], [b], [a,b]] |
| 2| [[c]] |
+---+--------------------+
but I get:
df.show()
>>>
+---+--------+-----------------------------------------------------------------------------------------+
|id |colAList|colAsubsets |
+---+--------+-----------------------------------------------------------------------------------------+
|1 |[a, b] |[[Ljava.lang.Object;@75e2d657, [Ljava.lang.Object;@7f662637, [Ljava.lang.Object;@b572639]|
|2 |[c] |[[Ljava.lang.Object;@26f67148] |
+---+--------+-----------------------------------------------------------------------------------------+
Any ideas what to do? And then maybe how to flatten the list to different rows?
Improving on @RameshMaharjan's answer, in order to flatten the list into separate rows:
You have to use explode on an array column. Before that, you must specify the return type of your udf, because otherwise it returns a StringType.
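The two steps above (declare an array return type, then explode) could be sketched as follows. This is only a minimal sketch, not necessarily the code originally posted: the names all_subsets and explode_subsets and the output column name subset are hypothetical, and the Spark imports are deferred into the function so that the subset helper can be tried on its own.

```python
from itertools import chain, combinations


def all_subsets(items):
    """Return every non-empty combination of items, each as a plain list."""
    return [
        list(combo)
        for combo in chain.from_iterable(
            combinations(items, n) for n in range(1, len(items) + 1)
        )
    ]


def explode_subsets(df):
    """Hypothetical helper: expects the grouped DataFrame with id and colAList."""
    # Imported here so all_subsets stays usable without a Spark installation
    from pyspark.sql.functions import explode, udf
    from pyspark.sql.types import ArrayType, StringType

    # Declare the return type explicitly; without it the udf returns StringType
    subsets_udf = udf(all_subsets, ArrayType(ArrayType(StringType())))
    with_subsets = df.withColumn("colAsubsets", subsets_udf(df["colAList"]))

    # explode() needs an array column and emits one row per array element
    return with_subsets.select("id", explode("colAsubsets").alias("subset"))
```

Calling explode_subsets(df) on the grouped DataFrame from the question should then yield one row per subset, with subset itself being an array of strings.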
Result: one row per subset, that is, [a], [b] and [a, b] for id 1 and [c] for id 2.
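All you need to do is extract the elements from the objects created by chain and combinations in a flattened way, so that each subset is built from its individual elements. Changing the allsubsets lambda in that way should give you the [[a], [b], [a, b]] and [[c]] values you expected instead of the [Ljava.lang.Object;@... references.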
I hope the answer is helpful
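The change this answer describes could look roughly like the sketch below. It is an assumption about the intent, not the code originally posted: each tuple produced by combinations is converted to a list, presumably because tuples reach the JVM as Java arrays, which print as object references when the udf returns the default StringType. The wrapper name add_subsets_column is hypothetical.

```python
from itertools import chain, combinations

# Same lambda as in the question, except that each tuple produced by
# combinations() is converted to a list of its elements
allsubsets = lambda l: [
    list(combo)
    for combo in chain(*[combinations(l, n) for n in range(1, len(l) + 1)])
]


def add_subsets_column(df):
    """Hypothetical wrapper: expects the grouped DataFrame with colAList."""
    # Imported here so the lambda above can be tried without Spark installed
    from pyspark.sql.functions import udf

    # No return type is given, so the new column is still a string column
    return df.withColumn("colAsubsets", udf(allsubsets)(df["colAList"]))
```

Because no return type is declared here, colAsubsets remains a string; an explicit array type is still needed before the column can be exploded.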