I have a Spark DataFrame with two columns produced by collect_set. I would like to combine these two columns of sets into a single column of sets. How should I do so? Both are sets of strings.
For instance, I have two columns formed by calling collect_set:
Fruits | Meat
[Apple,Orange,Pear] [Beef, Chicken, Pork]
How do I turn it into:
Food
[Apple,Orange,Pear, Beef, Chicken, Pork]
Thank you very much in advance for your help.
Let's say df has
+--------------------+--------------------+
| Fruits| Meat|
+--------------------+--------------------+
|[Pear, Orange, Ap...|[Chicken, Pork, B...|
+--------------------+--------------------+
then
import itertools
df.rdd.map(lambda x: [item for item in itertools.chain(x.Fruits, x.Meat)]).collect()
creates a list combining Fruits and Meat for each row, i.e.
[[u'Pear', u'Orange', u'Apple', u'Chicken', u'Pork', u'Beef']]
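Note that itertools.chain only concatenates, so an item present in both columns would appear twice in the result. Below is a minimal sketch of a per-row merge that keeps set semantics. It uses plain Python lists in place of the Row fields, and the helper name merge_row and the sample data are hypothetical, not part of the original answer.

```python
import itertools

def merge_row(fruits, meat):
    # chain() concatenates the two iterables; dict.fromkeys() then drops
    # duplicates while keeping first-seen order (Python 3.7+).
    return list(dict.fromkeys(itertools.chain(fruits, meat)))

# Hypothetical sample row with one overlapping item ("Apple").
merged = merge_row(["Pear", "Orange", "Apple"], ["Chicken", "Apple", "Beef"])
print(merged)  # ['Pear', 'Orange', 'Apple', 'Chicken', 'Beef']
```

With the dataframe above, it could presumably be plugged in as df.rdd.map(lambda x: merge_row(x.Fruits, x.Meat)).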
Hope this helps!
I was also figuring this out in Python, so here is a port of Ramesh's solution to Python:
df = spark.createDataFrame([(['Pear','Orange','Apple'], ['Chicken','Pork','Beef'])],
("Fruits", "Meat"))
df.show(1,False)
from pyspark.sql.functions import col, udf
mergeCols = udf(lambda fruits, meat: fruits + meat)
df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(1,False)
Output:
+---------------------+---------------------+
|Fruits |Meat |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+
+---------------------+---------------------+------------------------------------------+
|Fruits |Meat |Food |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+
Kudos to Ramesh!
EDIT: Note that you may have to specify the return type explicitly: udf defaults to StringType when no return type is given, which is likely why I was getting a string type column in some cases.
from pyspark.sql.types import ArrayType, StringType
mergeCols = udf(lambda fruits, meat: fruits + meat, ArrayType(StringType()))
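One caveat with the lambda above: fruits + meat raises a TypeError if either column holds a null, which can happen, for example, after an outer join. Here is a hedged sketch of a null-safe merge function that also removes duplicates. The name merge_cols is hypothetical, and wrapping it as a UDF is shown only as a comment, so the block runs without Spark.

```python
def merge_cols(fruits, meat):
    # Treat a null (None) column value as an empty list so that
    # concatenation does not raise TypeError.
    merged = list(fruits or []) + list(meat or [])
    # set() removes duplicates; sorted() gives a deterministic order.
    return sorted(set(merged))

# Assumed usage with the imports from this answer:
# mergeCols = udf(merge_cols, ArrayType(StringType()))

print(merge_cols(["Pear", "Apple"], None))           # ['Apple', 'Pear']
print(merge_cols(["Pork", "Beef"], ["Beef", "Ham"]))  # ['Beef', 'Ham', 'Pork']
```

Sorting is a design choice here: a set has no inherent order, so sorting keeps the output stable across runs. Drop it if the order does not matter.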
Given that you have a dataframe as
+---------------------+---------------------+
|Fruits |Meat |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+
You can write a udf function to merge the sets of the two columns into one.
import scala.collection.mutable
import org.apache.spark.sql.functions._
// ++ concatenates the two arrays; duplicates across the columns are kept
def mergeCols = udf((fruits: mutable.WrappedArray[String], meat: mutable.WrappedArray[String]) => fruits ++ meat)
And then call the udf function as
df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(false)
You should have your desired final dataframe
+---------------------+---------------------+------------------------------------------+
|Fruits |Meat |Food |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+