pyspark - merge 2 columns of sets

Posted 2020-07-22 04:55

Question:

I have a Spark dataframe with 2 columns that were both produced by the function collect_set. Both are sets of strings. How can I combine these 2 columns of sets into 1 column of sets?

For instance, I have 2 columns formed from calling collect_set:

Fruits                  |    Meat
[Apple,Orange,Pear]          [Beef, Chicken, Pork]

How do I turn it into:

Food

[Apple,Orange,Pear, Beef, Chicken, Pork]

Thank you very much in advance for your help.

Answer 1:

Let's say df has

+--------------------+--------------------+
|              Fruits|                Meat|
+--------------------+--------------------+
|[Pear, Orange, Ap...|[Chicken, Pork, B...|
+--------------------+--------------------+

then

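import itertools

# Concatenate the two array columns of each row and collect the result to the driver
df.rdd.map(lambda x: list(itertools.chain(x.Fruits, x.Meat))).collect()

combines Fruits and Meat into one list per row (a plain concatenation, so a value present in both columns would appear twice), i.e.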

[[u'Pear', u'Orange', u'Apple', u'Chicken', u'Pork', u'Beef']]
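
Since the question asks for a set, it may be worth seeing how the concatenation differs from a true union when the two columns share a value. The following is a minimal sketch in plain Python, outside Spark; the row values are made up for illustration and "Apple" is deliberately placed in both lists.

```python
import itertools

# Hypothetical row values; "Apple" appears in both columns on purpose
fruits = ["Pear", "Orange", "Apple"]
meat = ["Chicken", "Apple", "Beef"]

# Plain concatenation, as in the snippet above: the duplicate survives
concatenated = list(itertools.chain(fruits, meat))

# Wrapping the chain in set() gives union semantics (element order is not preserved)
as_set = set(itertools.chain(fruits, meat))

print(len(concatenated), len(as_set))
```

Inside the map above, the same idea would be `list(set(itertools.chain(x.Fruits, x.Meat)))`, assuming the order of the elements does not matter.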


Hope this helps!



Answer 2:

I was also figuring this out in Python, so here is a port of Ramesh's solution to Python:

df = spark.createDataFrame([(['Pear','Orange','Apple'], ['Chicken','Pork','Beef'])],
                           ("Fruits", "Meat"))
df.show(1,False)

from pyspark.sql.functions import col, udf

# Concatenate the two array columns with a Python UDF
mergeCols = udf(lambda fruits, meat: fruits + meat)
df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(1, False)

Output:

+---------------------+---------------------+
|Fruits               |Meat                 |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+
+---------------------+---------------------+------------------------------------------+
|Fruits               |Meat                 |Food                                      |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+

Kudos to Ramesh!


EDIT: Note that you might have to specify the return type explicitly. udf defaults to a StringType return type, which is probably why in some cases I was getting a string type column instead of an array.

from pyspark.sql.types import ArrayType, StringType
mergeCols = udf(lambda fruits, meat: fruits + meat, ArrayType(StringType()))
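
The lambda above concatenates the two lists, so it keeps duplicates and fails if either column is null. A minimal sketch of a function that could replace the lambda follows; the name `merge_unique` is hypothetical, and the null handling is an assumption about what the data may contain.

```python
def merge_unique(fruits, meat):
    """Concatenate two lists, dropping duplicates and treating None as empty."""
    combined = (fruits or []) + (meat or [])
    # dict keys keep insertion order (Python 3.7+), so first occurrences win
    return list(dict.fromkeys(combined))


print(merge_unique(["Pear", "Apple"], ["Apple", "Beef"]))
print(merge_unique(None, ["Beef"]))
```

It would be registered the same way as the lambda, e.g. `udf(merge_unique, ArrayType(StringType()))`. On Spark 2.4 or later, the built-in `array_union` in `pyspark.sql.functions` returns the union of two array columns without duplicates and avoids a Python UDF altogether.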


Answer 3:

Given that you have a dataframe such as

+---------------------+---------------------+
|Fruits               |Meat                 |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+

You can write a udf function to merge the sets of two columns into one.

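import org.apache.spark.sql.functions._
import scala.collection.mutable

// Array columns arrive in a Scala udf as WrappedArray; ++ concatenates them
def mergeCols = udf((fruits: mutable.WrappedArray[String], meat: mutable.WrappedArray[String]) => fruits ++ meat)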

And then call the udf function as

df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(false)

You should have your desired final dataframe

+---------------------+---------------------+------------------------------------------+
|Fruits               |Meat                 |Food                                      |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+