I want to follow the example below and do my aggregation over the 'states' collected by collect_list.
example code:
import operator
states = sc.parallelize(["TX","TX","CA","TX","CA"])
states.map(lambda x:(x,1)).reduceByKey(operator.add).collect()
#printed output: [('TX', 3), ('CA', 2)]
my code:
from pyspark import SparkContext,SparkConf
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import collect_list
import operator
conf = SparkConf().setMaster("local")
conf = conf.setAppName("test")
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
rdd = sc.parallelize([('20170901',['TX','TX','CA','TX']), ('20170902', ['TX','CA','CA']), ('20170902',['TX']) ])
df = spark.createDataFrame(rdd, ["datatime", "actionlist"])
df = df.groupBy("datatime").agg(collect_list("actionlist").alias("actionlist"))
rdd = df.select("actionlist").rdd.map(lambda x:(x,1))#.reduceByKey(operator.add)
print (rdd.take(2))
#printed output: [(Row(actionlist=[['TX', 'CA', 'CA'], ['TX']]), 1), (Row(actionlist=[['TX', 'TX', 'CA', 'TX']]), 1)]
#for next step, it should look like:
#[Row(actionlist=[('TX',1), ('CA',1), ('CA',1), ('TX',1)]), Row(actionlist=[('TX',1), ('TX',1), ('CA',1), ('TX',1)])]
what I want is something like:
20170901,[('TX', 3), ('CA', 1 )]
20170902,[('TX', 2), ('CA', 2 )]
I think the first step is to flatten the collect_list result. I've tried:
udf(lambda x: list(chain.from_iterable(x)), StringType())
udf(lambda items: list(chain.from_iterable(itertools.repeat(x,1) if isinstance(x,str) else x for x in items)))
udf(lambda l: [item for sublist in l for item in sublist])
but no luck yet. The next step is to make up the KV pairs and do the reduce, and I've been stuck here for a while. Can any Spark expert help with the logic? I appreciate your help!
You can use reduce and Counter in a UDF to achieve it. I tried it my way; hope this helps.
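A minimal sketch of that idea, assuming a SparkSession named spark is available as in the question; count_states and count_schema are illustrative names I chose, not from the original post:

from functools import reduce
from collections import Counter
from pyspark.sql.functions import collect_list, udf
from pyspark.sql.types import ArrayType, StructType, StructField, StringType, IntegerType

df = spark.createDataFrame(
    [('20170901', ['TX', 'TX', 'CA', 'TX']),
     ('20170902', ['TX', 'CA', 'CA']),
     ('20170902', ['TX'])],
    ["datatime", "actionlist"])

# the UDF returns an array of (state, count) structs
count_schema = ArrayType(StructType([
    StructField("state", StringType()),
    StructField("count", IntegerType())]))

def count_states(nested):
    # nested is the list of lists produced by collect_list;
    # build one Counter per sub-list and add them all together
    merged = reduce(lambda a, b: a + b, (Counter(sub) for sub in nested), Counter())
    return list(merged.items())

count_states_udf = udf(count_states, count_schema)

result = (df.groupBy("datatime")
            .agg(collect_list("actionlist").alias("actionlist"))
            .withColumn("actionlist", count_states_udf("actionlist")))
result.show(truncate=False)
# roughly:
# 20170901 -> [[TX, 3], [CA, 1]]
# 20170902 -> [[TX, 2], [CA, 2]]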
You can simply do it by using combineByKey():
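A sketch of the combineByKey() route, working directly on the question's original pair RDD (it assumes sc as defined in the question); Counter is just a convenient combiner here:

from collections import Counter

rdd = sc.parallelize([('20170901', ['TX', 'TX', 'CA', 'TX']),
                      ('20170902', ['TX', 'CA', 'CA']),
                      ('20170902', ['TX'])])

counts = rdd.combineByKey(
    lambda v: Counter(v),         # createCombiner: count the first action list seen for a key
    lambda c, v: c + Counter(v),  # mergeValue: fold another action list into the combiner
    lambda c1, c2: c1 + c2)       # mergeCombiners: merge partial counters across partitions

print(counts.mapValues(lambda c: list(c.items())).collect())
# roughly: [('20170901', [('TX', 3), ('CA', 1)]), ('20170902', [('TX', 2), ('CA', 2)])]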