Python Spark Cumulative Sum by Group Using DataFra

2020-01-29 05:29发布

How do I compute the cumulative sum per group specifically using the DataFrame abstraction; and in PySpark?

With an example dataset as follows:

df = sqlContext.createDataFrame( [(1,2,"a"),(3,2,"a"),(1,3,"b"),(2,2,"a"),(2,3,"b")], 
                                 ["time", "value", "class"] )

+----+-----+-----+
|time|value|class|
+----+-----+-----+
|   1|    2|    a|
|   3|    2|    a|
|   1|    3|    b|
|   2|    2|    a|
|   2|    3|    b|
+----+-----+-----+

I would like to add a cumulative sum column of value for each class grouping over the (ordered) time variable.

标签： apache-spark pyspark spark-dataframe

1条回答

forever°为你锁心

2楼-- · 2020-01-29 06:07

This can be done using a combination of a window function and the Window.unboundedPreceding value in the window's range as follows:

from pyspark.sql import Window
from pyspark.sql import functions as F

windowval = (Window.partitionBy('class').orderBy('time')
             .rangeBetween(Window.unboundedPreceding, 0))
df_w_cumsum = df.withColumn('cum_sum', F.sum('value').over(windowval))
df_w_cumsum.show()

+----+-----+-----+-------+
|time|value|class|cum_sum|
+----+-----+-----+-------+
|   1|    3|    b|      3|
|   2|    3|    b|      6|
|   1|    2|    a|      2|
|   2|    2|    a|      4|
|   3|    2|    a|      6|
+----+-----+-----+-------+

0人赞添加讨论(0) 举报

Python Spark Cumulative Sum by Group Using DataFra

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间