How to create dynamic groups in a PySpark DataFrame?

The real problem is to create multiple groups based on the values of two or more columns in consecutive rows, but I am simplifying it here. Suppose I have a PySpark DataFrame like this:

>>> from pyspark.sql import Row
>>> df=sqlContext.createDataFrame([
... Row(SN=1,age=45, gender='M', name='Bob'),
... Row(SN=2,age=28, gender='M', name='Albert'),
... Row(SN=3,age=33, gender='F', name='Laura'),
... Row(SN=4,age=43, gender='F', name='Gloria'),
... Row(SN=5,age=18, gender='T', name='Simone'),
... Row(SN=6,age=45, gender='M', name='Alax'),
... Row(SN=7,age=28, gender='M', name='Robert')])
>>> df.show()

+---+---+------+------+
| SN|age|gender|  name|
+---+---+------+------+
|  1| 45|     M|   Bob|
|  2| 28|     M|Albert|
|  3| 33|     F| Laura|
|  4| 43|     F|Gloria|
|  5| 18|     T|Simone|
|  6| 45|     M|  Alax|
|  7| 28|     M|Robert|
+---+---+------+------+

Now I want to add a "section" column that keeps the same value while the gender in consecutive rows matches, and is incremented whenever the gender changes in the next row. To be precise, I want output like this:

+---+---+------+------+-------+
| SN|age|gender|  name|section|
+---+---+------+------+-------+
|  1| 45|     M|   Bob|      1|
|  2| 28|     M|Albert|      1|
|  3| 33|     F| Laura|      2|
|  4| 43|     F|Gloria|      2|
|  5| 18|     T|Simone|      3|
|  6| 45|     M|  Alax|      4|
|  7| 28|     M|Robert|      4|
+---+---+------+------+-------+

Tags: scala group-by pyspark apache-spark-sql rdd

1 answer

Bombasti

Reply #2 · 2019-07-16 13:49

Unclear if you're looking for Python or Scala solutions, but they would be pretty similar - so here's a Scala solution using Window Functions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// we'll use this window to attach the "previous" gender to each record
val globalWindow = Window.orderBy("SN")

// we'll use this window to compute the "cumulative sum" of
// an indicator column that is 1 only where the gender changed
val upToThisRowWindow = globalWindow.rowsBetween(Window.unboundedPreceding, Window.currentRow)

val result = df
  .withColumn("prevGender", lag("gender", 1) over globalWindow) // add previous record's gender
  .withColumn("shouldIncrease", when($"prevGender" =!= $"gender", 1) otherwise 0) // translate to 1 or 0
  .withColumn("section", (sum("shouldIncrease") over upToThisRowWindow) + lit(1)) // cumulative sum
  .drop("prevGender", "shouldIncrease") // drop helper columns

result.show()
// +---+---+------+------+-------+
// | SN|age|gender|  name|section|
// +---+---+------+------+-------+
// |  1| 45|     M|   Bob|      1|
// |  2| 28|     M|Albert|      1|
// |  3| 33|     F| Laura|      2|
// |  4| 43|     F|Gloria|      2|
// |  5| 18|     T|Simone|      3|
// |  6| 45|     M|  Alax|      4|
// |  7| 28|     M|Robert|      4|
// +---+---+------+------+-------+

The equivalent PySpark code follows:

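from pyspark.sql import Window as W
from pyspark.sql import functions as F
globalWindow = W.orderBy("SN")
upToThisRowWindow = globalWindow.rowsBetween(W.unboundedPreceding, W.currentRow)
# same steps as the Scala version: previous gender, change indicator, cumulative sum
result = (df
    .withColumn("prevGender", F.lag("gender", 1).over(globalWindow))
    .withColumn("shouldIncrease", F.when(F.col("prevGender") != F.col("gender"), 1).otherwise(0))
    .withColumn("section", F.sum("shouldIncrease").over(upToThisRowWindow) + 1)
    .drop("prevGender", "shouldIncrease"))
result.show()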
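To make the logic behind the window expressions concrete, here is a minimal plain-Python sketch of the same two steps: compare each row with the previous one, then keep a running count of the changes. It does not use Spark. The function name assign_sections and its keys parameter are hypothetical, introduced only for illustration. Passing more than one key corresponds to the case mentioned in the question, where a group depends on two or more columns.

```python
def assign_sections(rows, keys=("gender",)):
    """Return copies of rows (assumed already sorted, e.g. by SN) with a 'section' field.

    Plain-Python illustration only; this is not Spark code.
    """
    result = []
    section = 0
    prev = None  # key values of the previous row; plays the role of lag(...)
    for row in rows:
        current = tuple(row[k] for k in keys)
        # Start a new section on the first row or when any key column changes.
        if prev is None or current != prev:
            section += 1
        result.append({**row, "section": section})
        prev = current
    return result


# Same data as in the question (name column omitted for brevity).
rows = [
    {"SN": 1, "age": 45, "gender": "M"},
    {"SN": 2, "age": 28, "gender": "M"},
    {"SN": 3, "age": 33, "gender": "F"},
    {"SN": 4, "age": 43, "gender": "F"},
    {"SN": 5, "age": 18, "gender": "T"},
    {"SN": 6, "age": 45, "gender": "M"},
    {"SN": 7, "age": 28, "gender": "M"},
]

sections = [r["section"] for r in assign_sections(rows)]
print(sections)  # [1, 1, 2, 2, 3, 4, 4]
```

In the Spark versions, the equivalent change for several columns would be to compare each of them with its own lag and combine the comparisons with "or" before summing; that variant is an assumption and was not part of the original answer.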
