Although the real problem is to create multiple groups based on the values of two or more columns in consecutive rows, I am simplifying it here. Suppose I have a PySpark DataFrame like this:
>>> from pyspark.sql import Row
>>> df=sqlContext.createDataFrame([
... Row(SN=1,age=45, gender='M', name='Bob'),
... Row(SN=2,age=28, gender='M', name='Albert'),
... Row(SN=3,age=33, gender='F', name='Laura'),
... Row(SN=4,age=43, gender='F', name='Gloria'),
... Row(SN=5,age=18, gender='T', name='Simone'),
... Row(SN=6,age=45, gender='M', name='Alax'),
... Row(SN=7,age=28, gender='M', name='Robert')])
>>> df.show()
+---+---+------+------+
| SN|age|gender|  name|
+---+---+------+------+
|  1| 45|     M|   Bob|
|  2| 28|     M|Albert|
|  3| 33|     F| Laura|
|  4| 43|     F|Gloria|
|  5| 18|     T|Simone|
|  6| 45|     M|  Alax|
|  7| 28|     M|Robert|
+---+---+------+------+
Now I want to add a "section" column that keeps the same value while the gender in consecutive rows matches, and is incremented whenever the gender changes in the next row. To be precise, I want output like this:
+---+---+------+------+-------+
| SN|age|gender|  name|section|
+---+---+------+------+-------+
|  1| 45|     M|   Bob|      1|
|  2| 28|     M|Albert|      1|
|  3| 33|     F| Laura|      2|
|  4| 43|     F|Gloria|      2|
|  5| 18|     T|Simone|      3|
|  6| 45|     M|  Alax|      4|
|  7| 28|     M|Robert|      4|
+---+---+------+------+-------+
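To pin the rule down, here is a plain-Python sketch (not Spark) of how the section numbers in the table above are derived from the gender column, taken in SN order. The function name `assign_sections` is only illustrative and is not part of any library.

```python
def assign_sections(genders):
    """Return a 1-based section number for each value; the number
    increases by one every time the value differs from the previous one."""
    sections = []
    current = 0
    previous = object()  # sentinel that never equals a real value
    for g in genders:
        if g != previous:
            current += 1  # value changed (or first row): start a new section
        sections.append(current)
        previous = g
    return sections


# Gender column from the example, in SN order
genders = ["M", "M", "F", "F", "T", "M", "M"]
print(assign_sections(genders))  # [1, 1, 2, 2, 3, 4, 4]
```

Note that the second run of "M" gets section 4, not 1, so a plain groupBy on gender would not give this result.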
Unclear if you're looking for Python or Scala solutions, but they would be pretty similar - so here's a Scala solution using Window Functions:
And the following is the equivalent PySpark code: