Spark Window Functions That Depend on Themselves

Posted 2019-02-05 15:28

Question:

Say I have a column of sorted timestamps in a DataFrame. I want to write a function that adds a column to this DataFrame that cuts the timestamps into sequential time slices according to the following rules:

  • start at the first row and keep iterating down to the end
  • for each row, if you have walked n rows in the current group, OR the current group spans more than time interval t, make a cut
  • return a new column with the group assignment for each row, which should be an increasing integer

In English: each group should contain no more than n rows and should span no more than t time.

For example: (Using integers for timestamps to simplify)

INPUT

     time
---------
        1
        2
        3
        5
       10
      100
     2000
     2001
     2002
     2003

OUTPUT (after slice function with n = 3 and t = 5)

     time | group
----------|------
        1 |     1
        2 |     1
        3 |     1
        5 |     2 // cut because there were no cuts in the last 3 rows
       10 |     2
      100 |     3 // cut because 100 - 5 > 5
     2000 |     4 // cut because 2000 - 100 > 5
     2001 |     4
     2002 |     4
     2003 |     5 // cut because there were no cuts in the last 3 rows

I have a feeling this can be done with window functions in Spark. After all, window functions were created to help developers compute things like moving averages: you calculate an aggregate (in this case an average) of a column (stock price) per window of n rows.

The same approach should work here. For each row, if the last n rows contain no cut, or the timespan between the last cut and the current timestamp is greater than t, then cut = true; otherwise cut = false. But what I can't figure out is how to make the window function aware of its own previous outputs. That would be like a moving average where each row's value depends on the previous moving average.
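To make the contrast concrete: a moving average is stateless with respect to its own outputs — each row is computed purely from a fixed window of input rows. A plain-Python sketch of what Spark computes with something like `avg("price").over(Window.orderBy("time").rowsBetween(-(n - 1), 0))`:

```python
def moving_average(values, n):
    """Average each value with up to n-1 preceding values.
    Note: each output depends only on the INPUT rows in the
    window, never on earlier OUTPUT rows -- which is exactly
    why the group-cutting problem doesn't fit this mold."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - n + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

moving_average([1, 2, 3, 4], n=2)
# → [1.0, 1.5, 2.5, 3.5]
```

The group-cutting rule, by contrast, needs the position of the previous cut (a previous output) to decide the current row, which is the self-reference the window-function model doesn't provide.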