Here is my exact requirement. I have to add a new column named ("DAYS_TO_NEXT_PD_ENCOUNTER"). As the name indicates, the values in the new column should have a difference of RANK that has claim_typ as 'PD' and the current row. For one ID, it can occur in-between any of the 'RV's and 'RJ's. For the rows that are present after the first occurence of claim_typ as 'PD', the difference should be null as shown below:
The API 'last' works if the clm_typ 'PD' occurs as the last element. It will not be the case always. For one ID, it can occur in-between any of the 'RV's and 'RJ's.
+----------+--------+---------+----+-------------------------+
| ID | WEEK_ID|CLAIM_TYP|RANK|DAYS_TO_NEXT_PD_ENCOUNTER|
+----------+--------+---------+----+-------------------------+
| 30641314|20180209| RV| 1| 5|
| 30641314|20180209| RJ| 2| 4|
| 30641314|20180216| RJ| 3| 3|
| 30641314|20180216| RJ| 4| 2|
| 30641314|20180216| RJ| 5| 1|
| 30641314|20180216| PD| 6| 0|
| 48115882|20180209| RV| 1| 3|
| 48115882|20180209| RV| 2| 2|
| 48115882|20180209| RV| 3| 1|
| 48115882|20180209| PD| 4| 0|
| 48115882|20180216| RJ| 5| null|
| 48115882|20180302| RJ| 6| null|
+----------+--------+---------+----+-------------------------+
Shown here is a PySpark solution.
You can use conditional aggregation with
max(when...))
to get the necessary difference of ranks with the first 'PD' row. After getting the difference, use awhen...
tonull
out rows with negative ranks as they all occur after the first 'PD' row.