I have the following dataframe showing the revenue of purchases.
+-------+--------+-------+
|user_id|visit_id|revenue|
+-------+--------+-------+
| 1| 1| 0|
| 1| 2| 0|
| 1| 3| 0|
| 1| 4| 100|
| 1| 5| 0|
| 1| 6| 0|
| 1| 7| 200|
| 1| 8| 0|
| 1| 9| 10|
+-------+--------+-------+
Ultimately I want the new column purch_revenue
to show the revenue generated by the purchase in every row.
As a workaround, I have also tried to introduce a purchase identifier purch_id
which is incremented each time a purchase was made. So this is listed just as a reference.
+-------+--------+-------+-------------+--------+
|user_id|visit_id|revenue|purch_revenue|purch_id|
+-------+--------+-------+-------------+--------+
| 1| 1| 0| 100| 1|
| 1| 2| 0| 100| 1|
| 1| 3| 0| 100| 1|
| 1| 4| 100| 100| 1|
| 1| 5| 0| 100| 2|
| 1| 6| 0| 100| 2|
| 1| 7| 200| 100| 2|
| 1| 8| 0| 100| 3|
| 1| 9| 10| 100| 3|
+-------+--------+-------+-------------+--------+
I've tried to use the lag/lead
function like this:
user_timeline = Window.partitionBy("user_id").orderBy("visit_id")
find_rev = fn.when(fn.col("revenue") > 0,fn.col("revenue"))\
.otherwise(fn.lead(fn.col("revenue"), 1).over(user_timeline))
df.withColumn("purch_revenue", find_rev)
This duplicates the revenue column if revenue > 0
and also pulls it up by one row. Clearly, I can chain this for a finite N, but that's not a solution.
- Is there a way to apply this recursively until
revenue > 0
? - Alternatively, is there a way to increment a value based on a condition? I've tried to figure out a way to do that but struggled to find one.