I'm trying to implement a cumulative product in Spark with Scala, but I really don't know how to do it. I have the following dataframe:
Input data:
+--+--+--------+----+
|A |B |  date  | val|
+--+--+--------+----+
|rr|gg|20171103|   2|
|hh|jj|20171103|   3|
|rr|gg|20171104|   4|
|hh|jj|20171104|   5|
|rr|gg|20171105|   6|
|hh|jj|20171105|   7|
+--+--+--------+----+
And I would like to have the following output
Output data:
+--+--+--------+-----+
|A |B |  date  | val |
+--+--+--------+-----+
|rr|gg|20171105|  48 | // 2 * 4 * 6
|hh|jj|20171105| 105 | // 3 * 5 * 7
+--+--+--------+-----+
If you have any idea how to do it, it would be really helpful :)
Thanks a lot
You can solve this using either collect_list + a UDF, or a UDAF. A UDAF may be more efficient, but it is harder to implement due to the local aggregation.
If you have a dataframe like this:
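Here is a minimal reconstruction of the question's data as a sketch (column names taken from the question; `spark` is assumed to be an active SparkSession):

```scala
import spark.implicits._

val df = Seq(
  ("rr", "gg", "20171103", 2),
  ("hh", "jj", "20171103", 3),
  ("rr", "gg", "20171104", 4),
  ("hh", "jj", "20171104", 5),
  ("rr", "gg", "20171105", 6),
  ("hh", "jj", "20171105", 7)
).toDF("A", "B", "date", "val")
```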
You can invoke a UDF:
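A sketch of that route, assuming the grouping keys are `A` and `B` and that the output should carry the latest date per group, as in the desired output:

```scala
import org.apache.spark.sql.functions.{collect_list, max, udf}

// Multiply the collected values; Long guards against overflow for larger groups
val product = udf((values: Seq[Int]) => values.map(_.toLong).product)

df.groupBy("A", "B")
  .agg(
    max("date").as("date"),                 // keep the latest date per group
    product(collect_list("val")).as("val")  // product of all values per group
  )
  .show()
```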
As long as the numbers are strictly positive (0 can be handled as well, if present, using coalesce), as in your example, the simplest solution is to compute the sum of logarithms and take the exponential:
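A minimal sketch of that approach, assuming the same `df` as above; `exp(sum(log(...)))` computes the per-group product, and `max("date")` keeps the latest date per group as in the desired output:

```scala
import org.apache.spark.sql.functions.{exp, log, max, sum}

df.groupBy("A", "B")
  .agg(
    max("date").as("date"),
    exp(sum(log("val"))).as("val")  // product computed as the exponential of a sum of logs
  )
  .show()
```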
Since this uses floating-point arithmetic the result won't be exact, but after rounding it should be good enough for the majority of applications.
If that's not enough you can define a UserDefinedAggregateFunction or an Aggregator (How to define and use a User-Defined Aggregate Function in Spark SQL?) or use the functional API with reduceGroups:
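A sketch of the reduceGroups route over the typed Dataset (tuple columns are mapped by ordinal, so `_1` to `_4` correspond to A, B, date, val):

```scala
import spark.implicits._

df.as[(String, String, String, Int)]
  .groupByKey { case (a, b, _, _) => (a, b) }
  .reduceGroups { (x, y) =>
    // keep the later date, multiply the values
    (x._1, x._2, if (x._3 > y._3) x._3 else y._3, x._4 * y._4)
  }
  .map(_._2)  // drop the grouping key, keep the reduced row
  .show()
```

Note that the string comparison on the date only works because the `yyyyMMdd` format sorts lexicographically.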