What is the proper way to specify a window interval in Spark SQL using two predefined boundaries?
I am trying to sum up values from my table over a window of "between 3 hours ago and 2 hours ago".
When I run this query:
select *, sum(value) over (
partition by a, b
order by cast(time_value as timestamp)
range between interval 2 hours preceding and current row
) as sum_value
from my_temp_table;
That works. I get the results I expect, i.e. sums of the values that fall into a 2-hour rolling window.
Now, what I need is for that rolling window not to be bound to the current row, but to cover only the rows between 3 hours ago and 2 hours ago. I tried:
select *, sum(value) over (
partition by a, b
order by cast(time_value as timestamp)
range between interval 3 hours preceding and 2 hours preceding
) as sum_value
from my_temp_table;
But I get this error: extraneous input 'hours' expecting {'PRECEDING', 'FOLLOWING'}
I also tried:
select *, sum(value) over (
partition by a, b
order by cast(time_value as timestamp)
range between interval 3 hours preceding and interval 2 hours preceding
) as sum_value
from my_temp_table;
but then I get a different error: scala.MatchError: CalendarIntervalType (of class org.apache.spark.sql.types.CalendarIntervalType$)
The third option I tried is:
select *, sum(value) over (
partition by a, b
order by cast(time_value as timestamp)
range between interval 3 hours preceding and 2 preceding
) as sum_value
from my_temp_table;
and, as one would expect, it doesn't work: cannot resolve 'RANGE BETWEEN interval 3 hours PRECEDING AND 2 PRECEDING' due to data type mismatch
I am having difficulty finding documentation for the interval type: this link doesn't say enough, and the other information I found is rather half-baked.
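For reference, here is a possible workaround sketch rather than a confirmed answer: order by the timestamp converted to epoch seconds, so that both frame boundaries become plain numbers (10800 and 7200 seconds for 3 and 2 hours). It assumes that casting a timestamp to long yields seconds since the epoch and that second-level precision is enough; it has not been checked against every Spark version.

```sql
-- Sketch: order by epoch seconds so both frame bounds are plain numeric offsets.
-- Assumption: time_value casts cleanly to timestamp, and second precision suffices.
select *, sum(value) over (
  partition by a, b
  order by cast(cast(time_value as timestamp) as long)
  range between 10800 preceding and 7200 preceding  -- 3 hours and 2 hours, in seconds
) as sum_value
from my_temp_table;
```

With numeric bounds the frame no longer involves an interval type on either side, which is where the second and third attempts above fail. It would still be good to know whether interval bounds can be used on both sides directly.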