I have data from 1st Jan 2017 to 7th Jan 2017, which is one week, and I want a weekly aggregate. I used the window function in the following manner:
import org.apache.spark.sql.functions.{col, sum, window}

// Group into 7-day tumbling windows and sum the values
val df_v_3 = df_v_2.groupBy(window(col("DateTime"), "7 day"))
  .agg(sum("Value") as "aggregate_sum")
  .select("window.start", "window.end", "aggregate_sum")
The data in my dataframe looks like this:
DateTime,value
2017-01-01T00:00:00.000+05:30,1.2
2017-01-01T00:15:00.000+05:30,1.30
--
2017-01-07T23:30:00.000+05:30,1.43
2017-01-07T23:45:00.000+05:30,1.4
I am getting the following output:
2016-12-29T05:30:00.000+05:30,2017-01-05T05:30:00.000+05:30,723.87
2017-01-05T05:30:00.000+05:30,2017-01-12T05:30:00.000+05:30,616.74
It shows that my window starts from 29th Dec 2016, but the actual data starts from 1st Jan 2017. Why is this offset occurring?
The solution with the Python API looks a bit more intuitive, since the window function works with the following options: window(timeColumn, windowDuration, slideDuration=None, startTime=None) (see: https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/functions.html). There is no need for a workaround with the sliding duration; I used a 3-day "delay" as startTime to match the desired tumbling window and got the same result.
For tumbling windows like this it is possible to set an offset to the starting time; more information can be found in the blog here. A sliding window is used, but by setting both the window duration and the sliding duration to the same value, it behaves the same as a tumbling window with a starting offset. The syntax is window(timeColumn, windowDuration, slideDuration, startTime). With your values I found that an offset of 64 hours would give a starting time of 2017-01-01 00:00:00, which lines up the weekly window with the start of your data.
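As a sketch with the values above (the df_v_2 dataframe and the DateTime / Value column names are assumptions carried over from the question):

import org.apache.spark.sql.functions.{col, sum, window}

// Window duration and sliding duration are both "7 days", so this behaves
// like a tumbling window; per the values above, the "64 hours" startTime
// shifts the window start to 2017-01-01 00:00:00.
val df_v_3 = df_v_2
  .groupBy(window(col("DateTime"), "7 days", "7 days", "64 hours"))
  .agg(sum("Value") as "aggregate_sum")
  .select("window.start", "window.end", "aggregate_sum")

Since the window duration equals the sliding duration, each row still falls into exactly one window; the offset only changes where the window boundaries fall.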