I have a Hive table with two date-time columns: the start and the finish of a "session". Here is a sample of such a table:
+----------------------+----------------------+--+
| start_time | end_time |
+----------------------+----------------------+--+
| 2017-01-01 00:24:52 | 2017-01-01 00:25:20 |
| 2017-01-01 00:31:11 | 2017-01-01 10:31:15 |
| 2017-01-01 10:31:15 | 2017-01-01 20:40:53 |
| 2017-01-01 20:40:53 | 2017-01-01 20:40:53 |
| 2017-01-01 10:31:15 | 2017-01-01 10:31:15 |
| 2017-01-01 07:09:34 | 2017-01-01 07:29:00 |
| 2017-01-01 11:36:41 | 2017-01-01 15:32:00 |
| 2017-01-01 07:29:00 | 2017-01-01 07:34:30 |
| 2017-01-01 11:06:30 | 2017-01-01 11:36:41 |
| 2017-01-01 07:45:00 | 2017-01-01 07:50:00 |
+----------------------+----------------------+--+
There are a lot of sessions. I need to get a dataset that gives the number of active sessions on a half-hour time grid over some interval, like the following:
+----------------------+--------------+--+
| time | sessions_num |
+----------------------+--------------+--+
| 2018-07-04 00:30:00 | 85 |
| 2018-07-04 01:00:00 | 86 |
| 2018-07-04 01:30:00 | 84 |
| 2018-07-04 02:00:00 | 85 |
| 2018-07-04 02:30:00 | 84 |
| 2018-07-04 03:00:00 | 84 |
| 2018-07-04 03:30:00 | 84 |
| 2018-07-04 04:00:00 | 84 |
| 2018-07-04 04:30:00 | 84 |
| 2018-07-04 05:00:00 | 84 |
| 2018-07-04 05:30:00 | 84 |
| 2018-07-04 06:00:00 | 84 |
| 2018-07-04 06:30:00 | 85 |
| 2018-07-04 07:00:00 | 85 |
| 2018-07-04 07:30:00 | 85 |
| 2018-07-04 08:00:00 | 85 |
| 2018-07-04 08:30:00 | 85 |
| 2018-07-04 09:00:00 | 83 |
| 2018-07-04 09:30:00 | 82 |
| 2018-07-04 10:00:00 | 82 |
| 2018-07-04 10:30:00 | 83 |
| 2018-07-04 11:00:00 | 82 |
| 2018-07-04 11:30:00 | 82 |
| 2018-07-04 12:00:00 | 83 |
+----------------------+--------------+--+
What is the Apache Hive or Apache Spark (or maybe some other) way to produce the second table from the first one?
You can do that with the DataFrame window function, but it will require some preprocessing of your data. A PySpark example:
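First, load the sessions into a DataFrame and cast the string columns to timestamps. A minimal sketch; the rows and column names come from the question, everything else (session name `df`, use of `createDataFrame` instead of reading the Hive table) is an assumption for illustration:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, TimestampType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ("2017-01-01 00:24:52", "2017-01-01 00:25:20"),
        ("2017-01-01 00:31:11", "2017-01-01 10:31:15"),
        ("2017-01-01 10:31:15", "2017-01-01 20:40:53"),
        ("2017-01-01 20:40:53", "2017-01-01 20:40:53"),
        ("2017-01-01 10:31:15", "2017-01-01 10:31:15"),
        ("2017-01-01 07:09:34", "2017-01-01 07:29:00"),
        ("2017-01-01 11:36:41", "2017-01-01 15:32:00"),
        ("2017-01-01 07:29:00", "2017-01-01 07:34:30"),
        ("2017-01-01 11:06:30", "2017-01-01 11:36:41"),
        ("2017-01-01 07:45:00", "2017-01-01 07:50:00"),
    ],
    ["start_time", "end_time"],
)

# Cast the string columns to proper timestamps.
df = (df.withColumn("start_time", F.to_timestamp("start_time"))
        .withColumn("end_time", F.to_timestamp("end_time")))
```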
Now we create a new column which contains a list of items, one for every 30 minutes of a session. Imagine a client raising an event every 30 minutes from the session's beginning, plus one more event if the session's end falls into a different window than the last event (for example, begin: 2017-01-01 00:24:52, end: 2017-01-01 00:25:20 leads to one event, while begin: 2017-01-01 07:29:00, end: 2017-01-01 07:34:30 raises two events):
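A sketch of that step with a Python UDF; the helper names `generate_events` and `floor_30min` are mine, not a library API:

```python
from datetime import timedelta

def floor_30min(ts):
    # Truncate a timestamp down to its 30-minute window boundary.
    return ts.replace(minute=(ts.minute // 30) * 30, second=0, microsecond=0)

def generate_events(start, end):
    # One event at the session start, then one every 30 minutes while
    # the session is still running.
    events = []
    current = start
    while current < end:
        events.append(current)
        current += timedelta(minutes=30)
    # Zero-length sessions still count once, and a session whose end
    # crosses into a later half-hour window gets one extra event there.
    if not events or floor_30min(events[-1]) != floor_30min(end):
        events.append(end)
    return events

events_udf = F.udf(generate_events, ArrayType(TimestampType()))

df = df.withColumn("toExplode", events_udf("start_time", "end_time"))
```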
Now we can 'explode' the toExplode column to create one row for every event:
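For example, with the standard `explode` from `pyspark.sql.functions`:

```python
# One row per (session, half-hour window) event.
exploded = df.select(F.explode("toExplode").alias("event_time"))
```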
And finally we can apply the DataFrame window function to get the desired result:
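For instance, grouping the events into 30-minute tumbling windows with `pyspark.sql.functions.window` and counting; whether `time` should be the window's start or its end depends on how you want to label the grid:

```python
result = (
    exploded
    .groupBy(F.window("event_time", "30 minutes"))
    .agg(F.count("*").alias("sessions_num"))
    # Use window.end instead if the grid should be labelled by the
    # end of each half-hour window, as in the sample output above.
    .select(F.col("window.start").alias("time"), "sessions_num")
    .orderBy("time")
)
result.show(truncate=False)
```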