Count the number of sessions if the beginning and

This question already has an answer here:

How to group by time interval in Spark SQL 2 answers

I have a Hive table with two date-time columns: the start and the end of a "session". Here is a sample of the table:

+----------------------+----------------------+--+
| start_time           | end_time             |
+----------------------+----------------------+--+
| 2017-01-01 00:24:52  | 2017-01-01 00:25:20  |
| 2017-01-01 00:31:11  | 2017-01-01 10:31:15  |
| 2017-01-01 10:31:15  | 2017-01-01 20:40:53  |
| 2017-01-01 20:40:53  | 2017-01-01 20:40:53  |
| 2017-01-01 10:31:15  | 2017-01-01 10:31:15  |
| 2017-01-01 07:09:34  | 2017-01-01 07:29:00  |
| 2017-01-01 11:36:41  | 2017-01-01 15:32:00  |
| 2017-01-01 07:29:00  | 2017-01-01 07:34:30  |
| 2017-01-01 11:06:30  | 2017-01-01 11:36:41  |
| 2017-01-01 07:45:00  | 2017-01-01 07:50:00  |
+----------------------+----------------------+--+

There are a lot of sessions. I need a dataset that gives the number of sessions on a half-hour time grid over some interval, as follows:

+----------------------+--------------+--+
| time                 | sessions_num |
+----------------------+--------------+--+
| 2018-07-04 00:30:00  |          85  |
| 2018-07-04 01:00:00  |          86  |
| 2018-07-04 01:30:00  |          84  |
| 2018-07-04 02:00:00  |          85  |
| 2018-07-04 02:30:00  |          84  |
| 2018-07-04 03:00:00  |          84  |
| 2018-07-04 03:30:00  |          84  |
| 2018-07-04 04:00:00  |          84  |
| 2018-07-04 04:30:00  |          84  |
| 2018-07-04 05:00:00  |          84  |
| 2018-07-04 05:30:00  |          84  |
| 2018-07-04 06:00:00  |          84  |
| 2018-07-04 06:30:00  |          85  |
| 2018-07-04 07:00:00  |          85  |
| 2018-07-04 07:30:00  |          85  |
| 2018-07-04 08:00:00  |          85  |
| 2018-07-04 08:30:00  |          85  |
| 2018-07-04 09:00:00  |          83  |
| 2018-07-04 09:30:00  |          82  |
| 2018-07-04 10:00:00  |          82  |
| 2018-07-04 10:30:00  |          83  |
| 2018-07-04 11:00:00  |          82  |
| 2018-07-04 11:30:00  |          82  |
| 2018-07-04 12:00:00  |          83  |
+----------------------+--------------+--+

How can I produce the second table from the first one with Apache Hive, Apache Spark, or some other tool?

You can do that with the DataFrame window function, but it requires some preprocessing of your data. PySpark example:

# creating example dataframe (assumes an active SparkSession named spark)
from pyspark.sql.functions import to_timestamp

l = [
    ('2017-01-01 00:24:52', '2017-01-01 00:25:20'),
    ('2017-01-01 00:31:11', '2017-01-01 10:31:15'),
    ('2017-01-01 10:31:15', '2017-01-01 20:40:53'),
    ('2017-01-01 20:40:53', '2017-01-01 20:40:53'),
    ('2017-01-01 10:31:15', '2017-01-01 10:31:15'),
    ('2017-01-01 07:09:34', '2017-01-01 07:29:00'),
    ('2017-01-01 11:36:41', '2017-01-01 15:32:00'),
    ('2017-01-01 07:29:00', '2017-01-01 07:34:30'),
    ('2017-01-01 11:06:30', '2017-01-01 11:36:41'),
    ('2017-01-01 07:45:00', '2017-01-01 07:50:00'),
]
df = spark.createDataFrame(l, ['begin', 'end'])
# convert the string columns to timestamps
df = df.select(to_timestamp(df.begin).alias('begin'), to_timestamp(df.end).alias('end'))

Now we create a new column that contains a list of events, one for every 30 minutes of a session. Imagine that a client raises an event at the beginning of the session and every 30 minutes after that, plus one more at the end of the session if the end falls into a later window than the last event. For example, begin: 2017-01-01 00:24:52, end: 2017-01-01 00:25:20 leads to one event, while begin: 2017-01-01 07:29:00, end: 2017-01-01 07:34:30 raises two events:

from datetime import timedelta

from pyspark.sql.functions import array, explode, udf, window
from pyspark.sql.types import ArrayType, TimestampType

def generateRows(arr):
    # arr is [begin, end] as datetime objects
    begin, end = arr[0], arr[1]
    li = [begin]

    # one event every 30 minutes, starting at begin and staying strictly before end
    while li[-1] + timedelta(minutes=30) < end:
        li.append(li[-1] + timedelta(minutes=30))

    # start of the 30-minute window that contains the last event
    rounded = li[-1] - timedelta(minutes=li[-1].minute % 30, seconds=li[-1].second, microseconds=li[-1].microsecond)

    # if the session ends after that window has closed, add one more event at end
    if rounded + timedelta(minutes=30) < end:
        li.append(end)

    return li

generateRows_udf = udf(generateRows, ArrayType(TimestampType()))

dftoExplode = df.withColumn('toExplode', generateRows_udf(array(df.begin, df.end)))
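
To check the two examples from the paragraph above without a Spark session, the same event logic can be run as plain Python. This is only a sketch: generate_events is a hypothetical stand-alone copy of the UDF body that takes begin and end directly instead of an array.

```python
from datetime import datetime, timedelta

HALF_HOUR = timedelta(minutes=30)

def generate_events(begin, end):
    """Hypothetical plain-Python copy of the UDF logic (not part of the answer's code)."""
    events = [begin]
    # one event every 30 minutes while the next one is still strictly before end
    while events[-1] + HALF_HOUR < end:
        events.append(events[-1] + HALF_HOUR)
    last = events[-1]
    # start of the half-hour window that contains the last event
    window_start = last - timedelta(
        minutes=last.minute % 30, seconds=last.second, microseconds=last.microsecond
    )
    # the session ends after that window has closed: add one event at end
    if window_start + HALF_HOUR < end:
        events.append(end)
    return events

fmt = "%Y-%m-%d %H:%M:%S"
short = generate_events(
    datetime.strptime("2017-01-01 00:24:52", fmt),
    datetime.strptime("2017-01-01 00:25:20", fmt),
)
crossing = generate_events(
    datetime.strptime("2017-01-01 07:29:00", fmt),
    datetime.strptime("2017-01-01 07:34:30", fmt),
)
print(len(short), len(crossing))
```

The first session stays inside the 00:00 window and yields one event. The second one starts in the 07:00 window and ends in the 07:30 window, so it yields two.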

Now we can 'explode' the toExplode column to create one row for every event:

df_exploded = dftoExplode.withColumn('EventSessionOpen', explode('toExplode'))
df_exploded = df_exploded.drop(df_exploded.toExplode)

Finally, we can apply the DataFrame window function to get the desired result:

result = df_exploded.groupBy(window(df_exploded.EventSessionOpen, "30 minutes")).count().orderBy("window")
result.show(truncate=False)
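
Conceptually, the grouping step floors every event timestamp to the start of its 30-minute window and counts the events per window. The plain-Python sketch below illustrates that idea on a few events derived from the sample sessions. It assumes windows aligned to :00 and :30, and window_start is a hypothetical helper, not a Spark function.

```python
from collections import Counter
from datetime import datetime, timedelta

def window_start(ts):
    """Floor a timestamp to the start of its half-hour window (:00 or :30)."""
    return ts - timedelta(
        minutes=ts.minute % 30, seconds=ts.second, microseconds=ts.microsecond
    )

# events produced by three of the sample sessions:
# 00:24:52-00:25:20, 07:09:34-07:29:00 and 07:29:00-07:34:30
events = [
    datetime(2017, 1, 1, 0, 24, 52),
    datetime(2017, 1, 1, 7, 9, 34),
    datetime(2017, 1, 1, 7, 29, 0),
    datetime(2017, 1, 1, 7, 34, 30),
]

# count events per window, which is what groupBy(window(...)).count() does
counts = Counter(window_start(ts) for ts in events)
for start in sorted(counts):
    print(start, counts[start])
```

With these four events, the 00:00 window gets one session, the 07:00 window two, and the 07:30 window one.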
