Hive query generating identifiers for a sequence o

Let's say I have the following hive table as input, let's call it connections:

userid  | timestamp   
--------|-------------
1       | 1433258019  
1       | 1433258020
2       | 1433258080
2       | 1433258083
2       | 1433258088
2       | 1433258170
[...]   | [...]

With the following query:

SELECT
    userid,
    timestamp,
    timestamp - LAG(timestamp, 1, 0) OVER w AS timediff
    CASE
      WHEN timediff > 60
      THEN 'new_session'
      ELSE 'same_session'
    END AS session_state
FROM connections
WINDOW w PARTITION BY userid ORDER BY timestamp ASC;

I'm generating the following output:

userid  | timestamp   | timediff   | session_state
--------|-------------|------------|---------------
1       | 1433258019  | 1433258019 | new_session
1       | 1433258020  | 1          | same_session
2       | 1433258080  | 1433258080 | new_session
2       | 1433258083  | 3          | same_session
2       | 1433258088  | 5          | same_session
2       | 1433258170  | 82         | new_session
[...]   | [...]       | [...]      | [...]

How would I do to generate that:

userid  | timestamp   | timediff   | sessionid
--------|-------------|------------------------------
1       | 1433258019  | 1433258019 | user1-session-1
1       | 1433258020  | 1          | user1-session-1
2       | 1433258080  | 1433258080 | user2-session-1
2       | 1433258083  | 3          | user2-session-1
2       | 1433258088  | 5          | user2-session-1
2       | 1433258170  | 82         | user2-session-2
[...]   | [...]       | [...]      | [...]

Is that possible using only HQL and "famous" UDFs (I'd rather not use custom UDFs or reducer scripts) ?

标签： hadoop hive windowing

3条回答

够拽才男人

2楼-- · 2020-04-18 05:01

This works:

SELECT 
  userid,
  timestamp,
  timediff,
  CONCAT(
    'user',
     userid,
     '-',
     'session-',
     CAST(timediff / 60 AS INT) + 1
  ) AS session_id
  FROM (
    SELECT   
      userid,
      timestamp,
      timestamp - LAG(timestamp, 1, timestamp) OVER w AS timediff
    FROM connections
    WINDOW w AS (
      PARTITION BY userid
      ORDER BY timestamp ASC
    )
) a;

OUTPUT:

userid  timestamp   timediff    session_state
1       1433258019  0.0         user1-session-1
1       1433258020  1.0         user1-session-1
2       1433258080  0.0         user2-session-1
2       1433258083  3.0         user2-session-1
2       1433258088  5.0         user2-session-1
2       1433258170  82.0        user2-session-2
3       1433258270  0.0         user3-session-1

you can try something like this if timediff is not required:

select userid,timestamp ,session_count+ concat('user',userid,'-','session-',cast(LAG(session_count-1,1,0) over w1 as string)) AS session_state
--LAG(session_count-1,1,0) over w1 AS session_count_new FROM (select userid, timestamp, timediff, cast (timediff/60 as int)+1 as session_count

0人赞添加讨论(0) 举报

霸刀☆藐视天下

3楼-- · 2020-04-18 05:13

Interesting question. Per your comment to @Madhu, I added the line 2 1433258172 to your example. What you need is to increment every time timediff > 60 is satisfied. The easiest way to do this is to flag it and then cumulatively sum over the window.

Query:

select userid
  , timestamp
  , concat('user', userid, '-session-', s_sum) sessionid
from (
  select *
    , sum( counter ) over (partition by userid
                           order by timestamp asc
                           rows between unbounded preceding and current row) s_sum
  from (
    select *
      , case when timediff > 60 then 1 else 0 end as counter
    from (
      select userid
        , timestamp
        , timestamp - lag(timestamp, 1, 0) over (partition by userid
                                                 order by timestamp asc) timediff
      from connections ) x ) y ) z

Output:

1   1433258019  user1-session-1
1   1433258020  user1-session-1
2   1433258080  user2-session-1
2   1433258083  user2-session-1
2   1433258088  user2-session-1
2   1433258170  user2-session-2
2   1433258172  user2-session-2

0人赞添加讨论(0) 举报

Animai°情兽

4楼-- · 2020-04-18 05:15

Use the following select concat_ws('-',name, city) from employee; the first parameter of concat_ws is separator. name and city are column names for employee table. See that they are of type strings. You can look here for more

0人赞添加讨论(0) 举报

Hive query generating identifiers for a sequence o

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间