Hive - multiple (average) count distincts over lay

2019-07-31 01:51发布

Given the following source data (say the table name is user_activity):

+---------+-----------+------------+
| user_id | user_type | some_date  |
+---------+-----------+------------+
| 1       | a         | 2018-01-01 |
| 1       | a         | 2018-01-02 |
| 2       | a         | 2018-01-01 |
| 3       | a         | 2018-01-01 |
| 4       | b         | 2018-01-01 |
| 4       | b         | 2018-01-02 |
| 5       | b         | 2018-01-02 |
+---------+-----------+------------+

I'd like to get the following result:

+-----------+------------+---------------------+
| user_type | user_count | average_daily_users |
+-----------+------------+---------------------+
| a         | 3          | 2                   |
| b         | 2          | 1.5                 |
+-----------+------------+---------------------+

using a single query without multiple subqueries on the same table.


Using multiple queries, I can get:

  • user_count:

    select
      user_type,
      count(distinct user_id)
    from user_activity
    group by user_type
    
  • For average_daily_users:

    select
      user_type,
      avg(distinct_users) as average_daily_users
    from (
      select
        count(distinct user_id) as distinct_users
      from user_activity
      group by user_type, some_date
    )
    group by user_type
    

But I can't seem to write a query that does what I want in one go. I am concerned about the performance impact of multiple subqueries on the same table (it will have to scan through the table twice... right?) I have a rather large data source and would like to minimize running time.

NOTE: The question is titled Hive because that is what I'm working with, but I think it is a generic enough SQL problem so I'm not ruling out answers in other languages.

NOTE2: This question shares details with my other question on partition by columns in window functions (for computing the average daily users column).

1条回答
Evening l夕情丶
2楼-- · 2019-07-31 02:33

This should do what you want:

select ua.user_type,
       count(distinct ua.user_id) as user_count,
       count(distinct some_date || ':' || ua.user_id) / count(distinct some_date)
from user_activity ua
group by ua.user_type;
查看更多
登录 后发表回答