Hive - multiple (average) count distincts over lay

Given the following source data (say the table name is user_activity):

+---------+-----------+------------+
| user_id | user_type | some_date  |
+---------+-----------+------------+
| 1       | a         | 2018-01-01 |
| 1       | a         | 2018-01-02 |
| 2       | a         | 2018-01-01 |
| 3       | a         | 2018-01-01 |
| 4       | b         | 2018-01-01 |
| 4       | b         | 2018-01-02 |
| 5       | b         | 2018-01-02 |
+---------+-----------+------------+

I'd like to get the following result:

+-----------+------------+---------------------+
| user_type | user_count | average_daily_users |
+-----------+------------+---------------------+
| a         | 3          | 2                   |
| b         | 2          | 1.5                 |
+-----------+------------+---------------------+

using a single query without multiple subqueries on the same table.

Using multiple queries, I can get:

user_count:

select
  user_type,
  count(distinct user_id)
from user_activity
group by user_type

For average_daily_users:

select
  user_type,
  avg(distinct_users) as average_daily_users
from (
  select
    count(distinct user_id) as distinct_users
  from user_activity
  group by user_type, some_date
)
group by user_type

But I can't seem to write a query that does what I want in one go. I am concerned about the performance impact of multiple subqueries on the same table (it will have to scan through the table twice... right?) I have a rather large data source and would like to minimize running time.

NOTE: The question is titled Hive because that is what I'm working with, but I think it is a generic enough SQL problem so I'm not ruling out answers in other languages.

NOTE2: This question shares details with my other question on partition by columns in window functions (for computing the average daily users column).

标签： sql hive group-by hiveql

1条回答

Evening l夕情丶

2楼-- · 2019-07-31 02:33

This should do what you want:

select ua.user_type,
       count(distinct ua.user_id) as user_count,
       count(distinct some_date || ':' || ua.user_id) / count(distinct some_date)
from user_activity ua
group by ua.user_type;

0人赞添加讨论(0) 举报

Hive - multiple (average) count distincts over lay

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间