Given the following source data (say the table name is user_activity
):
+---------+-----------+------------+
| user_id | user_type | some_date |
+---------+-----------+------------+
| 1 | a | 2018-01-01 |
| 1 | a | 2018-01-02 |
| 2 | a | 2018-01-01 |
| 3 | a | 2018-01-01 |
| 4 | b | 2018-01-01 |
| 4 | b | 2018-01-02 |
| 5 | b | 2018-01-02 |
+---------+-----------+------------+
I'd like to get the following result:
+-----------+------------+---------------------+
| user_type | user_count | average_daily_users |
+-----------+------------+---------------------+
| a | 3 | 2 |
| b | 2 | 1.5 |
+-----------+------------+---------------------+
using a single query without multiple subqueries on the same table.
Using multiple queries, I can get:
user_count
:select user_type, count(distinct user_id) from user_activity group by user_type
For
average_daily_users
:select user_type, avg(distinct_users) as average_daily_users from ( select count(distinct user_id) as distinct_users from user_activity group by user_type, some_date ) group by user_type
But I can't seem to write a query that does what I want in one go. I am concerned about the performance impact of multiple subqueries on the same table (it will have to scan through the table twice... right?) I have a rather large data source and would like to minimize running time.
NOTE: The question is titled Hive because that is what I'm working with, but I think it is a generic enough SQL problem so I'm not ruling out answers in other languages.
NOTE2: This question shares details with my other question on partition by
columns in window functions (for computing the average daily users column).