I'm trying to find the # active users over time on a daily basis.
A user is active when he has made more than 10 requests per week for 4 consecutive weeks.
ie. On Oct 31, 2014, a user is active if he has made more than 10 requests in total per week between:
- Oct 24-Oct 30, 2014 AND
- Oct 17-Oct 23, 2014 AND
- Oct 10-Oct 16, 2014 AND
- Oct 3-Oct 9, 2014
I have a table of requests
:
CREATE TABLE requests (
id text PRIMARY KEY, -- id of the request
amount bigint, -- sum of requests made by accounts_id to recipient_id,
-- aggregated on a daily basis based on "date"
accounts_id text, -- id of the user
recipient_id text, -- id of the recipient
date timestamp -- date that the request was made in YYYY-MM-DD
);
Sample values:
INSERT INTO requests2
VALUES
('1', 19, 'a1', 'b1', '2014-10-05 00:00:00'),
('2', 19, 'a2', 'b2', '2014-10-06 00:00:00'),
('3', 85, 'a3', 'b3', '2014-10-07 00:00:00'),
('4', 11, 'a1', 'b4', '2014-10-13 00:00:00'),
('5', 2, 'a2', 'b5', '2014-10-14 00:00:00'),
('6', 50, 'a3', 'b5', '2014-10-15 00:00:00'),
('7', 787323, 'a1', 'b6', '2014-10-17 00:00:00'),
('8', 33, 'a2', 'b8', '2014-10-18 00:00:00'),
('9', 14, 'a3', 'b9', '2014-10-19 00:00:00'),
('10', 11, 'a4', 'b10', '2014-10-19 00:00:00'),
('11', 1628, 'a1', 'b11', '2014-10-25 00:00:00'),
('13', 101, 'a2', 'b11', '2014-10-25 00:00:00');
Example output:
Date | # Active users
-----------+---------------
10-01-2014 | 600
10-02-2014 | 703
10-03-2014 | 891
Here's what I tried to do to find the number of active users for a certain date (e.g. 10-01-2014):
SELECT count(*)
FROM
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '2 weeks' AND '2014-10-01'::date - interval '1 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_1
JOIN
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '3 weeks' AND '2014-10-01'::date - interval '2 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_2 ON week_1.accounts_id = week_2.accounts_id
JOIN
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '4 weeks' AND '2014-10-01'::date - interval '3 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_3 ON week_2.accounts_id = week_3.accounts_id
JOIN
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '5 weeks' AND '2014-10-01'::date - interval '4 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_4 ON week_3.accounts_id = week_4.accounts_id
Since this is just the query to get the number for 1 day, I need to get this number on a daily basis over time. I think the idea is to do a join to get the date so I tried to do something like this:
SELECT week_1."Date_series",
count(*)
FROM
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '2 weeks' AND requests.date::date - interval '1 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_1
JOIN
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '3 weeks' AND requests.date::date - interval '2 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_2 ON week_1.accounts_id = week_2.accounts_id
AND week_1."Date_series" = week_2."Date_series"
JOIN
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '4 weeks' AND requests.date::date - interval '3 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_3 ON week_2.accounts_id = week_3.accounts_id
AND week_2."Date_series" = week_3."Date_series"
JOIN
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '5 weeks' AND requests.date::date - interval '4 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_4 ON week_3.accounts_id = week_4.accounts_id
AND week_3."Date_series" = week_4."Date_series"
GROUP BY week_1."Date_series"
However, I think I'm not getting the right answer and I'm not sure why. Any tips/ guidance/ pointers is much appreciated! :) :)
PS. I'm using Postgres 9.3
Here is a long answer how to make your queries short. :)
Table
Building on my table (before you provided table definition with different (odd!) data types:
Active Users for given day
The list of "active users" for one given day:
Step 1
In the innermost subquery
w
(for "week") build the bounds of the 4 weeks of interest from aCROSS JOIN
of the given day - 1 with the output ofgenerate_series(0-3)
.To add / subtract days to / from a
date
(not from a timestamp!) just add / subtractinteger
numbers. The expressionday - 7 * w
subtracts 0-3 times 7 days from the given date, arriving at the end dates for each week (w_end
).Subrtract another 6 days (not 7!) from each to compute the respective start (
w_start
).Additionally, keep the week number
w
(0-3) for the later aggregation.Step 2
In subquery
sub
join rows fromrequests
to the set of 4 weeks, where the date lies between start and end date.GROUP BY
the week numberw
and theaccounts_id
.Only weeks with more than 10 requests total qualify.
Step 3
In the outer
SELECT
count the number of weeks each user (accounts_id
) qualified. Must be 4 to qualify as "active user"Count of active users per day
This is dynamite.
Wrapped in in a simple SQL function to simplify general use, but the query can be used on its own just as well:
The function takes any day (
_now
), "today" by default, and the number of days (_days
) in the result, 3 by default. Call:Or without parameters to use defaults:
The approach is different from the first query.
SQL Fiddle with both queries and variants for your table definition.
Step 0
In the CTE
r
pre-aggregate amounts per(accounts_id, date)
for only the period of interest, for better performance. The table is only scanned once, the suggested index (see blow) will kick in here.Step 1
In the inner subquery
d
generate the necessary list of days:27 + _days
rows, where_days
is the desired number of rows in the output, effectively 28 days or more.While being at it, compute the day of the week (
dow
) to be used for aggregating in step 3.i%7
coincides with weekly intervals, the query works for any interval, though.In the inner subquery
a
generate a unique list of users (accounts_id
) that exist in CTEr
and pass some first superficial tests (sufficient rows spanning sufficient time with sufficient total requests).Step 2
Generate a Cartesian product from
d
anda
with aCROSS JOIN
to have one row for every relevant day for every relevant user.LEFT JOIN
tor
to append the amount of requests (if any). NoWHERE
condition, we want every day in the result, even if there are no active users at all.Compute the total amount for the past week (
w_amount
) in the same step using a Window functions with a custom frame. Example:Step 3
Cut off the last 6 days now; which is optional and may or may not help performance. Test it:
WHERE date >= _now - (21 + _days)
Count the weeks where the minimum amount is met (
w_ct
) in a similar window function, this time partitioned bydow
additionally to only have same weekdays for the past 4 weeks in the frame (which carry the sum of the respective past week). The expressioncount(w_amount > 10 OR NULL)
only counts rows with more than 10 requests. Detailed explanation:Step 4
In the outer
SELECT
group bydate
and count users that passed all 4 weeks (count(w_ct = 4 OR NULL)
). Add 1 to the date to compensate off-by-1,ORDER
andLIMIT
to the requested number of days.Performance and outlook
The perfect index for both queries would be:
Performance should be good, but get even (much) better with the upcoming Postgres 9.4, due to the new moving aggregate support:
Moving-aggregate support in the Postgres Wiki.
Moving aggregates in the 9.4 manual
Aside: don't call a
timestamp
column "date", it's atimestamp
, not adate
. Better yet, never use basic type names likedate
ortimestamp
as identifier. Ever.