How to add a running count to rows in a 'strea

2019-02-14 17:55发布

Thanks to Mike for the suggestion to add the create/insert statements.

create table test (
  pid integer not null,
  date date not null,
  primary key (pid, date)
);

insert into test values
  (1,'2014-10-1')
, (1,'2014-10-2')
, (1,'2014-10-3')
, (1,'2014-10-5')
, (1,'2014-10-7')
, (2,'2014-10-1')
, (2,'2014-10-2')
, (2,'2014-10-3')
, (2,'2014-10-5')
, (2,'2014-10-7');

I want to add a new column that is 'days in current streak' so the result would look like:

pid    | date      | in_streak
-------|-----------|----------
1      | 2014-10-1 | 1
1      | 2014-10-2 | 2
1      | 2014-10-3 | 3
1      | 2014-10-5 | 1
1      | 2014-10-7 | 1
2      | 2014-10-2 | 1
2      | 2014-10-3 | 2
2      | 2014-10-4 | 3
2      | 2014-10-6 | 1

I've been trying to use the answers from

but I can't work out how to use the dense_rank() trick with other window functions to get the right result.

2条回答
我欲成王,谁敢阻挡
2楼-- · 2019-02-14 18:23

Building on this table (not using the SQL keyword "date" as column name.):

CREATE TABLE tbl(
  pid int
, the_date date
, PRIMARY KEY (pid, the_date)
);

Query:

SELECT pid, the_date
    , row_number() OVER (PARTITION BY pid, grp ORDER BY the_date) AS in_streak
FROM  (
   SELECT *, the_date - '2000-01-01'::date - row_number() 
           OVER (PARTITION BY pid ORDER BY the_date) AS grp
   FROM   tbl
) sub
ORDER  BY pid, the_date;

Subtracting a date from another date yields an integer. Since you are looking for consecutive days, every next row would be greater by one. If we subtract row_number() from that, the whole streak ends up in the same group (grp) per pid. Then it's simple to deal out number per group.

grp is calculated with two subtractions, which should be fastest. An equally fast alternative could be:

the_date - row_number() OVER (PARTITION BY pid ORDER BY the_date) * interval '1d' AS grp

One multiplication, one subtraction. String concatenation and casting is more expensive. Test with EXPLAIN ANALYZE.

Don't forget to partition by pid additionally in both steps, or you'll inadvertently mix groups that should be separated.

Using a subquery, since that is typically faster than a CTE. There is nothing here that a plain subquery couldn't do.

And since you mentioned it: dense_rank() is obviously not necessary here. Basic row_number() does the job.

查看更多
混吃等死
3楼-- · 2019-02-14 18:27

You'll get more attention if you include CREATE TABLE statements and INSERT statements in your question.

create table test (
  pid integer not null,
  date date not null,
  primary key (pid, date)
);

insert into test values
(1,'2014-10-1'), (1,'2014-10-2'), (1,'2014-10-3'), (1,'2014-10-5'),
(1,'2014-10-7'), (2,'2014-10-1'), (2,'2014-10-2'), (2,'2014-10-3'),
(2,'2014-10-5'), (2,'2014-10-7');

The principle is simple. A streak of distinct, consecutive dates minus row_number() is a constant. You can group by the constant, and take the dense_rank() over that result.

with grouped_dates as (
  select pid, date, 
         (date - (row_number() over (partition by pid order by date) || ' days')::interval)::date as grouping_date
  from test
)
select * , dense_rank() over (partition by grouping_date order by date) as in_streak
from grouped_dates
order by pid, date
pid  date         grouping_date  in_streak
--
1    2014-10-01   2014-09-30     1
1    2014-10-02   2014-09-30     2
1    2014-10-03   2014-09-30     3
1    2014-10-05   2014-10-01     1
1    2014-10-07   2014-10-02     1
2    2014-10-01   2014-09-30     1
2    2014-10-02   2014-09-30     2
2    2014-10-03   2014-09-30     3
2    2014-10-05   2014-10-01     1
2    2014-10-07   2014-10-02     1
查看更多
登录 后发表回答