There is a table with visit data:
uid (INT) | created_at (DATETIME)
I want to find how many consecutive days a user has visited our app. So, for instance:
SELECT DISTINCT DATE(created_at) AS d FROM visits WHERE uid = 123
will return:
d
------------
2012-04-28
2012-04-29
2012-04-30
2012-05-03
2012-05-04
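There are 5 records and two intervals: 3 days (April 28 to 30) and 2 days (May 3 to 4).
My question is how to find the maximum number of consecutive days on which a user has visited the app (3 days in the example). I tried to find a suitable function in the SQL docs, but with no success. Am I missing something?
UPD: Thank you all for your answers! Actually, I'm working with the Vertica analytics database (http://vertica.com/). It is a fairly rare product and only a few people have experience with it, although it does support the SQL-99 standard.
Well, most of the solutions work with slight modifications. In the end, I created my own version of the query: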
-- returns the start of each visit series
SELECT t1.d as s FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', -1, t1.d))
WHERE t2.d is null GROUP BY t1.d
s
---------------------
2012-04-28 01:00:00
2012-05-03 01:00:00
-- returns the end of each visit series
SELECT t1.d as f FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', 1, t1.d))
WHERE t2.d is null GROUP BY t1.d
f
---------------------
2012-04-30 01:00:00
2012-05-04 01:00:00
So now all we need to do is join the two results somehow, for instance by row index.
SELECT s, f, DATEDIFF(day, s, f) + 1 as seq FROM (
-- number the series starts in date order so the pairing is deterministic
SELECT t1.d as s, ROW_NUMBER() OVER (ORDER BY t1.d) as o1 FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', -1, t1.d))
WHERE t2.d is null GROUP BY t1.d
) tbl1 LEFT JOIN (
-- number the series ends the same way
SELECT t1.d as f, ROW_NUMBER() OVER (ORDER BY t1.d) as o2 FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', 1, t1.d))
WHERE t2.d is null GROUP BY t1.d
) tbl2 ON o1 = o2
Sample output:
s | f | seq
---------------------+---------------------+-----
2012-04-28 01:00:00 | 2012-04-30 01:00:00 | 3
2012-05-03 01:00:00 | 2012-05-04 01:00:00 | 2
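For readers who want to check the start/end pairing idea outside the database, here is a minimal Python sketch of the same logic. It is not the Vertica query itself; the function name series and the variable names are made up for illustration, and it assumes one user's visits reduced to calendar days.

```python
from datetime import date, timedelta

# Sample visit days for one user, taken from the question.
visits = [date(2012, 4, 28), date(2012, 4, 29), date(2012, 4, 30),
          date(2012, 5, 3), date(2012, 5, 4)]

def series(days):
    """Return (start, end, length) for every run of consecutive days."""
    day_set = set(days)
    one = timedelta(days=1)
    # A run starts on a day whose previous day was not visited ...
    starts = sorted(d for d in day_set if d - one not in day_set)
    # ... and ends on a day whose next day was not visited.
    ends = sorted(d for d in day_set if d + one not in day_set)
    # Both lists are sorted, so the i-th start belongs to the i-th end.
    return [(s, f, (f - s).days + 1) for s, f in zip(starts, ends)]

runs = series(visits)
longest = max(length for _, _, length in runs)
```

The pairing only works because both lists are in date order, which is the role the row index plays in the SQL.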
Answer 1:
Another approach, and the shortest one: do a self-join.
with grouped_result as
(
select
sr.d,
sum((fr.d is null)::int) over(order by sr.d) as group_number
from tbl sr
left join tbl fr on sr.d = fr.d + interval '1 day'
)
select d, group_number, count(d) over m as consecutive_days
from grouped_result
window m as (partition by group_number)
Output:
d | group_number | consecutive_days
---------------------+--------------+------------------
2012-04-28 08:00:00 | 1 | 3
2012-04-29 08:00:00 | 1 | 3
2012-04-30 08:00:00 | 1 | 3
2012-05-03 08:00:00 | 2 | 2
2012-05-04 08:00:00 | 2 | 2
(5 rows)
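Live test: http://www.sqlfiddle.com/#!1/93789/1
sr = second row, fr = first row (or perhaps previous row? ツ). Basically we are back-tracking: this simulates LAG on databases that don't support it (Postgres does support LAG, but that solution is very long, because window functions cannot be nested inside other window functions). So in this query we use a hybrid approach: simulate LAG via a join, then apply a SUM window over the result, which produces the group number.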
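To make the hybrid approach concrete, here is a rough Python sketch of the two steps: a lookup that stands in for the left join, then a running sum of the "no previous row" flags. The names tbl, joined and flags are only illustrative, and, as in the SQL, a match requires the previous row to fall exactly one day earlier, time of day included.

```python
from datetime import datetime, timedelta
from itertools import accumulate

# Sample rows, matching the timestamps shown in the output above.
tbl = [datetime(2012, 4, 28, 8), datetime(2012, 4, 29, 8), datetime(2012, 4, 30, 8),
       datetime(2012, 5, 3, 8), datetime(2012, 5, 4, 8)]

by_value = {d: d for d in tbl}

# left join tbl fr on sr.d = fr.d + interval '1 day':
# fr is the row exactly one day before sr, or None when there is no such row.
joined = [(sr, by_value.get(sr - timedelta(days=1))) for sr in sorted(tbl)]

# (fr.d is null)::int, 1 marks the first row of a run.
flags = [int(fr is None) for _, fr in joined]

# sum(...) over (order by sr.d), a running sum of the flags.
group_number = list(accumulate(flags))
```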
UPDATE
I forgot the final query. The query above illustrates the underpinnings of the group numbering; it needs to be morphed into this:
with grouped_result as
(
select
sr.d,
sum((fr.d is null)::int) over(order by sr.d) as group_number
from tbl sr
left join tbl fr on sr.d = fr.d + interval '1 day'
)
select min(d) as starting_date, max(d) as end_date, count(d) as consecutive_days
from grouped_result
group by group_number
-- order by consecutive_days desc limit 1
STARTING_DATE                | END_DATE                     | CONSECUTIVE_DAYS
-----------------------------+------------------------------+------------------
April, 28 2012 08:00:00-0700 | April, 30 2012 08:00:00-0700 | 3
May, 03 2012 08:00:00-0700   | May, 04 2012 08:00:00-0700   | 2
UPDATE
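I now know why my other solution, the one using window functions, became long: it grew while I was trying to illustrate the logic of group numbering and counting over the group. If I cut to the chase, as in my MySQL approach, the window-function query can be shorter. Having said that, here is my old window-function approach, albeit improved: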
with headers as
(
select
d,lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
,sequence_group as
(
select d, sum(header::int) over (order by d) as group_number
from headers
)
select min(d) as starting_date,max(d) as ending_date,count(d) as consecutive_days
from sequence_group
group by group_number
-- order by consecutive_days desc limit 1
Live test: http://www.sqlfiddle.com/#!1/93789/21
Answer 2:
In MySQL, you can do this:
SET @nextDate = CURRENT_DATE;
SET @RowNum = 1;
SELECT MAX(RowNumber) AS ConsecutiveVisits
FROM ( SELECT @RowNum := IF(@NextDate = Created_At, @RowNum + 1, 1) AS RowNumber,
Created_At,
@NextDate := DATE_ADD(Created_At, INTERVAL 1 DAY) AS NextDate
FROM Visits
ORDER BY Created_At
) Visits
Example here:
http://sqlfiddle.com/#!2/6e035/8
However, I'm not 100% sure this is the best way to do it.
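The variable trick in the MySQL query amounts to a single pass with a counter that resets whenever the expected next date does not show up. A small Python sketch of that idea follows; max_consecutive is a made-up name, and the sketch assumes one row per calendar day, since a duplicate date resets the counter here just as it would in the query.

```python
from datetime import date, timedelta

def max_consecutive(days):
    """Longest run of consecutive days, assuming one row per day."""
    next_day = None  # plays the role of @NextDate
    row_num = 0      # plays the role of @RowNum
    best = 0         # returns 0 for no rows, where SQL MAX() would give NULL
    for d in sorted(days):  # ORDER BY Created_At
        # Continue the run if this is the expected day, otherwise restart at 1.
        row_num = row_num + 1 if d == next_day else 1
        next_day = d + timedelta(days=1)
        best = max(best, row_num)
    return best

visits = [date(2012, 4, 28), date(2012, 4, 29), date(2012, 4, 30),
          date(2012, 5, 3), date(2012, 5, 4)]
result = max_consecutive(visits)
```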
In PostgreSQL:
;WITH RECURSIVE VisitsCTE AS
( SELECT Created_At, 1 AS ConsecutiveDays
FROM Visits
UNION ALL
SELECT v.Created_At, ConsecutiveDays + 1
FROM Visits v
INNER JOIN VisitsCTE cte
ON 1 + cte.Created_At = v.Created_At
)
SELECT MAX(ConsecutiveDays) AS ConsecutiveDays
FROM VisitsCTE
Example here:
http://sqlfiddle.com/#!1/16c90/9
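The recursive CTE can be read as "start a chain of length 1 at every visit, then keep extending each chain to the following day for as long as that day was visited". Below is a hedged Python sketch of that reading; longest_chain and frontier are illustrative names, not anything from the query.

```python
from datetime import date, timedelta

def longest_chain(days):
    present = set(days)
    one = timedelta(days=1)
    # Anchor member: every visited day starts a chain of length 1.
    frontier = {(d, 1) for d in present}
    best = 0
    while frontier:
        best = max(best, max(n for _, n in frontier))
        # Recursive member: extend each chain to the next day if it was visited.
        frontier = {(d + one, n + 1) for d, n in frontier if d + one in present}
    return best

visits = [date(2012, 4, 28), date(2012, 4, 29), date(2012, 4, 30),
          date(2012, 5, 3), date(2012, 5, 4)]
result = longest_chain(visits)
```

Every day starts its own chain, so the work grows with the square of the run length. That is fine for short streaks but worth keeping in mind for long ones.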
Answer 3:
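I know PostgreSQL has something similar to the common table expressions available in MSSQL. I'm not that familiar with PostgreSQL, but the code below works in MSSQL and does what you want.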
create table #tempdates (
mydate date
)
insert into #tempdates(mydate) values('2012-04-28')
insert into #tempdates(mydate) values('2012-04-29')
insert into #tempdates(mydate) values('2012-04-30')
insert into #tempdates(mydate) values('2012-05-03')
insert into #tempdates(mydate) values('2012-05-04');
with maxdays (s, e, c)
as
(
select mydate, mydate, 1
from #tempdates
union all
select m.s, mydate, m.c + 1
from #tempdates t
inner join maxdays m on DATEADD(day, -1, t.mydate)=m.e
)
select MIN(o.s),o.e,max(o.c)
from (
select m1.s,max(m1.e) e,max(m1.c) c
from maxdays m1
group by m1.s
) o
group by o.e
drop table #tempdates
And here is the SQL Fiddle: http://sqlfiddle.com/#!3/42b38/2
Answer 4:
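These are all very good answers, but I thought I should contribute another approach that uses an analytic capability specific to Vertica (after all, it is part of what you are paying for). I promise the final query is short.
First, a query using conditional_true_event(). From Vertica's documentation:
Assigns an event window number to each row, starting from 0, and increments the number by 1 when the result of the boolean argument expression evaluates true.
The example query looks like this: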
select uid, created_at,
conditional_true_event( created_at - lag(created_at) > '1 day' )
over (partition by uid order by created_at) as seq_id
from visits;
Output:
uid created_at seq_id
--- ------------------- ------
123 2012-04-28 00:00:00 0
123 2012-04-29 00:00:00 0
123 2012-04-30 00:00:00 0
123 2012-05-03 00:00:00 1
123 2012-05-04 00:00:00 1
123 2012-06-04 00:00:00 2
123 2012-06-04 00:00:00 2
Now the final query becomes easy:
select uid, seq_id, count(1) num_days, min(created_at) s, max(created_at) f
from
(
select uid, created_at,
conditional_true_event( created_at - lag(created_at) > '1 day' )
over (partition by uid order by created_at) as seq_id
from visits
) as seq
group by uid, seq_id;
Final output:
uid seq_id num_days s f
--- ------ -------- ------------------- -------------------
123 0 3 2012-04-28 00:00:00 2012-04-30 00:00:00
123 1 2 2012-05-03 00:00:00 2012-05-04 00:00:00
123 2 2 2012-06-04 00:00:00 2012-06-04 00:00:00
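One final note: num_days is actually the number of rows in the inner query. If the original table contains two '2012-04-28' visits (that is, duplicates), you may want to work around that.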
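To illustrate both the event numbering and the duplicate caveat, here is a small Python emulation. It is a sketch under stated assumptions, not Vertica's implementation: the helper conditional_true_event is hand-written here, and it assumes that a NULL condition (the first row, which has no lag value) does not increment the counter.

```python
from datetime import date, timedelta

def conditional_true_event(flags):
    """Running count of true flags, starting at 0; None is treated as not true."""
    seq, out = 0, []
    for flag in flags:
        if flag:
            seq += 1
        out.append(seq)
    return out

# Rows from the example output above, including the duplicated 2012-06-04.
created = [date(2012, 4, 28), date(2012, 4, 29), date(2012, 4, 30),
           date(2012, 5, 3), date(2012, 5, 4), date(2012, 6, 4), date(2012, 6, 4)]

one_day = timedelta(days=1)
gaps = [None] + [b - a for a, b in zip(created, created[1:])]
# created_at - lag(created_at) > '1 day'; None for the first row.
flags = [None if g is None else g > one_day for g in gaps]
seq_ids = conditional_true_event(flags)

rows_per_seq, days_per_seq = {}, {}
for d, s in zip(created, seq_ids):
    rows_per_seq[s] = rows_per_seq.get(s, 0) + 1   # what count(1) reports
    days_per_seq.setdefault(s, set()).add(d)       # distinct days instead
distinct_days = {s: len(ds) for s, ds in days_per_seq.items()}
```

The last group shows the difference: two rows, but only one distinct day. In SQL, one way around it would be to deduplicate the dates in the inner query first.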
Answer 5:
The following is Oracle-friendly and does not require recursive logic.
WITH
visit_dates (
visit_id,
date_id,
group_id
)
AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY TRUNC(created_at)),
TRUNC(SYSDATE) - TRUNC(created_at),
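-- days-ago falls by 1 per day while the row number rises by 1, so their sum is constant within a run
TRUNC(SYSDATE) - TRUNC(created_at) + ROW_NUMBER() OVER (ORDER BY TRUNC(created_at))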
FROM
visits
GROUP BY
TRUNC(created_at)
)
,
group_duration (
group_id,
duration
)
AS
(
SELECT
group_id,
MAX(date_id) - MIN(date_id) + 1 AS duration
FROM
visit_dates
GROUP BY
group_id
)
SELECT
MAX(duration) AS max_duration
FROM
group_duration
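The grouping key in this answer is easy to check by hand. date_id counts days back from today, so it shrinks by one for each later day, while the row number grows by one; within a run of consecutive days it is therefore their sum that stays constant. Here is a minimal Python sketch, with a fixed stand-in for TRUNC(SYSDATE) (an assumption, any later date behaves the same):

```python
from datetime import date

visits = [date(2012, 4, 28), date(2012, 4, 29), date(2012, 4, 30),
          date(2012, 5, 3), date(2012, 5, 4)]

today = date(2012, 5, 10)  # stand-in for TRUNC(SYSDATE), chosen arbitrarily

groups = {}
# GROUP BY TRUNC(created_at) leaves one row per day; number them in date order.
for visit_id, d in enumerate(sorted(set(visits)), start=1):
    date_id = (today - d).days      # days ago
    group_id = date_id + visit_id   # constant within a run of consecutive days
    groups.setdefault(group_id, []).append(date_id)

# MAX(date_id) - MIN(date_id) + 1 per group, then the overall maximum.
durations = {g: max(ids) - min(ids) + 1 for g, ids in groups.items()}
max_duration = max(durations.values())
```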
Answer 6:
PostgreSQL:
with headers as
(
select
d,
lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
,sequence_group as
(
select d, sum(header::int) over m as group_number
from headers
window m as (order by d)
)
,consecutive_list as
(
select d, group_number, count(d) over m as consecutive_count
from sequence_group
window m as (partition by group_number)
)
select * from consecutive_list
A divide-and-conquer approach in 3 steps.
Step one, find the headers:
with headers as
(
select
d,
lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
select * from headers
Output:
d | header
---------------------+--------
2012-04-28 08:00:00 | t
2012-04-29 08:00:00 | f
2012-04-30 08:00:00 | f
2012-05-03 08:00:00 | t
2012-05-04 08:00:00 | f
(5 rows)
Step two, assign the group numbers:
with headers as
(
select
d,
lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
,sequence_group as
(
select d, sum(header::int) over m as group_number
from headers
window m as (order by d)
)
select * from sequence_group
Output:
d | group_number
---------------------+--------------
2012-04-28 08:00:00 | 1
2012-04-29 08:00:00 | 1
2012-04-30 08:00:00 | 1
2012-05-03 08:00:00 | 2
2012-05-04 08:00:00 | 2
(5 rows)
Step three, count the days in each group:
with headers as
(
select
d,
lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
,sequence_group as
(
select d, sum(header::int) over m as group_number
from headers
window m as (order by d)
)
,consecutive_list as
(
select d, group_number, count(d) over m as consecutive_count
from sequence_group
window m as (partition by group_number)
)
select * from consecutive_list
Output:
d | group_number | consecutive_count
---------------------+--------------+-----------------
2012-04-28 08:00:00 | 1 | 3
2012-04-29 08:00:00 | 1 | 3
2012-04-30 08:00:00 | 1 | 3
2012-05-03 08:00:00 | 2 | 2
2012-05-04 08:00:00 | 2 | 2
(5 rows)
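The three steps map almost line for line onto ordinary list operations. This Python sketch reproduces the three outputs above. The variable names mirror the column names, and the list previous is an illustrative stand-in for lag(d).

```python
from collections import Counter
from datetime import datetime, timedelta
from itertools import accumulate

d = [datetime(2012, 4, 28, 8), datetime(2012, 4, 29, 8), datetime(2012, 4, 30, 8),
     datetime(2012, 5, 3, 8), datetime(2012, 5, 4, 8)]

# Step 1: a row is a header when it has no previous row or the gap is not one day.
previous = [None] + d[:-1]  # lag(d) over (order by d)
header = [p is None or cur - p != timedelta(days=1) for p, cur in zip(previous, d)]

# Step 2: the running sum of the header flags is the group number.
group_number = list(accumulate(int(h) for h in header))

# Step 3: count the rows per group and attach that count to every row.
size = Counter(group_number)
consecutive_count = [size[g] for g in group_number]
```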
Answer 7:
This one is for MySQL. It is the shortest and uses the fewest variables (only one):
select
min(d) as starting_date, max(d) as ending_date,
count(d) as consecutive_days
from
(
select
sr.d,
IF(fr.d is null,@group_number := @group_number + 1,@group_number)
as group_number
from tbl sr
left join tbl fr on sr.d = adddate(fr.d,interval 1 day)
cross join (select @group_number := 0) as grp
) as x
group by group_number
Output:
STARTING_DATE                | ENDING_DATE                  | CONSECUTIVE_DAYS
-----------------------------+------------------------------+------------------
April, 28 2012 08:00:00-0700 | April, 30 2012 08:00:00-0700 | 3
May, 03 2012 08:00:00-0700   | May, 04 2012 08:00:00-0700   | 2
Live test: http://www.sqlfiddle.com/#!2/65169/1
Answer 8:
For PostgreSQL 8.4 or later, there is a short and clean way that uses window functions and no JOIN.
I would expect this to be the fastest solution posted so far:
WITH x AS (
SELECT created_at AS d
, lag(created_at) OVER (ORDER BY created_at) = (created_at - 1) AS nu
FROM visits
WHERE uid = 1
)
, y AS (
SELECT d, count(NULLIF(nu, TRUE)) OVER (ORDER BY d) AS seq
FROM x
)
SELECT count(*) AS max_days, min(d) AS seq_from, max(d) AS seq_to
FROM y
GROUP BY seq
ORDER BY 1 DESC
LIMIT 1;
Returns:
max_days | seq_from | seq_to
---------+------------+-----------
3 | 2012-04-28 | 2012-04-30
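This assumes created_at is a date and unique.
In CTE x: for every day our user visits, check whether he was here yesterday, too. To calculate "yesterday", just use created_at - 1. The first row is a special case and produces NULL here.
In CTE y: calculate a running count of "days without yesterday so far" (seq) for every day. NULL values are not counted, so count(NULLIF(nu, TRUE)) is the fastest and shortest way, and it also covers the special case.
Finally, group the days per seq and count them. While at it, I added the first and last day of each sequence. ORDER BY the length of the sequence, and pick the longest one.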
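The count(NULLIF(nu, TRUE)) step leans on SQL's NULL handling, which is easy to misread. Here is a rough Python model of it, with None standing in for NULL. The nullif helper is hand-written for this sketch, and the data are the days from the question.

```python
from datetime import date, timedelta

def nullif(value, other):
    """SQL NULLIF: None when the two values are equal, otherwise the first value."""
    return None if value == other else value

days = [date(2012, 4, 28), date(2012, 4, 29), date(2012, 4, 30),
        date(2012, 5, 3), date(2012, 5, 4)]

# CTE x: nu is None for the first row, True if the previous row is yesterday, else False.
nu = [None] + [prev == cur - timedelta(days=1) for prev, cur in zip(days, days[1:])]

# CTE y: running count of the non-NULL results of NULLIF(nu, TRUE).
# True becomes None, None stays None, so only False (a gap) is counted.
seq, running = [], 0
for flag in nu:
    if nullif(flag, True) is not None:
        running += 1
    seq.append(running)

# Final SELECT: group by seq and keep the longest group.
groups = {}
for d, s in zip(days, seq):
    groups.setdefault(s, []).append(d)
max_days, seq_from, seq_to = max((len(g), min(g), max(g)) for g in groups.values())
```

The first row and every "yesterday was visited" row both end up uncounted, which is why the first run gets seq 0 without any special handling.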
Answer 9:
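Seeing the OP's query approach for their Vertica database, I tried to make the two joins run at the same time.
These PostgreSQL and SQL Server versions of the query should both work in Vertica.
PostgreSQL version: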
select
min(gr.d) as start_date,
max(gr.d) as end_date,
date_part('day', max(gr.d) - min(gr.d))+1 as consecutive_days
from
(
select
cr.d, (row_number() over() - 1) / 2 as pair_number
from tbl cr
left join tbl pr on pr.d = cr.d - interval '1 day'
left join tbl nr on nr.d = cr.d + interval '1 day'
where pr.d is null <> nr.d is null
) as gr
group by pair_number
order by start_date
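About pr.d is null <> nr.d is null: it means that either the previous row is null or the next row is null, but never both. This removes the dates that do not belong to a run, since for those both the previous and the next row are null, and it also removes the dates in the middle of a run, where neither is null. What remains are just the headers and footers of each run. This is also known as an XOR operation.
With only the header and footer dates left, we can now pair them up via row_number: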
(row_number() over() - 1) / 2 as pair_number
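row_number() starts at 1, so we need to subtract 1 from it (we could also add 1 instead) and then divide by two; this pairs up the dates that are adjacent to each other.
Live test: http://www.sqlfiddle.com/#!1/fc440/7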
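Here is a small Python sketch of the XOR filter and the pairing. The isolated visit on 2012-05-10 is made up for illustration and is not part of the question's data; it shows that a one-day run is filtered out entirely, which is worth knowing if single-day streaks matter to you.

```python
from datetime import date, timedelta

days = [date(2012, 4, 28), date(2012, 4, 29), date(2012, 4, 30),
        date(2012, 5, 3), date(2012, 5, 4),
        date(2012, 5, 10)]  # hypothetical isolated visit, added for illustration

present = set(days)
one = timedelta(days=1)

# where pr.d is null <> nr.d is null:
# keep a day only if exactly one of its neighbours is missing.
boundaries = [d for d in sorted(present)
              if (d - one not in present) != (d + one not in present)]

# (row_number() - 1) / 2 with integer division: rows 1,2 -> 0 and rows 3,4 -> 1.
pairs = {}
for row_number, d in enumerate(boundaries, start=1):
    pairs.setdefault((row_number - 1) // 2, []).append(d)

runs = [(min(p), max(p), (max(p) - min(p)).days + 1) for p in pairs.values()]
```

The sketch numbers the boundary rows in date order, so the pairing relies on that ordering.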
This is the SQL Server version:
select
min(gr.d) as start_date,
max(gr.d) as end_date,
datediff(day, min(gr.d),max(gr.d)) +1 as consecutive_days
from
(
select
cr.d, (row_number() over(order by cr.d) - 1) / 2 as pair_number
from tbl cr
left join tbl pr on pr.d = dateadd(day,-1,cr.d)
left join tbl nr on nr.d = dateadd(day,+1,cr.d)
where
case when pr.d is null then 1 else 0 end
<> case when nr.d is null then 1 else 0 end
) as gr
group by pair_number
order by start_date
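Same logic as above, apart from syntactic differences in the date functions. Also, SQL Server requires an ORDER BY clause in its OVER, whereas PostgreSQL's OVER can be left empty.
SQL Server has no first-class boolean type, which is why we cannot compare booleans directly: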
pr.d is null <> nr.d is null
In SQL Server we have to do this instead:
case when pr.d is null then 1 else 0 end
<> case when nr.d is null then 1 else 0 end
Live test: http://www.sqlfiddle.com/#!3/65df2/17
Answer 10:
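This question has already been answered several times. However, the SQL statements all seem too complex. This can be done with basic SQL, a way of enumerating rows, and some date arithmetic.
The key observation is that if you have a bunch of days and a parallel sequence of integers, then the difference between them is a constant date whenever the days are in a sequence.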
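The observation is simple to verify with a few lines of Python: number each user's distinct days, subtract that number from the day, and every run of consecutive days collapses onto one date. In this sketch uid 123 comes from the question, while uid 456 and its dates are invented to show the per-user partitioning.

```python
from collections import Counter
from datetime import date, timedelta

# (uid, visit day) pairs; uid 456 is a made-up second user.
visits = [(123, date(2012, 4, 28)), (123, date(2012, 4, 29)), (123, date(2012, 4, 30)),
          (123, date(2012, 5, 3)), (123, date(2012, 5, 4)),
          (456, date(2012, 5, 1)), (456, date(2012, 5, 2))]

counts = Counter()
last_uid, offset = None, 0
for uid, d in sorted(set(visits)):  # distinct days, ordered by uid then day
    # row_number() over (partition by uid order by d)
    offset = offset + 1 if uid == last_uid else 1
    last_uid = uid
    groupstart = d - timedelta(days=offset)  # constant within a run
    counts[(uid, groupstart)] += 1

# Longest run per user.
longest = {}
for (uid, _), n in counts.items():
    longest[uid] = max(longest.get(uid, 0), n)
```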
The following query uses this observation to answer the original question:
select uid, min(d) as startdate, count(*) as numdaysinseq
from
(
select uid, d, adddate(d, interval -offset day) as groupstart
from
(
select uid, d, row_number() over (partition by uid order by d) as offset
from
(
SELECT DISTINCT uid, DATE(created_at) AS d
FROM visits
) t
) t
) t
group by uid, groupstart
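Alas, MySQL (before version 8.0) does not have the row_number() function. However, there is a workaround using variables, and most other databases do have this function.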
Source: SQL issue - calculate max days sequence