可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
The following User History table contains one record for every day a given user has accessed a website (in a 24 hour UTC period). It has many thousands of records, but only one record per day per user. If the user has not accessed the website for that day, no record will be generated.
Id UserId CreationDate
------ ------ ------------
750997 12 2009-07-07 18:42:20.723
750998 15 2009-07-07 18:42:20.927
751000 19 2009-07-07 18:42:22.283
What I'm looking for is a SQL query on this table with good performance, that tells me which userids have accessed the website for (n) continuous days without missing a day.
In other words, how many users have (n) records in this table with sequential (day-before, or day-after) dates? If any day is missing from the sequence, the sequence is broken and should restart again at 1; we're looking for users who have achieved a continuous number of days here with no gaps.
Any resemblance between this query and a particular Stack Overflow badge is purely coincidental, of course.. :)
回答1:
The answer is obviously:
SELECT DISTINCT UserId
FROM UserHistory uh1
WHERE (
SELECT COUNT(*)
FROM UserHistory uh2
WHERE uh2.CreationDate
BETWEEN uh1.CreationDate AND DATEADD(d, @days, uh1.CreationDate)
) = @days OR UserId = 52551
EDIT:
Okay here's my serious answer:
DECLARE @days int
DECLARE @seconds bigint
SET @days = 30
SET @seconds = (@days * 24 * 60 * 60) - 1
SELECT DISTINCT UserId
FROM (
SELECT uh1.UserId, Count(uh1.Id) as Conseq
FROM UserHistory uh1
INNER JOIN UserHistory uh2 ON uh2.CreationDate
BETWEEN uh1.CreationDate AND
DATEADD(s, @seconds, DATEADD(dd, DATEDIFF(dd, 0, uh1.CreationDate), 0))
AND uh1.UserId = uh2.UserId
GROUP BY uh1.Id, uh1.UserId
) as Tbl
WHERE Conseq >= @days
EDIT:
[Jeff Atwood] This is a great fast solution and deserves to be accepted, but Rob Farley's solution is also excellent and arguably even faster (!). Please check it out too!
回答2:
How about (and please make sure the previous statement ended with a semi-colon):
WITH numberedrows
AS (SELECT ROW_NUMBER() OVER (PARTITION BY UserID
ORDER BY CreationDate)
- DATEDIFF(day,'19000101',CreationDate) AS TheOffset,
CreationDate,
UserID
FROM tablename)
SELECT MIN(CreationDate),
MAX(CreationDate),
COUNT(*) AS NumConsecutiveDays,
UserID
FROM numberedrows
GROUP BY UserID,
TheOffset
The idea being that if we have list of the days (as a number), and a row_number, then missed days make the offset between these two lists slightly bigger. So we're looking for a range that has a consistent offset.
You could use "ORDER BY NumConsecutiveDays DESC" at the end of this, or say "HAVING count(*) > 14" for a threshold...
I haven't tested this though - just writing it off the top of my head. Hopefully works in SQL2005 and on.
...and would be very much helped by an index on tablename(UserID, CreationDate)
Edited: Turns out Offset is a reserved word, so I used TheOffset instead.
Edited: The suggestion to use COUNT(*) is very valid - I should've done that in the first place but wasn't really thinking. Previously it was using datediff(day, min(CreationDate), max(CreationDate)) instead.
Rob
回答3:
If you can change the table schema, I'd suggest adding a column LongestStreak
to the table which you'd set to the number of sequential days ending to the CreationDate
. It's easy to update the table at login time (similar to what you are doing already, if no rows exist of the current day, you'll check if any row exists for the previous day. If true, you'll increment the LongestStreak
in the new row, otherwise, you'll set it to 1.)
The query will be obvious after adding this column:
if exists(select * from table
where LongestStreak >= 30 and UserId = @UserId)
-- award the Woot badge.
回答4:
Some nicely expressive SQL along the lines of:
select
userId,
dbo.MaxConsecutiveDates(CreationDate) as blah
from
dbo.Logins
group by
userId
Assuming you have a user defined aggregate function something along the lines of (beware this is buggy):
using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Runtime.InteropServices;
namespace SqlServerProject1
{
[StructLayout(LayoutKind.Sequential)]
[Serializable]
internal struct MaxConsecutiveState
{
public int CurrentSequentialDays;
public int MaxSequentialDays;
public SqlDateTime LastDate;
}
[Serializable]
[SqlUserDefinedAggregate(
Format.Native,
IsInvariantToNulls = true, //optimizer property
IsInvariantToDuplicates = false, //optimizer property
IsInvariantToOrder = false) //optimizer property
]
[StructLayout(LayoutKind.Sequential)]
public class MaxConsecutiveDates
{
/// <summary>
/// The variable that holds the intermediate result of the concatenation
/// </summary>
private MaxConsecutiveState _intermediateResult;
/// <summary>
/// Initialize the internal data structures
/// </summary>
public void Init()
{
_intermediateResult = new MaxConsecutiveState { LastDate = SqlDateTime.MinValue, CurrentSequentialDays = 0, MaxSequentialDays = 0 };
}
/// <summary>
/// Accumulate the next value, not if the value is null
/// </summary>
/// <param name="value"></param>
public void Accumulate(SqlDateTime value)
{
if (value.IsNull)
{
return;
}
int sequentialDays = _intermediateResult.CurrentSequentialDays;
int maxSequentialDays = _intermediateResult.MaxSequentialDays;
DateTime currentDate = value.Value.Date;
if (currentDate.AddDays(-1).Equals(new DateTime(_intermediateResult.LastDate.TimeTicks)))
sequentialDays++;
else
{
maxSequentialDays = Math.Max(sequentialDays, maxSequentialDays);
sequentialDays = 1;
}
_intermediateResult = new MaxConsecutiveState
{
CurrentSequentialDays = sequentialDays,
LastDate = currentDate,
MaxSequentialDays = maxSequentialDays
};
}
/// <summary>
/// Merge the partially computed aggregate with this aggregate.
/// </summary>
/// <param name="other"></param>
public void Merge(MaxConsecutiveDates other)
{
// add stuff for two separate calculations
}
/// <summary>
/// Called at the end of aggregation, to return the results of the aggregation.
/// </summary>
/// <returns></returns>
public SqlInt32 Terminate()
{
int max = Math.Max((int) ((sbyte) _intermediateResult.CurrentSequentialDays), (sbyte) _intermediateResult.MaxSequentialDays);
return new SqlInt32(max);
}
}
}
回答5:
Seems like you could take advantage of the fact that to be continuous over n days would require there to be n rows.
So something like:
SELECT users.UserId, count(1) as cnt
FROM users
WHERE users.CreationDate > now() - INTERVAL 30 DAY
GROUP BY UserId
HAVING cnt = 30
回答6:
Doing this with a single SQL query seems overly complicated to me. Let me break this answer down in two parts.
- What you should have done until now and should start doing now:
Run a daily cron job that checks for every user wether he has logged in today and then increments a counter if he has or sets it to 0 if he hasn't.
- What you should do now:
- Export this table to a server that doesn't run your website and won't be needed for a while. ;)
- Sort it by user, then date.
- go through it sequentially, keep a counter...
回答7:
If this is so important to you, source this event and drive a table to give you this info. No need to kill the machine with all those crazy queries.
回答8:
You could use a recursive CTE (SQL Server 2005+):
WITH recur_date AS (
SELECT t.userid,
t.creationDate,
DATEADD(day, 1, t.created) 'nextDay',
1 'level'
FROM TABLE t
UNION ALL
SELECT t.userid,
t.creationDate,
DATEADD(day, 1, t.created) 'nextDay',
rd.level + 1 'level'
FROM TABLE t
JOIN recur_date rd on t.creationDate = rd.nextDay AND t.userid = rd.userid)
SELECT t.*
FROM recur_date t
WHERE t.level = @numDays
ORDER BY t.userid
回答9:
Joe Celko has a complete chapter on this in SQL for Smarties (calling it Runs and Sequences). I don't have that book at home, so when I get to work... I'll actually answer this. (assuming history table is called dbo.UserHistory and the number of days is @Days)
Another lead is from SQL Team's blog on runs
The other idea I've had, but don't have a SQL server handy to work on here is to use a CTE with a partitioned ROW_NUMBER like this:
WITH Runs
AS
(SELECT UserID
, CreationDate
, ROW_NUMBER() OVER(PARTITION BY UserId
ORDER BY CreationDate)
- ROW_NUMBER() OVER(PARTITION BY UserId, NoBreak
ORDER BY CreationDate) AS RunNumber
FROM
(SELECT UH.UserID
, UH.CreationDate
, ISNULL((SELECT TOP 1 1
FROM dbo.UserHistory AS Prior
WHERE Prior.UserId = UH.UserId
AND Prior.CreationDate
BETWEEN DATEADD(dd, DATEDIFF(dd, 0, UH.CreationDate), -1)
AND DATEADD(dd, DATEDIFF(dd, 0, UH.CreationDate), 0)), 0) AS NoBreak
FROM dbo.UserHistory AS UH) AS Consecutive
)
SELECT UserID, MIN(CreationDate) AS RunStart, MAX(CreationDate) AS RunEnd
FROM Runs
GROUP BY UserID, RunNumber
HAVING DATEDIFF(dd, MIN(CreationDate), MAX(CreationDate)) >= @Days
The above is likely WAY HARDER than it has to be, but left as an a brain tickle for when you have some other definition of "a run" than just dates.
回答10:
A couple of SQL Server 2012 options (assuming N=100 below).
;WITH T(UserID, NRowsPrevious)
AS (SELECT UserID,
DATEDIFF(DAY,
LAG(CreationDate, 100)
OVER
(PARTITION BY UserID
ORDER BY CreationDate),
CreationDate)
FROM UserHistory)
SELECT DISTINCT UserID
FROM T
WHERE NRowsPrevious = 100
Though with my sample data the following worked out more efficient
;WITH U
AS (SELECT DISTINCT UserId
FROM UserHistory) /*Ideally replace with Users table*/
SELECT UserId
FROM U
CROSS APPLY (SELECT TOP 1 *
FROM (SELECT
DATEDIFF(DAY,
LAG(CreationDate, 100)
OVER
(ORDER BY CreationDate),
CreationDate)
FROM UserHistory UH
WHERE U.UserId = UH.UserID) T(NRowsPrevious)
WHERE NRowsPrevious = 100) O
Both rely on the constraint stated in the question that there is at most one record per day per user.
回答11:
Something like this?
select distinct userid
from table t1, table t2
where t1.UserId = t2.UserId
AND trunc(t1.CreationDate) = trunc(t2.CreationDate) + n
AND (
select count(*)
from table t3
where t1.UserId = t3.UserId
and CreationDate between trunc(t1.CreationDate) and trunc(t1.CreationDate)+n
) = n
回答12:
I used a simple math property to identify who consecutively accessed the site. This property is that you should have the day difference between the first time access and last time equal to number of records in your access table log.
Here are SQL script that I tested in Oracle DB (it should work in other DBs as well):
-- show basic understand of the math properties
select ceil(max (creation_date) - min (creation_date))
max_min_days_diff,
count ( * ) real_day_count
from user_access_log
group by user_id;
-- select all users that have consecutively accessed the site
select user_id
from user_access_log
group by user_id
having ceil(max (creation_date) - min (creation_date))
/ count ( * ) = 1;
-- get the count of all users that have consecutively accessed the site
select count(user_id) user_count
from user_access_log
group by user_id
having ceil(max (creation_date) - min (creation_date))
/ count ( * ) = 1;
Table prep script:
-- create table
create table user_access_log (id number, user_id number, creation_date date);
-- insert seed data
insert into user_access_log (id, user_id, creation_date)
values (1, 12, sysdate);
insert into user_access_log (id, user_id, creation_date)
values (2, 12, sysdate + 1);
insert into user_access_log (id, user_id, creation_date)
values (3, 12, sysdate + 2);
insert into user_access_log (id, user_id, creation_date)
values (4, 16, sysdate);
insert into user_access_log (id, user_id, creation_date)
values (5, 16, sysdate + 1);
insert into user_access_log (id, user_id, creation_date)
values (6, 16, sysdate + 5);
回答13:
declare @startdate as datetime, @days as int
set @startdate = cast('11 Jan 2009' as datetime) -- The startdate
set @days = 5 -- The number of consecutive days
SELECT userid
,count(1) as [Number of Consecutive Days]
FROM UserHistory
WHERE creationdate >= @startdate
AND creationdate < dateadd(dd, @days, cast(convert(char(11), @startdate, 113) as datetime))
GROUP BY userid
HAVING count(1) >= @days
The statement cast(convert(char(11), @startdate, 113) as datetime)
removes the time part of the date so we start at midnight.
I would assume also that the creationdate
and userid
columns are indexed.
I just realized that this won't tell you all the users and their total consecutive days. But will tell you which users will have been visiting a set number of days from a date of your choosing.
Revised solution:
declare @days as int
set @days = 30
select t1.userid
from UserHistory t1
where (select count(1)
from UserHistory t3
where t3.userid = t1.userid
and t3.creationdate >= DATEADD(dd, DATEDIFF(dd, 0, t1.creationdate), 0)
and t3.creationdate < DATEADD(dd, DATEDIFF(dd, 0, t1.creationdate) + @days, 0)
group by t3.userid
) >= @days
group by t1.userid
I've checked this and it will query for all users and all dates. It is based on Spencer's 1st (joke?) solution, but mine works.
Update: improved the date handling in the second solution.
回答14:
This should do what you want but I don't have enough data to test efficiency. The convoluted CONVERT/FLOOR stuff is to strip the time portion off the datetime field. If you're using SQL Server 2008 then you could use CAST(x.CreationDate AS DATE).
DECLARE @Range as INT
SET @Range = 10
SELECT DISTINCT UserId, CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate)))
FROM tblUserLogin a
WHERE EXISTS
(SELECT 1
FROM tblUserLogin b
WHERE a.userId = b.userId
AND (SELECT COUNT(DISTINCT(CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, CreationDate)))))
FROM tblUserLogin c
WHERE c.userid = b.userid
AND CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, c.CreationDate))) BETWEEN CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate))) and CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate)))+@Range-1) = @Range)
Creation script
CREATE TABLE [dbo].[tblUserLogin](
[Id] [int] IDENTITY(1,1) NOT NULL,
[UserId] [int] NULL,
[CreationDate] [datetime] NULL
) ON [PRIMARY]
回答15:
Spencer almost did it, but this should be the working code:
SELECT DISTINCT UserId
FROM History h1
WHERE (
SELECT COUNT(*)
FROM History
WHERE UserId = h1.UserId AND CreationDate BETWEEN h1.CreationDate AND DATEADD(d, @n-1, h1.CreationDate)
) >= @n
回答16:
Off the top of my head, MySQLish:
SELECT start.UserId
FROM UserHistory AS start
LEFT OUTER JOIN UserHistory AS pre_start ON pre_start.UserId=start.UserId
AND DATE(pre_start.CreationDate)=DATE_SUB(DATE(start.CreationDate), INTERVAL 1 DAY)
LEFT OUTER JOIN UserHistory AS subsequent ON subsequent.UserId=start.UserId
AND DATE(subsequent.CreationDate)<=DATE_ADD(DATE(start.CreationDate), INTERVAL 30 DAY)
WHERE pre_start.Id IS NULL
GROUP BY start.Id
HAVING COUNT(subsequent.Id)=30
Untested, and almost certainly needs some conversion for MSSQL, but I think that give some ideas.
回答17:
How about one using Tally tables? It follows a more algorithmic approach, and execution plan is a breeze. Populate the tallyTable with numbers from 1 to 'MaxDaysBehind' that you want to scan the table (ie. 90 will look for 3 months behind,etc).
declare @ContinousDays int
set @ContinousDays = 30 -- select those that have 30 consecutive days
create table #tallyTable (Tally int)
insert into #tallyTable values (1)
...
insert into #tallyTable values (90) -- insert numbers for as many days behind as you want to scan
select [UserId],count(*),t.Tally from HistoryTable
join #tallyTable as t on t.Tally>0
where [CreationDate]> getdate()-@ContinousDays-t.Tally and
[CreationDate]<getdate()-t.Tally
group by [UserId],t.Tally
having count(*)>=@ContinousDays
delete #tallyTable
回答18:
Tweaking Bill's query a bit. You might have to truncate the date before grouping to count only one login per day...
SELECT UserId from History
WHERE CreationDate > ( now() - n )
GROUP BY UserId,
DATEADD(dd, DATEDIFF(dd, 0, CreationDate), 0) AS TruncatedCreationDate
HAVING COUNT(TruncatedCreationDate) >= n
EDITED to use DATEADD(dd, DATEDIFF(dd, 0, CreationDate), 0) instead of convert( char(10) , CreationDate, 101 ).
@IDisposable
I was looking to use datepart earlier but i was too lazy to look up the syntax so i figured i d use convert instead. I dint know it had a significant impact Thanks! now i know.
回答19:
assuming a schema that goes like:
create table dba.visits
(
id integer not null,
user_id integer not null,
creation_date date not null
);
this will extract contiguous ranges from a date sequence with gaps.
select l.creation_date as start_d, -- Get first date in contiguous range
(
select min(a.creation_date ) as creation_date
from "DBA"."visits" a
left outer join "DBA"."visits" b on
a.creation_date = dateadd(day, -1, b.creation_date ) and
a.user_id = b.user_id
where b.creation_date is null and
a.creation_date >= l.creation_date and
a.user_id = l.user_id
) as end_d -- Get last date in contiguous range
from "DBA"."visits" l
left outer join "DBA"."visits" r on
r.creation_date = dateadd(day, -1, l.creation_date ) and
r.user_id = l.user_id
where r.creation_date is null