I have some data that contains dates. I'm trying to group the data by consecutive dates; however, the dates are not exactly consecutive. Here is an example:
DateColumn | Value
------------------------+-------
2017-01-18 01:12:34.107 | 215426 <- batch no. 1
2017-01-18 01:12:34.113 | 215636
2017-01-18 01:12:34.623 | 123516
2017-01-18 01:12:34.633 | 289926
2017-01-18 04:58:42.660 | 259063 <- batch no. 2
2017-01-18 04:58:42.663 | 261830
2017-01-18 04:58:42.893 | 219835
2017-01-18 04:58:42.907 | 250165
2017-01-18 05:18:14.660 | 134253 <- batch no. 3
2017-01-18 05:18:14.663 | 134257
2017-01-18 05:18:14.667 | 134372
2017-01-18 05:18:15.040 | 181679
2017-01-18 05:18:15.043 | 226368
2017-01-18 05:18:15.043 | 227070
The data is generated in batches and each row inside a batch takes a few milliseconds to generate. I'm trying to group the results as follows:
Date1 | Date2 | Count
------------------------+-------------------------+------
2017-01-18 01:12:34.107 | 2017-01-18 01:12:34.633 | 4
2017-01-18 04:58:42.660 | 2017-01-18 04:58:42.907 | 4
2017-01-18 05:18:14.660 | 2017-01-18 05:18:15.043 | 6
It is safe to assume that if two consecutive rows are more than 1 minute apart, they belong to different batches.
I tried solutions involving the ROW_NUMBER function, but they only work when the dates are exactly consecutive (the difference between two rows is fixed). How can I achieve the desired result when the difference is fuzzy?
Please note that a batch could be much longer than a minute. For example, a batch might consist of ~3000 rows starting at 2017-01-01 00:00:00 and ending at 2017-01-01 00:05:00, with each row a few dozen or a few hundred milliseconds apart. What is certain is that batches are at least 1 minute apart.
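For reference, here is a minimal script to reproduce the sample data (the table and column names are assumed here; adjust them to your schema):

-- sample table matching the data shown above
create table your_table (DateColumn datetime not null, Value int not null);

insert into your_table (DateColumn, Value) values
('2017-01-18 01:12:34.107', 215426),
('2017-01-18 01:12:34.113', 215636),
('2017-01-18 01:12:34.623', 123516),
('2017-01-18 01:12:34.633', 289926),
('2017-01-18 04:58:42.660', 259063),
('2017-01-18 04:58:42.663', 261830),
('2017-01-18 04:58:42.893', 219835),
('2017-01-18 04:58:42.907', 250165),
('2017-01-18 05:18:14.660', 134253),
('2017-01-18 05:18:14.663', 134257),
('2017-01-18 05:18:14.667', 134372),
('2017-01-18 05:18:15.040', 181679),
('2017-01-18 05:18:15.043', 226368),
('2017-01-18 05:18:15.043', 227070);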
Try this:
select min(t.dateColumn) date1, max(t.dateColumn) date2, count(*)
from (
    select t.*,
           sum(val) over (order by t.dateColumn) grp
    from (
        select t.*,
               case
                   when datediff(ms, lag(t.dateColumn, 1, t.dateColumn) over (order by t.dateColumn), t.dateColumn) > 60000
                   then 1
                   else 0
               end val
        from your_table t
    ) t
) t
group by grp;
Produces:

Date1                   | Date2                   | Count
------------------------+-------------------------+------
2017-01-18 01:12:34.107 | 2017-01-18 01:12:34.633 | 4
2017-01-18 04:58:42.660 | 2017-01-18 04:58:42.907 | 4
2017-01-18 05:18:14.660 | 2017-01-18 05:18:15.043 | 6

The query uses the analytic function lag() to flag the start of each new batch (a row more than 60000 ms after the previous one), then uses an analytic sum() over that flag to assign a group number to each batch, and finally groups by that number to compute the required aggregates.
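To see the intermediate steps, run the inner queries by themselves; on the sample data the flag (val) and the running sum (grp) come out like this:

DateColumn              | val | grp
------------------------+-----+-----
2017-01-18 01:12:34.107 |  0  |  0    <- lag() defaults to the row itself, so the first row gets 0
2017-01-18 01:12:34.113 |  0  |  0
2017-01-18 01:12:34.623 |  0  |  0
2017-01-18 01:12:34.633 |  0  |  0
2017-01-18 04:58:42.660 |  1  |  1    <- more than 1 minute after the previous row
2017-01-18 04:58:42.663 |  0  |  1
2017-01-18 04:58:42.893 |  0  |  1
2017-01-18 04:58:42.907 |  0  |  1
2017-01-18 05:18:14.660 |  1  |  2    <- more than 1 minute after the previous row
2017-01-18 05:18:14.663 |  0  |  2
2017-01-18 05:18:14.667 |  0  |  2
2017-01-18 05:18:15.040 |  0  |  2
2017-01-18 05:18:15.043 |  0  |  2
2017-01-18 05:18:15.043 |  0  |  2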
There may be some misclassification at batch boundaries due to rounding issues with DATETIME. From MSDN:

datetime values are rounded to increments of .000, .003, or .007 seconds.
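For example, a DATETIME value can land on a different tick than the literal you wrote, while DATETIME2 (SQL Server 2008+) keeps it exactly:

select cast('2017-01-18 01:12:34.105' as datetime);      -- 2017-01-18 01:12:34.107 (rounded to a 1/300-second tick)
select cast('2017-01-18 01:12:34.105' as datetime2(3));  -- 2017-01-18 01:12:34.105 (stored exactly)

If that precision matters for your batch boundaries, consider storing the column as datetime2.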
Here is the same query rewritten using CTEs:
WITH cte1(DateColumn, ValueColumn) AS (
-- Insert your query that returns a datetime column and any other column
SELECT
SomeDate,
SomeValue
FROM SomeTable
WHERE SomeColumn IS NOT NULL
), cte2 AS (
-- This query adds a column called "val" that contains
-- 1 when current row date - previous row date > 1 minute
-- 0 otherwise
SELECT
cte1.*,
CASE WHEN DATEDIFF(MS, LAG(DateColumn, 1, DateColumn) OVER (ORDER BY DateColumn), DateColumn) > 60000 THEN 1 ELSE 0 END AS val
FROM cte1
), cte3 AS (
-- This query adds a column called "grp" that numbers
-- the groups using running sum over the "val" column
SELECT
cte2.*,
SUM(val) OVER (ORDER BY DateColumn) AS grp
FROM cte2
)
SELECT
MIN(DateColumn) Date1,
MAX(DateColumn) Date2,
COUNT(ValueColumn) [Count]
FROM cte3
GROUP BY grp
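Note that DATEDIFF returns an int, so the MS version raises an overflow error if two consecutive rows are more than roughly 24 days apart. If your gaps can be that large, compare in seconds instead; a sketch of the one line that changes:

CASE WHEN DATEDIFF(SECOND, LAG(DateColumn, 1, DateColumn) OVER (ORDER BY DateColumn), DateColumn) > 60 THEN 1 ELSE 0 END AS val

(DATEDIFF counts boundary crossings, so the seconds version is accurate only to the second, which is enough for a 1-minute threshold.)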
Remove the seconds and milliseconds from DateColumn and do the grouping:
select min(DateColumn),
max(DateColumn),
count(*)
from Yourtable
group by DATEADD(MINUTE, DATEDIFF(MINUTE, 0, DateColumn), 0)
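The DATEADD(MINUTE, DATEDIFF(MINUTE, 0, ...), 0) expression truncates each value to the start of its minute, for example:

select DATEADD(MINUTE, DATEDIFF(MINUTE, 0, '2017-01-18 01:12:34.107'), 0);
-- returns 2017-01-18 01:12:00.000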
Here are some related questions on truncating seconds from a datetime:
Truncate seconds and milliseconds in SQL
A way to extract from a DateTime value data without seconds
This does not work if you're comparing gaps between dates (the 60-second rule above), but you can try it if you need to get the records that belong to the same minute.
SELECT
[Date1] = MIN([DateColumn])
,[Date2] = MAX([DateColumn])
,[Count] = COUNT([DateColumn])
FROM
[my_table]
GROUP BY
DATEADD(mi, DATEDIFF(mi, 0, [DateColumn]), 0);