How do I find records that have the same value in

2019-05-24 08:43发布

问题:

Finding the start and end time for adjacent records that have the same value?

I have a table that contains heart rate readings (in beats per minute) and datetime field. (Actually the fields are heartrate_id, heartrate, and datetime.) The data are generated by a device that records the heart rate and time every 6 seconds. Sometimes the heart rate monitor will give false readings and the recorded beats per minute will "stick" for an period of time. By sticks, I mean the beats per minute value will be identical in adjacent times.

Basically I need to find all the records where the heart rate is the same (e.g. 5 beats per minute, 100 beats per minute, etc.) in but only on adjacent records. If the device records 25 beats per minute for 3 consecutive reading (or 100 consecutive readings) I need to locate these events. The results need to have the heartrate, time the heartrate started, and the time the heart rate ended and ideally the results would look more of less like this:

heartrate starttime endtime
--------- --------- --------
1.00      21:12:00  21:12:24
35.00     07:00:12  07:00:36

I've tried several different approaches but so far I'm striking out. Any help would be greatly appreciated!

回答1:

EDIT:

Upon review, none of my original work on this answer was very good. This actually belongs to the class of problems known as gaps-and-islands, and this revised answer will use information I've gleaned from similar questions/learned since first answering this question.

It turns out this query can be done a lot more simply than I originally thought:

WITH Grouped_Run AS (SELECT heartRate, dateTime,
                            ROW_NUMBER() OVER(ORDER BY dateTime) -  
                            ROW_NUMBER() OVER(PARTITION BY heartRate ORDER BY dateTime) AS groupingId
                     FROM HeartRate)

SELECT heartRate, MIN(dateTime), MAX(dateTime)
FROM Grouped_Run
GROUP BY heartRate, groupingId
HAVING COUNT(*) > 2

SQL Fiddle Demo


So what's happening here? One of the definitions of gaps-and-islands problems is the need for "groups" of consecutive values (or lack thereof). Often sequences are generated to solve this, exploiting an often overlooked/too-intuitive fact: subtracting sequences yields a constant value.

For example, imagine the following sequences, and the subtraction (the values in the rows are unimportant):

position   positionInGroup  subtraction
=========================================
1          1                0
2          2                0
3          3                0
4          1                3
5          2                3
6          1                5
7          4                3
8          5                3

position is a simple sequence generated over all records.
positionInGroup is a simple sequence generated for each set of different records. In this case, there's actually 3 different sets of records (starting at position = 1, 4, 6).
subtraction is the result of the difference between the other two columns. Note that values may repeat for different groups!
One of the key properties the sequences must share is they must be generated over the rows of data in the same order, or this breaks.

So how is SQL doing this? Through the use of ROW_NUMBER() this function will generate a sequence of numbers over a "window" of records:

ROW_NUMBER() OVER(ORDER BY dateTime)

will generate the position sequence.

ROW_NUMBER() OVER(PARTITION BY heartRate ORDER BY dateTime)

will generate the positionInGroup sequence, with each heartRate being a different group.
In the case of most queries of this type, the values of the two sequences is unimportant, it's the subtraction (to get the sequence group) that matters, so we just need the result of the subtraction.
We'll also need the heartRate and the times in which they occurred to provide the answer.

The original answer asked for the start and end times of each of the "runs" of stuck heartbeats. That's a standard MIN(...)/MAX(...), which means a GROUP BY. We need to use both the original heartRate column (because that's a non-aggregate column) and our generated groupingId (which identifies the current "run" per stuck value).

Part of the question asked for only runs that repeated three or more times. The HAVING COUNT(*) > 2 is an instruction to ignore runs of length 2 or less; it counts rows per-group.



回答2:

I recommend Ben-Gan's article on interval packing, which applies to your adjacency problem.

tsql-challenge-packing-date-and-time-intervals

solutions-to-packing-date-and-time-intervals-puzzle