I have a 'Service' table with millions of rows. Each row corresponds to a service provided by a staff in a given date and time interval (Each row has a unique ID). There are cases where a staff might provide services in overlapping time frames. I need to write a query that merges overlapping time intervals and returns the data in the format shown below.
I tried grouping by StaffID and Date fields and getting the Min of BeginTime and Max of EndTime but that does not account for the non-overlapping time frames. How can I accomplish this? Again, the table contains several million records so a recursive CTE approach might have performance issues. Thanks in advance.
Service Table
ID StaffID Date BeginTime EndTime
1 101 2014-01-01 08:00 09:00
2 101 2014-01-01 08:30 09:30
3 101 2014-01-01 18:00 20:30
4 101 2014-01-01 19:00 21:00
Output
StaffID Date BeginTime EndTime
101 2014-01-01 08:00 09:30
101 2014-01-01 18:00 21:00
Here is another sample data set with a query proposed by a contributor. http://sqlfiddle.com/#!6/bfbdc/3
The first two rows in the results set should be merged into one row (06:00-08:45) but it generates two rows (06:00-08:30 & 06:00-08:45)
Elsewhere I've answered a similar Date Packing question with a geometric strategy. Namely, I interperet the date ranges as a line, and utilize
geometry::UnionAggregate
to merge the ranges.Your question has two peculiarities though. First, it calls for sql-server-2008.
geometry::UnionAggregate
is not then avialable. However, download the microsoft library at https://github.com/microsoft/SQLServerSpatialTools and load it in as a clr assembly to your instance and you have it available asdbo.GeometryUnionAggregate
.But the real peculiarity that has my interest is the concern that you have several million rows to work with. So I thought I'd repeat the strategy here but with an added technique to improve it's performance. This technique will work well if you have a lot of your StaffID/date subsets that are the same.
First, let's build a numbers table. Swap this out with your favorite way to do it.
Then convert the dates to floats and use those floats to create geometrical points.
These points can then be turned into lines via STUnion and STEnvelope.
With your ranges now represented as geometric lines, merge them via
UnionAggregate
. The resulting geometry object 'lines' might contain multiple lines. But any overlapping lines turn into one line.You have one 'lines' object for each staffId/date combo. But depending on your dataset, there may be many 'lines' objects that are the same between these combos. This may very well be true if staff are expected to follow a routine and data is recorded to the nearest whatever.
So get a distinct lising of 'lines' objects. This should improve performance.
From this, extract the individual lines inside 'lines'. Envelope the lines, which ensures that the lines are stored only as their endpoints. Read the endpoint x values and convert them back to their time representations.
Keep the WKT representation to join it back to the combos later on.
Now just join your parsed data back to the StaffId/Date combos.
I only came up with a CTE query as the problem is there may be a chain of overlapping times, e.g. record 1 overlaps with record 2, record 2 with record 3 and so on. This is hard to resolve without CTE or some other kind of loops, etc. Please give it a go anyway.
The first part of the CTE query gets the services that start a new group and are do not have the same starting time as some other service (I need to have just one record that starts a group). The second part gets those that start a group but there's more then one with the same start time - again, I need just one of them. The last part recursively builds up on the starting group, taking all overlapping services.
Here is SQLFiddle with more records added to demonstrate different kinds of overlapping and duplicate times.
I couldn't use
ServiceID
as it would have to be ordered in the same way asBeginTime
.