I searched night and day back when I was first starting out in the sql world for an answer to this question. Could not find anything similar to this for my needs so I decided to ask and answer my own question in case others need help like I did.
Here is an example of the data I have. For simplicity, it is all from the Job table. Each JobID has it's own Start and End time that are basically random and can overlap, have gaps, start and end at the same time as other jobs etc.
--Available--
JobID WorkerID JobStart JobEnd
1 25 '2012-11-17 16:00' '2012-11-17 17:00'
2 25 '2012-11-18 16:00' '2012-11-18 16:50'
3 25 '2012-11-19 18:00' '2012-11-19 18:30'
4 25 '2012-11-19 17:30' '2012-11-19 18:10'
5 26 '2012-11-18 16:00' '2012-11-18 17:10'
6 26 '2012-11-19 16:00' '2012-11-19 16:50'
What I wanted the result of the query to show would be:
WorkerID TotalTime(in Mins)
25 170
26 120
EDIT: Forgot to mention that the overlaps need to be ignored. Basically this is supposed to treat these workers and their jobs like you would an hourly employee and not a contractor. Like if I worked two jobIDs and started and finished them both from 12:00pm to 12:30pm, as an employee I would only get paid for 30 mins, whereas a contractor would likely get paid 60 mins, since their jobs are treated individually and get paid per job. The point of this query is to analyze jobs in a database that are tied to a worker, and need to find out if that worker was treated as an employee, what would his total hours worked in a given set of time come out to be.
EDIT2: won't let me answer my own question for 7 hours, will move it there later.
Ok, Answering Question now. Basically, I use temp table to build each minute between the min and max datetime of the jobs I am looking up.
IF OBJECT_ID('tempdb..#time') IS NOT NULL
BEGIN
drop table #time
END
DECLARE @FromDate AS DATETIME,
@ToDate AS DATETIME,
@Current AS DATETIME
SET @FromDate = '2012-11-17 16:00'
SET @ToDate = '2012-11-19 18:30'
create table #time (cte_start_date datetime)
set @current = @FromDate
while (@current < @ToDate)
begin
insert into #time (cte_start_date)
values (@current)
set @current = DATEADD(n, 1, @current)
end
Now I have all the mins in a temp table. Now I need to join all the Job table info into it and select out what I need in one go.
SELECT J.WorkerID
,COUNT(DISTINCT t.cte_start_date) AS TotalTime
FROM #time AS t
INNER JOIN Job AS J ON t.cte_start_date >= J.JobStart AND t.cte_start_date < J.JobEnd --Thanks ErikE
GROUP BY J.WorkerID --Thanks Martin Parkin
drop table #time
That is the very simplified answer and is good to get someone started.
This query does the job as well. Its performance is very good (while the execution plan looks not so great, the actual CPU and IO beat many other queries).
See it working in a Sql Fiddle.
WITH Times AS (
SELECT DISTINCT
H.WorkerID,
T.Boundary
FROM
dbo.JobHistory H
CROSS APPLY (VALUES (H.JobStart), (H.JobEnd)) T (Boundary)
), Groups AS (
SELECT
WorkerID,
T.Boundary,
Grp = Row_Number() OVER (PARTITION BY T.WorkerID ORDER BY T.Boundary) / 2
FROM
Times T
CROSS JOIN (VALUES (1), (1)) X (Dup)
), Boundaries AS (
SELECT
G.WorkerID,
TimeStart = Min(Boundary),
TimeEnd = Max(Boundary)
FROM
Groups G
GROUP BY
G.WorkerID,
G.Grp
HAVING
Count(*) = 2
)
SELECT
B.WorkerID,
WorkedMinutes = Sum(DateDiff(minute, 0, B.TimeEnd - B.TimeStart))
FROM
Boundaries B
WHERE
EXISTS (
SELECT *
FROM dbo.JobHistory H
WHERE
B.WorkerID = H.WorkerID
AND B.TimeStart < H.JobEnd
AND B.TimeEnd > H.JobStart
)
GROUP BY
WorkerID
;
With a clustered index on WorkerID, JobStart, JobEnd, JobID
, and with the sample 7 rows from the above fiddle a template for new worker/job data repeated enough times to yield a table with 14,336 rows, here are the performance results. I've included the other working/correct answers on the page (so far):
Author CPU Elapsed Reads Scans
------ --- ------- ------ -----
Erik 157 166 122 2
Gordon 375 378 106964 53251
I did a more exhaustive test from a different (slower) server (where each query was run 25 times, the best and worst values for each metric were thrown out, and the remaining 23 values were averaged) and got the following:
Query CPU Duration Reads Notes
-------- ---- -------- ------ ----------------------------------
Erik 1 215 231 122 query as above
Erik 2 326 379 116 alternate technique with no EXISTS
Gordon 1 578 682 106847 from j
Gordon 2 584 673 106847 from dbo.JobHistory
The alternate technique I thought to be sure to improve things. Well, it saved 6 reads, but cost a lot more CPU (which makes sense). Instead of carrying through the start/end statistics of each timeslice to the end, it is best just recalculating which slices to keep with the EXISTS
against the original data. It may be that a different profile of few workers with many jobs could change the performance statistics for different queries.
In case anyone wants to try it, use the CREATE TABLE
and INSERT
statements from my fiddle and then run this 11 times:
INSERT dbo.JobHistory
SELECT
H.JobID + A.MaxJobID,
H.WorkerID + A.WorkerCount,
DateAdd(minute, Elapsed + 45, JobStart),
DateAdd(minute, Elapsed + 45, JobEnd)
FROM
dbo.JobHistory H
CROSS JOIN (
SELECT
MaxJobID = Max(JobID),
WorkerCount = Max(WorkerID) - Min(WorkerID) + 1,
Elapsed = DateDiff(minute, Min(JobStart), Min(JobEnd))
FROM dbo.JobHistory
) A
;
I built two other solutions to this query but the best one with about double the performance had a fatal flaw (not correctly handling fully enclosed time ranges). The other had very high/bad statistics (which I knew but had to try).
Explanation
Using all the endpoint times from each row, build up a distinct list of all possible time ranges of interest by duplicating each endpoint time and then grouping in such a way as to pair each time with the next possible time. Sum the elapsed minutes of these ranges wherever they coincide with any actual worker's working time.
A query such as the following should provide the answer you are looking for:
SELECT WorkerID,
SUM(DATEDIFF(minute, JobStart, JobEnd)) AS TotalTime
FROM Job
GROUP BY WorkerID
Apologies that it is untested (I have no SQL Server to test it here) but it should do the trick.
This is a complicated query. Explanation follows.
with j as (
select j.*,
(select 1
from jobs j2
where j2.workerid = j.workerid and
j2.starttime < j.endtime and
j2.starttime > j.starttime
) as HasOverlap
from jobs j
)
select workerId,
sum(datediff(minute, periodStart, PeriodEnd)) as NumMinutes
from (select workerId, min(startTime) as periodStart, max(endTime) as PeriodEnd
from (select j.*,
(select min(starttime)
from j j2
where j2.workerid = j.workerid and
j2.starttime >= j.starttime and
j2.HasOverlap is null
) as thegroup
from j
) j
group by workerId, thegroup
) j
group by workerId;
The key to understanding this approach is to understand the "overlap" logic. One time period overlaps with the next when the next start time is before the previous end time. By assigning an overlap flag to each record, we know if it overlaps with the "next" record. The above logic is using the start time for this. It might be better to use the JobId, especially if two jobs for the same worker could start at the same time.
The calculation of the overlap flag uses a correlated subquery (this is j
in the with
clause).
Then, for each record we go back and find the first record afterwards where the overlap
value is NULL. This provides a grouping key for all records in a given overlap set.
The rest, then, is just to aggregate the results, first at the workerId
/group level and then at the workerId
level to get the final results.
I have not run this SQL, so it might have syntax errors.