Find total time worked with multiple jobs / orders

2019-08-25 06:14发布

问题:

I searched night and day back when I was first starting out in the sql world for an answer to this question. Could not find anything similar to this for my needs so I decided to ask and answer my own question in case others need help like I did.

Here is an example of the data I have. For simplicity, it is all from the Job table. Each JobID has it's own Start and End time that are basically random and can overlap, have gaps, start and end at the same time as other jobs etc.

--Available--
JobID  WorkerID  JobStart             JobEnd
1      25        '2012-11-17 16:00'  '2012-11-17 17:00'
2      25        '2012-11-18 16:00'  '2012-11-18 16:50'
3      25        '2012-11-19 18:00'  '2012-11-19 18:30'
4      25        '2012-11-19 17:30'  '2012-11-19 18:10'
5      26        '2012-11-18 16:00'  '2012-11-18 17:10'
6      26        '2012-11-19 16:00'  '2012-11-19 16:50'

What I wanted the result of the query to show would be:

WorkerID  TotalTime(in Mins)
25        170
26        120

EDIT: Forgot to mention that the overlaps need to be ignored. Basically this is supposed to treat these workers and their jobs like you would an hourly employee and not a contractor. Like if I worked two jobIDs and started and finished them both from 12:00pm to 12:30pm, as an employee I would only get paid for 30 mins, whereas a contractor would likely get paid 60 mins, since their jobs are treated individually and get paid per job. The point of this query is to analyze jobs in a database that are tied to a worker, and need to find out if that worker was treated as an employee, what would his total hours worked in a given set of time come out to be.

EDIT2: won't let me answer my own question for 7 hours, will move it there later.

Ok, Answering Question now. Basically, I use temp table to build each minute between the min and max datetime of the jobs I am looking up.

IF OBJECT_ID('tempdb..#time') IS NOT NULL
BEGIN
drop table #time
END
DECLARE @FromDate AS DATETIME,
     @ToDate AS DATETIME,
     @Current AS DATETIME
SET @FromDate = '2012-11-17 16:00'
SET @ToDate = '2012-11-19 18:30'

create table #time  (cte_start_date datetime)
set @current = @FromDate
while (@current < @ToDate)
begin

insert into #time (cte_start_date)
values (@current)

set @current = DATEADD(n, 1, @current)

end

Now I have all the mins in a temp table. Now I need to join all the Job table info into it and select out what I need in one go.

SELECT J.WorkerID
,COUNT(DISTINCT t.cte_start_date) AS TotalTime
FROM #time AS t
INNER JOIN Job AS J ON t.cte_start_date >= J.JobStart AND t.cte_start_date < J.JobEnd --Thanks ErikE
GROUP BY J.WorkerID --Thanks Martin Parkin

drop table #time

That is the very simplified answer and is good to get someone started.

回答1:

This query does the job as well. Its performance is very good (while the execution plan looks not so great, the actual CPU and IO beat many other queries).

See it working in a Sql Fiddle.

WITH Times AS (
   SELECT DISTINCT
      H.WorkerID,
      T.Boundary
   FROM
      dbo.JobHistory H
      CROSS APPLY (VALUES (H.JobStart), (H.JobEnd)) T (Boundary)
), Groups AS (
   SELECT
      WorkerID,
      T.Boundary,
      Grp = Row_Number() OVER (PARTITION BY T.WorkerID ORDER BY T.Boundary) / 2
   FROM
      Times T
      CROSS JOIN (VALUES (1), (1)) X (Dup)
), Boundaries AS (
   SELECT
      G.WorkerID,
      TimeStart = Min(Boundary),
      TimeEnd = Max(Boundary)
   FROM
      Groups G
   GROUP BY
      G.WorkerID,
      G.Grp
   HAVING
      Count(*) = 2
)
SELECT
   B.WorkerID,
   WorkedMinutes = Sum(DateDiff(minute, 0, B.TimeEnd - B.TimeStart))
FROM
   Boundaries B
WHERE
   EXISTS (
      SELECT *
      FROM dbo.JobHistory H
      WHERE
         B.WorkerID = H.WorkerID
         AND B.TimeStart < H.JobEnd
         AND B.TimeEnd > H.JobStart
   )
GROUP BY
   WorkerID
;

With a clustered index on WorkerID, JobStart, JobEnd, JobID, and with the sample 7 rows from the above fiddle a template for new worker/job data repeated enough times to yield a table with 14,336 rows, here are the performance results. I've included the other working/correct answers on the page (so far):

Author  CPU  Elapsed  Reads   Scans
------  ---  -------  ------  -----
  Erik  157    166      122       2
Gordon  375    378    106964  53251

I did a more exhaustive test from a different (slower) server (where each query was run 25 times, the best and worst values for each metric were thrown out, and the remaining 23 values were averaged) and got the following:

Query     CPU   Duration  Reads   Notes
--------  ----  --------  ------  ----------------------------------
Erik 1    215   231       122     query as above
Erik 2    326   379       116     alternate technique with no EXISTS
Gordon 1  578   682       106847  from j
Gordon 2  584   673       106847  from dbo.JobHistory

The alternate technique I thought to be sure to improve things. Well, it saved 6 reads, but cost a lot more CPU (which makes sense). Instead of carrying through the start/end statistics of each timeslice to the end, it is best just recalculating which slices to keep with the EXISTS against the original data. It may be that a different profile of few workers with many jobs could change the performance statistics for different queries.

In case anyone wants to try it, use the CREATE TABLE and INSERT statements from my fiddle and then run this 11 times:

INSERT dbo.JobHistory
SELECT
   H.JobID + A.MaxJobID,
   H.WorkerID + A.WorkerCount,
   DateAdd(minute, Elapsed + 45, JobStart),
   DateAdd(minute, Elapsed + 45, JobEnd)
FROM
   dbo.JobHistory H
   CROSS JOIN (
      SELECT
         MaxJobID = Max(JobID),
         WorkerCount = Max(WorkerID) - Min(WorkerID) + 1,
         Elapsed = DateDiff(minute, Min(JobStart), Min(JobEnd))
      FROM dbo.JobHistory
   ) A
;

I built two other solutions to this query but the best one with about double the performance had a fatal flaw (not correctly handling fully enclosed time ranges). The other had very high/bad statistics (which I knew but had to try).

Explanation

Using all the endpoint times from each row, build up a distinct list of all possible time ranges of interest by duplicating each endpoint time and then grouping in such a way as to pair each time with the next possible time. Sum the elapsed minutes of these ranges wherever they coincide with any actual worker's working time.



回答2:

A query such as the following should provide the answer you are looking for:

SELECT  WorkerID,
    SUM(DATEDIFF(minute, JobStart, JobEnd)) AS TotalTime
  FROM  Job
  GROUP BY WorkerID

Apologies that it is untested (I have no SQL Server to test it here) but it should do the trick.



回答3:

This is a complicated query. Explanation follows.

with j as (
     select j.*,
            (select 1
             from jobs j2
             where j2.workerid = j.workerid and
                   j2.starttime < j.endtime and
                   j2.starttime > j.starttime
            ) as HasOverlap
     from jobs j
    )
select workerId,
       sum(datediff(minute, periodStart, PeriodEnd)) as NumMinutes
from (select workerId, min(startTime) as periodStart, max(endTime) as PeriodEnd
      from (select j.*,
                   (select min(starttime)
                    from j j2
                    where j2.workerid = j.workerid and
                          j2.starttime >= j.starttime and
                          j2.HasOverlap is null
                   ) as thegroup
            from j
           ) j
      group by workerId, thegroup
     ) j
group by workerId;

The key to understanding this approach is to understand the "overlap" logic. One time period overlaps with the next when the next start time is before the previous end time. By assigning an overlap flag to each record, we know if it overlaps with the "next" record. The above logic is using the start time for this. It might be better to use the JobId, especially if two jobs for the same worker could start at the same time.

The calculation of the overlap flag uses a correlated subquery (this is j in the with clause).

Then, for each record we go back and find the first record afterwards where the overlap value is NULL. This provides a grouping key for all records in a given overlap set.

The rest, then, is just to aggregate the results, first at the workerId/group level and then at the workerId level to get the final results.

I have not run this SQL, so it might have syntax errors.