T-SQL query to flag repeat records

2019-05-18 22:12发布

问题:

I have a table that will have 500,000+ records. Each record has a LineNumber field which is not unique and not part of the primary key. Each record has a CreatedOn field.

I need to update all 500,000+ records to identify repeat records.

A repeat records is defined by a record that has the same LineNumber within the last seven days of its CreatedOn field.

In the diagram above row 4 is a repeat because it occurred only five days since row 1. Row 6 is not a repeat even though it occurs only four days since row 4, but row 4 itself is already a repeat, so Row 6 can only be compared to Row 1 which is nine days prior to Row 6, therefore Row 6 is not a repeat.

I do not know how to update the IsRepeat field with stepping through each record one-by-one via a cursor or something.

I do not believe cursors is the way to go, but I'm stuck with any other possible solution.

I have considered maybe Common Table Expressions may be of help but I have no experience with them and have no idea where to start.

Basically this same process needs to be done on the table every day as the table is truncated and re-populated every single day. Once the table is re-populated, I have to go through and re-mark each record if it is a repeat or not.

Some assistance would be most appreciated.

UPDATE

Here is a script to create a table and insert test data

USE [Test]
GO

/****** Object:  Table [dbo].[Job]    Script Date: 08/18/2009 07:55:25 ******/
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Job]') AND type in (N'U'))
DROP TABLE [dbo].[Job]
GO

USE [Test]
GO

/****** Object:  Table [dbo].[Job]    Script Date: 08/18/2009 07:55:25 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Job]') AND type in (N'U'))
BEGIN
CREATE TABLE [dbo].[Job](
    [JobID] [int] IDENTITY(1,1) NOT NULL,
    [LineNumber] [nvarchar](20) NULL,
    [IsRepeat] [bit] NULL,
    [CreatedOn] [smalldatetime] NOT NULL,
 CONSTRAINT [PK_Job] PRIMARY KEY CLUSTERED 
(
    [JobID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
END
GO


SET NOCOUNT ON

INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-01 07:52:08')
INSERT INTO dbo.Job VALUES ('1019',NULL,'2009-07-01 08:30:01')
INSERT INTO dbo.Job VALUES ('1028',NULL,'2009-07-01 09:30:35')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-01 10:51:10')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-02 09:22:30')
INSERT INTO dbo.Job VALUES ('1027',NULL,'2009-07-02 10:27:28')
INSERT INTO dbo.Job VALUES (NULL,NULL,'2009-07-02 11:15:33')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-02 13:01:13')
INSERT INTO dbo.Job VALUES ('1014',NULL,'2009-07-03 12:05:56')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-03 13:57:34')
INSERT INTO dbo.Job VALUES ('1025',NULL,'2009-07-03 15:38:54')
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-04 16:32:20')
INSERT INTO dbo.Job VALUES ('1025',NULL,'2009-07-05 13:46:46')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-05 15:08:35')
INSERT INTO dbo.Job VALUES ('1000',NULL,'2009-07-05 15:19:50')
INSERT INTO dbo.Job VALUES ('1011',NULL,'2009-07-05 16:37:19')
INSERT INTO dbo.Job VALUES ('1019',NULL,'2009-07-05 17:14:09')
INSERT INTO dbo.Job VALUES ('1009',NULL,'2009-07-05 20:55:08')
INSERT INTO dbo.Job VALUES (NULL,NULL,'2009-07-06 08:29:29')
INSERT INTO dbo.Job VALUES ('1002',NULL,'2009-07-07 11:22:38')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-07 12:25:23')
INSERT INTO dbo.Job VALUES ('1023',NULL,'2009-07-08 09:32:07')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-08 09:46:33')
INSERT INTO dbo.Job VALUES ('1016',NULL,'2009-07-08 10:09:08')
INSERT INTO dbo.Job VALUES ('1023',NULL,'2009-07-09 10:45:04')
INSERT INTO dbo.Job VALUES ('1027',NULL,'2009-07-09 11:31:23')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-09 13:10:06')
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-09 15:04:06')
INSERT INTO dbo.Job VALUES ('1010',NULL,'2009-07-09 17:32:16')
INSERT INTO dbo.Job VALUES ('1012',NULL,'2009-07-09 19:51:28')
INSERT INTO dbo.Job VALUES ('1000',NULL,'2009-07-10 15:09:42')
INSERT INTO dbo.Job VALUES ('1025',NULL,'2009-07-10 16:15:31')
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-10 21:55:43')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-11 08:49:03')
INSERT INTO dbo.Job VALUES ('1022',NULL,'2009-07-11 16:47:21')
INSERT INTO dbo.Job VALUES ('1026',NULL,'2009-07-11 18:23:16')
INSERT INTO dbo.Job VALUES ('1010',NULL,'2009-07-11 19:49:31')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-12 11:57:26')
INSERT INTO dbo.Job VALUES ('1003',NULL,'2009-07-13 08:32:20')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-13 09:31:32')
INSERT INTO dbo.Job VALUES ('1021',NULL,'2009-07-14 09:52:54')
INSERT INTO dbo.Job VALUES ('1021',NULL,'2009-07-14 11:22:31')
INSERT INTO dbo.Job VALUES ('1023',NULL,'2009-07-14 11:54:14')
INSERT INTO dbo.Job VALUES (NULL,NULL,'2009-07-14 15:17:08')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-15 13:27:08')
INSERT INTO dbo.Job VALUES ('1010',NULL,'2009-07-15 14:10:56')
INSERT INTO dbo.Job VALUES ('1011',NULL,'2009-07-15 15:20:50')
INSERT INTO dbo.Job VALUES ('1028',NULL,'2009-07-15 15:39:18')
INSERT INTO dbo.Job VALUES ('1012',NULL,'2009-07-15 16:06:17')
INSERT INTO dbo.Job VALUES ('1017',NULL,'2009-07-16 11:52:08')

SET NOCOUNT OFF
GO

回答1:

Ignores LineNumber is null. How should IsRepeat be handled in that case?

It works for test data. Whether it will be efficient enough for production volumes?

In the case of duplicate (LineNumber, CreatedOn) on pairs, arbitrarily choose one. (The one with minimum JobId)

Basic idea:

  1. Get all JobId pairs that are at least seven days apart, by line number.
  2. Count the number of rows that are more than seven days from the left side, upto and including the right side. (CNT)
  3. Then we know if JobId x is not a repeat, the next not a repeat is the pair with X on the left side, and CNT = 1
  4. Use recursive CTE to start with the first row for each LineNumber
  5. Recursive element uses the pair with counts to get the next row.
  6. Finally update, setting all IsRepeat to 0 for non-repeats and 1 for everything else.

; with AllPairsByLineNumberAtLeast7DaysApart (LineNumber
            , LeftJobId
            , RightJobId
            , BeginCreatedOn
            , EndCreatedOn) as
        (select l.LineNumber
            , l.JobId
            , r.JobId
            , dateadd(day, 7, l.CreatedOn)
            , r.CreatedOn
        from Job l
        inner join Job r
            on l.LineNumber = r.LineNumber
            and dateadd(day, 7, l.CreatedOn) < r.CreatedOn
            and l.JobId <> r.JobId)
    -- Count the number of rows within from BeginCreatedOn 
    -- up to and including EndCreatedOn
    -- In the case of CreatedOn = EndCreatedOn, 
    -- include only jobId <= jobid, to handle ties in CreatedOn        
    , AllPairsCount(LineNumber, LeftJobId, RightJobId, Cnt) as
        (select ap.LineNumber, ap.LeftJobId, ap.RightJobId, count(*)
        from AllPairsByLineNumberAtLeast7DaysApart ap
        inner join Job j
            on j.LineNumber = ap.LineNumber
            and ap.BeginCreatedOn <= j.createdOn
            and (j.CreatedOn < ap.EndCreatedOn
                or (j.CreatedOn = ap.EndCreatedOn 
                    and j.JobId <= ap.RightJobId))
         group by ap.LineNumber, ap.LeftJobId, ap.RightJobId)
    , Step1 (LineNumber, JobId, CreatedOn, RN) as
        (select LineNumber, JobId, CreatedOn
            , row_number() over 
                (partition by LineNumber order by CreatedOn, JobId)
        from Job)
    , Results (JobId, LineNumber, CreatedOn) as    
        -- Start with the first rows.
        (select JobId, LineNumber, CreatedOn
        from Step1
        where RN = 1
        and LineNumber is not null
        -- get the next row
        union all
        select j.JobId, j.LineNumber, j.CreatedOn
        from Results r
        inner join AllPairsCount apc on apc.LeftJobId = r.JobId
        inner join Job j
            on j.JobId = apc.RightJobId
            and apc.CNT = 1)
    update j
    set IsRepeat = case when R.JobId is not null then 0 else 1 end
    from Job j
    left outer join Results r
        on j.JobId = R.JobId
    where j.LineNumber is not null

EDIT:

After I turned off the computer last night I realized I had made things more complicated than they needed to be. A more straightforward (and on the test data, slightly more effecient) query:

Basic Idea:

  1. Generated PotentialStep (FromJobId, ToJobId) These are the pairs where if FromJobId is not a repeat, than ToJobId is also not a repeat. (First row by LineNumber more than seven days from FromJobId)
  2. Use a recursive CTE to start from the first JobId for each LineNumber and then step, using PontentialSteps, to each Non Repeating JobId

; with PotentialSteps (FromJobId, ToJobId) as
    (select FromJobId, ToJobId
    from (select f.JobId as FromJobId
            , t.JobId as ToJobId
            , row_number() over
                 (partition by f.LineNumber order by t.CreatedOn, t.JobId) as RN
        from Job f
        inner join Job t
            on f.LineNumber = t.LineNumber
            and dateadd(day, 7, f.CreatedOn) < t.CreatedOn) t
        where RN = 1)
, NonRepeats (JobId) as
    (select JobId
    from (select JobId
            , row_number() over
                (partition by LineNumber order by CreatedOn, JobId) as RN
        from Job) Start
    where RN = 1
    union all
    select J.JobId
    from NonRepeats NR
    inner join PotentialSteps PS
        on NR.JobId = PS.FromJobId
    inner join Job J
        on PS.ToJobId = J.JobId)
update J
set IsRepeat = case when NR.JobId is not null then 0 else 1 end
from Job J
left outer join NonRepeats NR
on J.JobId = NR.JobId
where J.LineNumber is not null


回答2:

UPDATE Jobs 
SET Jobs.IsRepeat = 0 -- mark all of them IsRepeat = false

UPDATE Jobs 
SET Jobs.IsRepeat = 1
WHERE EXISTS 
   (SELECT TOP 1 i.LineNumber FROM Jobs i WHERE i.LineNumber = Jobs.LineNumber
    AND i.CreatedOn <> Jobs.CreatedOn and i.CreatedOn BETWEEN Jobs.CreatedOn - 7 
    AND Jobs.CreatedOn)

NOTE: I hope this helps you somewhat. Let me know, if you find any discrepancy that you will come across on a larger data set.



回答3:

I'm not proud of this, it makes many assumptions (e.g. that CreatedOn is date only, and (LineNUmber,CreatedOn) is a key. Some tuning may be required, only works with test data.

In other words, I created this more for intellectual curiosity rather than because I think it's a genuine solution. Final select could be an update to set IsRepeat in the base table, based on existence on rows in V4. Final note before letting people see evil - could people please post test data in comments for data sets that it doesn't work for. It might be possible to turn this into a real solution:

with V1 as (
select t1.LineNumber,t1.CreatedOn,t2.CreatedOn as PrevDate from
T1 t1 inner join T1 t2 on t1.LineNumber = t2.LineNumber and t1.CreatedOn > t2.CreatedOn and DATEDIFF(DAY,t2.CreatedOn,t1.CreatedOn) < 7
), V2 as (
select v1.LineNumber,v1.CreatedOn,V1.PrevDate from V1
union all
select v1.LineNumber,v1.CreatedOn,v2.PrevDate from v1 inner join v2 on V1.LineNumber = v2.LineNumber and v1.PrevDate = v2.CreatedOn
), V3 as (
select LineNumber,CreatedOn,MIN(PrevDate) as PrevDate from V2 group by LineNumber,CreatedOn
), V4 as (
select LineNumber,CreatedOn from V3 where DATEDIFF(DAY,PrevDate,CreatedOn) < 7
)
select
    T1.LineNumber,
    T1.CreatedOn,
    CASE WHEN V4.LineNumber is Null then 0 else 1 end as IsRepeat
from
    T1
        left join
    V4
        on
            T1.LineNumber = V4.LineNumber and
            T1.CreatedOn = V4.CreatedOn
order by T1.CreatedOn,T1.LineNumber
option (maxrecursion 7)


标签: tsql