Can this be simplified and therefore optimized

2019-08-11 08:09发布

问题:

We have this data set:

CREATE TABLE #Changes 
    (
    [GUID] varchar(250), 
    Value numeric(36,6), 
    DocumentNumber varchar(250), 
    Approved bit, 
    ApprovedDate varchar(250), 
    IssuedDate varchar(250), 
    Category varchar(250)
    );
INSERT INTO #Changes 
    (
    [GUID], 
    DocumentNumber, 
    Approved, 
    Value, 
    ApprovedDate, 
    IssuedDate, 
    Category
    ) 
 values
 ('4F7253A4E1B3D841B84D4A82B4F0E7A2', 11, 0, 18526.7, '', '2009-03-31T05:00:00Z', 'UNKNOWN'),
 ('D97537852E927B499C21C14F3D13CF06', 1, 0, 0, '', '2008-11-10T05:00:00Z', 'UNKNOWN'),
 ('857DADB463807345918729B33399B36F', 2, 0, 0, '', '2008-11-10T05:00:00Z', 'UNKNOWN'),
 ('7989D242E05AFF4FB5EE99114822BF80', 21, 0, 50112, '', '2009-07-22T05:00:00Z', 'UNKNOWN'),
 ('16A0AB27FD3A784D9E0A14406C7683E0', 3, 0, 0, '', '2009-01-15T05:00:00Z', 'UNKNOWN'),
 ('D3D7B1C306D38C438FC3DEDFCB57D411', 131, 0, 17204, '', '2010-12-14T05:00:00Z', 'UNKNOWN'),
 ('2C89D974DDF86743A0D7D62B385FBDEF', 147, 0, 0, '', '2010-12-01T05:00:00Z', 'UNKNOWN'),
 ('F371D4237C837D448824697EB0162905', 198, 0, 0, '', '2011-01-10T05:00:00Z', 'UNKNOWN'),
  ('433D64C871AE4E46A0E1BFCE2BB69BA7', 364, 0, 0, '', '2011-11-14T05:00:00Z', 'UNKNOWN'),
  ('808496DBDE76CB4F911396BB817724F3', 352, 0, 0, '', '2011-10-17T05:00:00Z', 'UNKNOWN'),
  ('9545DEF1666B5F4D8626F19F8E9E9333', 418, 0, 10948, '', '2012-03-07T22:19:18Z', 'UNKNOWN'),
  ('244D7D89B79E0F4E91100E4ADB300656', 439, 0, 50945, '', '2012-04-27T20:33:26Z', 'UNKNOWN'),
  ('115A427BBB1D2C43BA11D9E5875FAA2C', 465, 0, 480049, '', '2012-07-20T16:17:54Z', 'UNKNOWN'),
  ('3A2271EFCC767E4CA40017E68802F10C', 478, 0, 54298, '', '2012-08-01T17:26:38Z', 'UNKNOWN'),
  ('99D0EFC5A9F1AA498DB1A4CDF294129B', 490, 0, 11500, '', '2012-09-18T14:23:13Z', 'ALTER'),
  ('38B2E3A379C5084998E6A84D496AC555', 491, 0, 26088, '', '2012-09-25T06:00:00Z', 'ALTER'),
  ('8902831C8FAD4941841EE2847656BDAF', 494, 0, -825, '', '2012-10-16T14:20:06Z', 'ALTER'),
  ('7AFDB08A002AE54A8DE7699855AEBE30', 495, 0, 221, '', '2012-10-16T14:21:27Z', 'ALTER'),
  ('38A2CCEF5F0B294AA8B8752F461D121D', 496, 0, 0, '', '2012-12-24T01:11:15Z', 'ALTER'),
  ('24CCD5CE409E674593108CBD816DBCCE', 486, 1, -825, '2012-10-01T21:42:52Z', '2012-09-17T20:42:12Z', 'ALTER'),
  ('C7458704E36C8F448C1F3A485EB08304', 485, 1, 10000, '2012-10-01T21:25:56Z', '2012-09-11T21:29:44Z', 'ALTER'),
  ('B511953AE6FB6446A63AA83C159057BE', 487, 1, 82170, '2012-10-01T21:42:51Z', '2012-09-17T20:46:41Z', 'ALTER'),
  ('EC977BC304A971439D04BB9DF4D8188A',488, 1, 15500, '2012-10-01T20:58:15Z', '2012-09-18T06:00:00Z', 'ALTER'),
  ('D9B1F0C0A8E490448697B783639E09E0', 489, 1, 11503, '2012-10-01T21:42:50Z', '2012-09-18T13:56:18Z', 'ALTER'),
  ('698BB6D65832D146A49727C717A591A1', 492, 1, 2787, '2012-10-01T21:10:06Z', '2012-09-25T15:55:02Z', 'ALTER'),
  ('155D4F2B1854B34FABCDE8CF20F1E44C', 493, 1, 12162, '2012-10-01T21:10:06Z', '2012-09-25T16:04:40Z', 'ALTER'),
  ('137C9BF2B1EFD34B8831ADA70C5F9431', 1, 1, 369543, '2011-12-08T13:41:04Z', '1899-12-30T05:00:00Z', 'DRAW'),
  ('7F29FC7114BD10468AE92A047345B5DB', 2, 1, 7258, '2011-12-08T13:41:04Z', '2011-10-20T05:00:00Z', 'DRAW'),
  ('6B66D8EAD88E6E4FA29401CD524B978A', 3, 1, 979321, '2011-12-08T13:41:04Z', '2011-11-08T05:00:00Z', 'DRAW'),
  ('7F393B712B213041A6DD211E04F6DCA6', 4, 1, 14998, '2012-04-20T15:16:21Z', '2012-04-18T21:07:07Z', 'DRAW'),
  ('2255F84E7C7DA04389765724872D6413', 5, 1, 58926, '2012-04-20T15:16:23Z', '2012-04-18T21:13:15Z', 'DRAW'),
  ('DB4A5588DEB9F34C868F7AD1CB13ACC3', 6, 1, 13232, '2012-04-20T15:16:05Z', '2012-04-18T21:17:00Z', 'DRAW'),
  ('B5231AE40F8E7D41BA0A4D09614CBDF9', 7, 1, 10176, '2012-04-20T15:16:25Z', '2012-04-18T21:19:41Z', 'DRAW'),
  ('2362D54FCC53E447AC7D8289EA89FD05', 8, 1, 17556, '2012-04-20T15:16:04Z', '2012-04-18T21:21:20Z', 'DRAW'),
  ('6ED4565CA041704B8D006EDA4A1E4CF9', 9, 1, 399639, '2012-05-30T16:32:43Z', '2012-05-17T06:00:00Z', 'DRAW'),
  ('B21BE07E3E42C2418C70AD17862D3AE1', 10, 1, 6231, '2012-08-16T16:55:00Z', '2012-08-02T16:02:03Z', 'DRAW'),
  ('8FD252A50137754A98698F93AC9B01A7', 11, 1, 629, '2012-08-16T16:54:58Z', '2012-08-02T16:07:57Z', 'DRAW'),
  ('1B9AFD2C20362F48A486E8A535B29AF5', 20, 1, -113810, '2011-12-13T17:15:53Z', '2010-02-10T05:00:00Z', 'UNKNOWN');

Here's the query:

SELECT 
    a.[GUID], 
    [positive_previous_total] = SUM(CASE WHEN b.Value>0 THEN b.Value ELSE 0 END), 
    [negative_previous_total] = SUM(CASE WHEN b.Value<0 THEN b.Value ELSE 0 END) 
FROM 
    #Changes a 
    LEFT OUTER JOIN #Changes b 
        ON 
        b.[GUID]    <> a.[GUID] AND
        b.Approved  = a.Approved AND
        b.Category  = a.Category 
        AND 
        (
        ISNULL(SUBSTRING(CASE WHEN b.Approved=1 THEN b.ApprovedDate ELSE b.IssuedDate END, 1, 10), '0000-00-00') 
          < ISNULL(SUBSTRING(CASE WHEN a.Approved=1 THEN a.ApprovedDate ELSE a.IssuedDate END, 1, 10), '0000-00-00')
        OR 
            (
            ISNULL(SUBSTRING(CASE WHEN b.Approved=1 THEN b.ApprovedDate ELSE b.IssuedDate END, 1, 10), '0000-00-00')
              =ISNULL(SUBSTRING(CASE WHEN a.Approved=1 THEN a.ApprovedDate ELSE a.IssuedDate END, 1, 10), '0000-00-00') 
            AND 
            b.DocumentNumber<a.DocumentNumber
            )
        ) 
GROUP BY a.[GUID]

It's quick on this number of records but when expanded to 700 recods it takes a couple of seconds which we'd like to bring down to 0.5s.

HERE IS A LIVE EXAMPLE OF THE SCHEMA AND SCRIPT ON SQL FIDDLE

回答1:

Personally, I would make use of Computed Columns to remove the complexity in the query.

For example: you table definition could become (noting that I have corrected the data types here also):

CREATE TABLE #Changes 
(
    [GUID] varchar(250), 
    Value numeric(36,6), 
    DocumentNumber varchar(250), 
    Approved numeric(36,6), -- Is there any reason this is not a BIT field?
    ApprovedDate datetime,
    ApprovedDate_NoTime AS (CASE WHEN ApprovedDate IS NULL THEN CONVERT(DATETIME, '0000-00-00') ELSE DATEADD(DAY, DATEDIFF(DAY, 0, ApprovedDate), 0) END) PERSISTED
    IssuedDate datetime, 
    IssuedDate_NoTime AS (CASE WHEN IssuedDate IS NULL THEN CONVERT(DATETIME, '0000-00-00') ELSE DATEADD(DAY, DATEDIFF(DAY, 0, IssuedDate), 0) END) PERSISTED
    ApprovedOrIssuedDate AS (CASE WHEN Approved = 1 THEN ApprovedDate_NoTime ELSE IssuedDate_NoTime) PERSISTED,
    Category varchar(250)
);

Then you can re-write your query to be:

SELECT 
    a.[GUID], 
    [positive_previous_total] = SUM(CASE WHEN b.Value>0 THEN b.Value ELSE 0 END), 
    [negative_previous_total] = SUM(CASE WHEN b.Value<0 THEN b.Value ELSE 0 END) 
FROM 
    #Changes a 
    LEFT OUTER JOIN #Changes b 
        ON 
        b.[GUID]    <> a.[GUID] AND
        b.Approved  = a.Approved AND
        b.Category  = a.Category 
        AND 
        (
            b.ApprovedOrIssuedDate < a.ApprovedOrIssuedDate
            OR 
            (
                b.ApprovedOrIssuedDate = a.ApprovedOrIssuedDate
                AND 
                b.DocumentNumber < a.DocumentNumber
            )
        ) 
GROUP BY a.[GUID]

Is there any reason why Approved is not a bit field? I have changed the data types to datetime and provided you with a column to zero out the time portion of date. Also, not tested, but you get the idea.

I would also read this and this (Assuming you want to ignore time portion of datetime in the comparisons)



回答2:

OK, a few things:

1. Never, Ever Store Dates as Strings.

Unless this really is a #temporary table in your database (in which case there's some other questions/issues) you should absolutely never store a true date as a string. This will cause nothing but problems down the road.

In 30+ years of consulting I have seen literally hundreds of databases with dates stored as strings, and every single one of them had invalid date-strings in them.

2. Always Add Appropriate Keys and Indexes To Your Tables.

This applies even to #Temp tables unless they are small or you know for sure that they will not help.

In your case you probably ought to have a Unique/Primary Key on the GUID. And for performance, you should have an index on {GUID, Approved, Category} (possibly Clustered).



回答3:

I tried a few things - here's the scoop:

  • I added a clustered index on a new INT IDENTITY primary key column to the temp table - yes, that seems contraproductive, but in many cases, it's actually not; it does speed up lots of things - even inserts and deletes! See The Clustered Index Debate Continues... by Kimberly Tripp for a point-by-point run-down as to why that's the case

  • I also made the ApprovedDate and IssuedDate to be actual DATETIME datatypes - not varchars. If it feels like a date, looks like a date, quacks like a date - then it's a date and should be stored as such!

    See Bad habits to kick : choosing the wrong data type - you should always use the most appropriate data type - that's what they're there for, after all!

  • I added an index on the columns that are used as foreign keys in the JOIN to speed up the query

  • I added a computed column that encapsulates the whole if it's approved, use the ApprovedDate, otherwise the IssuedDate logic into one place - makes the query much easier to read! Since this is a pseudo-DATE (time portion is zeroed out), this basically takes care of all the ugly CONVERT / ISNULL and SUBSTRING statements you had (repeatedly) in the query.

So here's my changed script to create the temp table

CREATE TABLE #Changes 
    (
    ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    [GUID] varchar(250), 
    Value numeric(36,6), 
    DocumentNumber varchar(250), 
    Approved numeric(36,6), 
    ApprovedDate DATETIME,
    IssuedDate DATETIME, 
    Category varchar(250),

    ApprovedOrIssuedDate AS CASE 
                               WHEN Approved = 1 
                               THEN DATEADD(DAY, 0, DATEDIFF(DAY, 0, ApprovedDate))
                               ELSE DATEADD(DAY, 0, DATEDIFF(DAY, 0, IssuedDate))
                            END PERSISTED
    );

CREATE NONCLUSTERED INDEX IX_Index01 ON #Changes([GUID], Approved, Category) 
                                     INCLUDE(DocumentNumber, ApprovedDate, IssuedDate)

and then your query becomes much easier:

SELECT 
    a.[GUID], 
    [positive_previous_total] = SUM(CASE WHEN b.Value > 0 THEN b.Value ELSE 0 END), 
    [negative_previous_total] = SUM(CASE WHEN b.Value < 0 THEN b.Value ELSE 0 END) 
FROM 
    #Changes a 
LEFT OUTER JOIN 
    #Changes b ON b.[GUID] <> a.[GUID] 
               AND b.Approved  = a.Approved 
               AND b.Category  = a.Category 
               AND 
                  (b.ApprovedOrIssuedDate < a.ApprovedOrISsuedDate
                   OR 
                   (b.ApprovedOrIssuedDate = a.ApprovedOrIssuedDate
                    AND b.DocumentNumber < a.DocumentNumber)
                  ) 
GROUP BY 
     a.[GUID]

In my measurements, I got a nice and quite significant improvement in the query cost (down from 0.022 to about 0.0146)



回答4:

Fundamentally, you are trying to do a cumulative sum. In versions of SQL Server prior to 2012, you have to do it using a join the way you are doing it (or a correlated subuqery, which should have a similar execution plan). I simplified your query to this:

SELECT a.[GUID], 
       [positive_previous_total] = SUM(CASE WHEN b.Value>0 THEN b.Value ELSE 0 END), 
       [negative_previous_total] = SUM(CASE WHEN b.Value<0 THEN b.Value ELSE 0 END) 
FROM #Changes a LEFT OUTER JOIN
     #Changes b 
     ON b.[GUID]    <> a.[GUID] AND
         b.Approved  = a.Approved AND
         b.Category  = a.Category and
         ((b.ApprovedDate < a.ApprovedDate and a.Approved = 1) or
          (b.IssuedDate < a.IssuedDate and a.Approved <> 1)
         ) or
         ((b.ApprovedDate = a.ApprovedDate and a.Approved = 1 and b.DocumentNumber<a.DocumentNumber) or
          (b.IssuedDate = a.IssuedDate and a.Approved <> 1 and b.DocumentNumber<a.DocumentNumber)
         ) 
        ) 
GROUP BY a.[GUID]

The isNULL is unnecessary. In the definition of #Change, you should simply replace the value with what you want. Also, because a.Approved = b.Approved by the join condition, you don't need the case statement.

Since you are creating the #Changes table, you should add a DateDoc key to it. This would have the date -- as definied as either ApprovedDate or IssueDate or appropriate NULL date -- with the document number concatenated on it. The document number should be padded on the left with 0s. The date format should be YYYYMMDD.

Now, you can write the from clause as:

FROM #Changes a LEFT OUTER JOIN
     #Changes b 
     ON b.[GUID]    <> a.[GUID] AND
         b.Approved  = a.Approved AND
         b.Category  = a.Category and
         b.datedoc < a.datedoc

With this structure, I'm thinking that an index on GUID, Approved, Category, and Datedoc would help the query. I'm not sure if adding value at the end would help. But, your data should fit in memory because you do not have very many rows.