How to convert Row by row execution in to SET base

2019-08-09 08:24发布

问题:

I'm working on a huge SQL code and unfortunately it has a CURSOR which handles another two nested CURSORS within it (totally three cursors inside a stored procedure), which handles millions of data to be DELETE,UPDATE and INSERT. This takes a whole lot of time because of row by row execution and I wish to modify this in to SET based approach

From many articles it shows use of CURSORs is not recommend and the alternate is to use WHILE loops instead, So I tried and replaced the three CUROSRs with three WHILE loops nothing more, though I get the same result but there is no improvement in performance, it took the same time as it took for CUROSRs.

Below is the basic structure of the code I'm working on (i Will try to put as simple as possible) and I will put the comments what they are supposed to do.

declare @projects table (
    ProjectID INT,
    fieldA int,
    fieldB int,
    fieldC int,
    fieldD int)

    INSERT INTO @projects
    SELECT ProjectID,fieldA,fieldB,fieldC, fieldD
    FROM ProjectTable

    DECLARE projects1 CURSOR LOCAL FOR /*First cursor - fetch the cursor from ProjectaTable*/
        Select ProjectID FROM @projects

    OPEN projects1
    FETCH NEXT FROM projects1 INTO @ProjectID
    WHILE @@FETCH_STATUS = 0
    BEGIN
    BEGIN TRY
        BEGIN TRAN

        DELETE FROM T_PROJECTGROUPSDATA td
        WHERE td.ID = @ProjectID

        DECLARE datasets CURSOR FOR /*Second cursor - this will get the 'collectionDate'field from datasetsTable for every project fetched in above cursor*/
            Select DataID, GroupID, CollectionDate
            FROM datasetsTable 
            WHERE datasetsTable.projectID = @ProjectID /*lets say this will fetch ten records for a single projectID*/

        OPEN datasets
        FETCH NEXT FROM datasets INTO @DataID, @GroupID, @CollectionDate
        WHILE @@FETCH_STATUS = 0
        BEGIN

            DECLARE period CURSOR FOR /*Third Cursor - this will process the records from another table called period with above fetched @collectionDate*/
            SELECT ID, dbo.fn_GetEndOfPeriod(ID) 
            FROM T_PERIODS
            WHERE DATEDIFF(dd,@CollectionDate,dbo.fn_GetEndOfPeriod(ID)) >= 0 /*lets say this will fetch 20 records for above fetched single @CollectionDate*/
            ORDER BY [YEAR],[Quarter]

                OPEN period
                FETCH NEXT FROM period INTO @PeriodID, @EndDate
                WHILE @@FETCH_STATUS = 0
                BEGIN
                    IF EXISTS (some conditions No - 1 )
                    BEGIN
                        BREAK
                    END
                    IF EXISTS (some conditions No - 2 )
                    BEGIN
                        FETCH NEXT FROM period INTO @PeriodID, @EndDate
                        CONTINUE
                    END

                    /*get the appropirate ID from T_uploads table for the current projectID and periodID fetched*/
                    SET @UploadID = (SELECT ID FROM T_UPLOADS u WHERE  u.project_savix_ID = @ProjectID AND u.PERIOD_ID = @PeriodID AND u.STATUS = 3)

                    /*Update some fields in T_uploads table for the current projectID and periodID fetched*/
                    UPDATE T_uploads
                    SET fieldA = mp.fieldA, fieldB = mp.fieldB
                    FROM @projects mp
                    WHERE T_UPLOADS.ID = @UploadID AND mp.ProjectID = @ProjectID

                    /*Insert some records in T_PROJECTGROUPSDATA table for the current projectID and periodID fetched*/
                    INSERT INTO T_PROJECTGROUPSDATA tpd ( fieldA,fieldB,fieldC,fieldD,uploadID)
                    SELECT fieldA,fieldB,fieldC,fieldD,@UploadID
                    FROM @projects
                    WHERE tpd.DataID = @DataID

                FETCH NEXT FROM period INTO @PeriodID, @EndDate
                END
                CLOSE period
                DEALLOCATE period

            FETCH NEXT FROM datasets INTO @DataID, @GroupID, @CollectionDate, @Status, @Createdate
        END

        CLOSE datasets
        DEALLOCATE datasets

        COMMIT
    END TRY

    BEGIN CATCH
        Error handling
        IF @@TRANCOUNT > 0
            ROLLBACK
    END CATCH
    FETCH NEXT FROM projects1 INTO @ProjectID, @FAID
END


CLOSE projects1
DEALLOCATE projects1

SELECT 1 as success

I request you to suggest any methods to rewrite this code to follow the SET based approach.

回答1:

Until the table structure and expected result sample data is not provided, here are a few quick things I see that can be improved (some of those are already mentioned by others above):

  1. WHILE Loop is also a cursor. So, changing into to while loop is not going make things any faster.
  2. Use LOCAL FAST_FORWARD cursor unless you have need to back track a record. This would make the execution much faster.
  3. Yes, I agree that having a SET based approach would be the fastest in most cases, however if you must store intermediate resultset somewhere, I would suggest using a temp table instead of a table variable. Temp table is 'lesser evil' between these 2 options. Here are a few reason why you should try to avoid using a table variable:

    • Since SQL Server would not have any prior statistics on the table variable during building on Execution Plan, it will always consider that only one record would be returned by the table variable during construction of the execution plan. And accordingly Storage Engine would assign only as much RAM memory for execution of the query. But in reality, there could be millions of records that the table variable might hold during execution. If that happens, SQL Server would be forced spill the data to hard disk during execution (and you will see lots of PAGEIOLATCH in sys.dm_os_wait_stats) making the queries way slower.
    • One way to get rid of the above issue would be by providing statement level hint OPTION (RECOMPILE) at the end of each query where a table value is used. This would force SQL Server to construct the Execution Plan of those queries each time during runtime and the less memory allocation issue can be avoided. However the downside of this is: SQL Server will no longer be able to take advantage of an already cached execution plan for that stored procedure, and would require recompilation every time, which would deteriorate the performance by some extent. So, unless you know that data in the underlying table changes frequently or the stored procedure itself is not frequently executed, this approach is not recommended by Microsoft MVPs.


回答2:

Replacing Cursor with While blindly, is not a recommended option, hence it would not impact your performance and might even have negative impact on the performance.

When you define the cursor using Declare C Cursor in fact you are going to create a SCROLL cursor which specifies that all fetch options (FIRST, LAST, PRIOR, NEXT, RELATIVE, ABSOLUTE) are available.

When you need just Fetch Next as scroll option, you can declare the cursor as FAST_FORWARD

Here is the quote about FAST_FORWARD cursor in Microsoft docs:

Specifies that the cursor can only move forward and be scrolled from the first to the last row. FETCH NEXT is the only supported fetch option. All insert, update, and delete statements made by the current user (or committed by other users) that affect rows in the result set are visible as the rows are fetched. Because the cursor cannot be scrolled backward, however, changes made to rows in the database after the row was fetched are not visible through the cursor. Forward-only cursors are dynamic by default, meaning that all changes are detected as the current row is processed. This provides faster cursor opening and enables the result set to display updates made to the underlying tables. While forward-only cursors do not support backward scrolling, applications can return to the beginning of the result set by closing and reopening the cursor.

So you can declare your cursors using DECLARE <CURSOR NAME> FAST_FORWARD FOR ... and you will get noticeable improvements



回答3:

I think all the cursors code above can be simplified to something like this:

DROP TABLE IF EXISTS #Source;
SELECT DISTINCT p.ProjectID,p.fieldA,p.fieldB,p.fieldC,p.fieldD,u.ID AS [UploadID]
INTO #Source
FROM ProjectTable p
INNER JOIN DatasetsTable d ON d.ProjectID = p.ProjectID
INNER JOIN T_PERIODS s ON DATEDIFF(DAY,d.CollectionDate,dbo.fn_GetEndOfPeriod(s.ID)) >= 0 
INNER JOIN T_UPLOADS u ON u.roject_savix_ID = p.ProjectID AND u.PERIOD_ID = s.ID AND u.STATUS = 3
WHERE NOT EXISTS (some conditions No - 1)
    AND NOT EXISTS (some conditions No - 2)
;

UPDATE u SET u.fieldA = s.fieldA, u.fieldB = s.fieldB
FROM T_UPLOADS u
INNER JOIN #Source s ON s.UploadID = u.ID
;
INSERT INTO T_PROJECTGROUPSDATA (fieldA,fieldB,fieldC,fieldD,uploadID)
SELECT DISTINCT s.fieldA,s.fieldB,s.fieldC,s.fieldD,s.UploadID
FROM #Source s
;

DROP TABLE IF EXISTS #Source;

Also it would be nice to know "some conditions No" details as query can differ depends on that.