How do I optimize Upsert (Update and Insert) operations?

Published 2019-01-19 00:40

Question:

I am not a DBA but I do work for a small company as the IT person. I have to replicate a database from staging to production. I have created an SSIS package to do this but it takes hours to run. This isn't a large data warehouse type of project, either, it's a pretty straightforward Upsert. I'm assuming that I am the weak link in how I designed it.

Here's my procedure:

  1. Truncate staging tables (EXECUTE SQL TASK)
  2. Pull data from a development table into staging (Data Flow Task)
  3. Run a data flow task
    1. OLE DB Source
    2. Conditional Split Transformation (condition used: !ISNULL(is_new_flag))
    3. If new, insert; if existing, update

The data flow task is duplicated a few times with different tables and values, but the flow is the same. I've read in several places that OLE DB components make updates slow, and I've tried a few things, but I haven't gotten the package to run much faster.

I'm not sure what other details to give, but I can give anything that's asked for.

Answer 1:

Here is a sample package, written in SSIS 2008 R2, that illustrates how to perform inserts and updates between two databases using batch operations.

  • Using the OLE DB Command transformation will slow down the update operations in your package because it does not perform batch operations: every row is updated individually.

The sample uses two databases, named Source and Destination. In my example, both databases reside on the same server, but the same logic applies to databases on different servers and locations.

I created a table named dbo.SourceTable in my source database Source.

CREATE TABLE [dbo].[SourceTable](
    [RowNumber] [bigint] NOT NULL,
    [CreatedOn] [datetime] NOT NULL,
    [ModifiedOn] [datetime] NOT NULL,
    [IsActive] [bit] NULL
)

I also created two tables, named dbo.DestinationTable and dbo.StagingTable, in my destination database Destination.

CREATE TABLE [dbo].[DestinationTable](
    [RowNumber] [bigint] NOT NULL,
    [CreatedOn] [datetime] NOT NULL,
    [ModifiedOn] [datetime] NOT NULL
) 
GO

CREATE TABLE [dbo].[StagingTable](
    [RowNumber] [bigint] NOT NULL,
    [CreatedOn] [datetime] NOT NULL,
    [ModifiedOn] [datetime] NOT NULL
) 
GO

I inserted about 1.4 million rows into the table dbo.SourceTable, with unique values in the RowNumber column. The tables dbo.DestinationTable and dbo.StagingTable were empty to begin with. All the rows in dbo.SourceTable started with the flag IsActive set to false.
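The original post does not show how the test rows were generated. As a rough sketch, a set-based insert like the one below would do it; the 1,400,000 row count, the cross join over sys.all_objects, and the GETDATE() values are illustrative assumptions, not part of the original setup:

;WITH Numbers AS (
    -- Generate 1.4 million sequential numbers (assumed seeding method)
    SELECT TOP (1400000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM   sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO dbo.SourceTable (RowNumber, CreatedOn, ModifiedOn, IsActive)
SELECT n, GETDATE(), GETDATE(), 0   -- all rows start inactive
FROM   Numbers;

Any method that yields unique RowNumber values with IsActive = 0 would serve equally well.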

I created an SSIS package with two OLE DB connection managers, one connecting to the Source database and the other to the Destination database, and designed the Control Flow as shown below:

  • The first Execute SQL Task executes the statement TRUNCATE TABLE dbo.StagingTable against the destination database to truncate the staging table.

  • The next section explains how the Data Flow Task is configured.

  • The second Execute SQL Task executes the SQL statement below, which updates the data in dbo.DestinationTable using the data available in dbo.StagingTable, assuming there is a unique key that matches between those two tables. In this case, the unique key is the column RowNumber.

Script to update:

UPDATE      D 
SET         D.CreatedOn = S.CreatedOn
        ,   D.ModifiedOn = S.ModifiedOn 
FROM        dbo.DestinationTable D 
INNER JOIN  dbo.StagingTable S 
ON          D.RowNumber = S.RowNumber
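On SQL Server 2008 and later, the same staged update can also be expressed as a single MERGE statement. This is an alternative sketch, not part of the package described above; it produces the same result as the UPDATE ... JOIN for matched rows:

MERGE dbo.DestinationTable AS D
USING dbo.StagingTable AS S
    ON D.RowNumber = S.RowNumber
WHEN MATCHED THEN
    UPDATE SET D.CreatedOn  = S.CreatedOn,
               D.ModifiedOn = S.ModifiedOn;

MERGE could also handle the insert path (WHEN NOT MATCHED THEN INSERT), but the package keeps inserts in the data flow because the fast-load OLE DB Destination is already batch-oriented.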

I have designed the Data Flow Task as shown below.

  • The OLE DB Source reads data from dbo.SourceTable using the SQL command SELECT RowNumber, CreatedOn, ModifiedOn FROM Source.dbo.SourceTable WHERE IsActive = 1

  • Lookup transformation is used to check if the RowNumber value already exists in the table dbo.DestinationTable

  • If the record does not exist, it will be redirected to the OLE DB Destination named as Insert into destination table, which inserts the row into dbo.DestinationTable

  • If the record exists, it is redirected to the OLE DB Destination named Insert into staging table, which inserts the row into dbo.StagingTable. The data in the staging table is then used by the second Execute SQL Task to perform the batch update.
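The Lookup transformation's reference query is not shown in the original; assuming it is configured with a SQL query rather than a full table reference, something like the following keeps the lookup cache small by selecting only the join key:

-- Sketch of the Lookup reference query (assumed configuration):
-- only the join key is needed to decide insert vs. update.
SELECT RowNumber
FROM   dbo.DestinationTable;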

To activate some rows for the OLE DB Source, I ran the query below:

UPDATE  dbo.SourceTable 
SET     IsActive = 1 
WHERE   (RowNumber % 9 = 1) 
OR      (RowNumber % 9 = 2)

On the first execution of the package, all the rows were directed to the destination table because it was empty. The execution took about 3 seconds on my machine.

I ran the row count query again to find the row counts in all three tables.
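The row count query itself is not shown in the original; a sketch of what it might look like, assuming the three tables described above:

-- Sketch: row counts across the three tables (names assumed from the setup)
SELECT 'SourceTable'       AS TableName, COUNT(*) AS RowCnt FROM Source.dbo.SourceTable
UNION ALL
SELECT 'DestinationTable',               COUNT(*)           FROM Destination.dbo.DestinationTable
UNION ALL
SELECT 'StagingTable',                   COUNT(*)           FROM Destination.dbo.StagingTable;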

To activate a few more rows for the OLE DB Source, I ran the query below:

UPDATE  dbo.SourceTable 
SET     IsActive = 1 
WHERE   (RowNumber % 9 = 3) 
OR      (RowNumber % 9 = 5) 
OR      (RowNumber % 9 = 6) 
OR      (RowNumber % 9 = 7)

On the second execution of the package, the 314,268 rows that had been inserted during the first execution were redirected to the staging table, while 628,766 new rows were inserted directly into the destination table. The execution took about 12 seconds on my machine. The 314,268 staged rows were then updated in the destination table by the second Execute SQL Task.

I ran the row count query again to find the row counts in all three tables.

I hope that gives you an idea of how to implement your solution.



Answer 2:

The two things I'd look at are your inserts (ensure you are using either the "Table or View - fast load" or "Table name or view name variable - fast load") and your updates.

As you have correctly determined, the update logic is usually where performance falls down and that is due to the OLE DB component firing singleton updates for each row flowing through it. The usual approach people take to overcome this is to write all the updates to a staging table, much as your Insert logic does. Then follow up your Data Flow Task with an Execute SQL Task to perform a bulk Update.
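The shape of that follow-up Execute SQL Task statement is the same as the one in the first answer. As a generic template, with the table, key, and column names here being placeholders rather than anything from your package:

-- Placeholder names: substitute your own target table, staging table,
-- key column, and updated columns.
UPDATE      T
SET         T.SomeColumn = S.SomeColumn
FROM        dbo.TargetTable T
INNER JOIN  dbo.UpdateStaging S
        ON  T.KeyColumn = S.KeyColumn;

Indexing the staging table's key column (or truncating and rebuilding it each run) helps keep this join cheap.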

If you are of a mind to acquire third-party tools, Pragmatic Works offers an Upsert destination.