I am not a DBA, but I do work for a small company as the IT person. I have to replicate a database from staging to production. I have created an SSIS package to do this, but it takes hours to run. This isn't a large data-warehouse-type project, either; it's a pretty straightforward upsert. I'm assuming that I am the weak link in how I designed it.
Here's my procedure:
- Truncate staging tables (`Execute SQL Task`)
- Pull data from a development table into staging (`Data Flow Task`)
- Run a data flow task: `OLE DB Source` → `Conditional Split Transformation` (condition used: `!ISNULL(is_new_flag)`) → if new, insert; if existing, update
The data flow task is duplicated a few times with different tables and values, but the flow is the same. I've read several things, from OLE DB components being slow to updates being slow, and have tried a few things, but haven't gotten it to run very quickly.
I'm not sure what other details to give, but I can give anything that's asked for.
Here is a sample package, written in SSIS 2008 R2, that illustrates how to perform inserts and updates between two databases using batch operations.
- Using `OLE DB Command` will slow down the update operations in your package because it does not perform batch operations; every row is updated individually.
The sample uses two databases, named `Source` and `Destination`. In my example both databases reside on the same server, but the logic still applies to databases residing on different servers and locations. I created a table named `dbo.SourceTable` in my source database `Source`.
CREATE TABLE [dbo].[SourceTable](
[RowNumber] [bigint] NOT NULL,
[CreatedOn] [datetime] NOT NULL,
[ModifiedOn] [datetime] NOT NULL,
[IsActive] [bit] NULL
)
I also created two tables, named `dbo.DestinationTable` and `dbo.StagingTable`, in my destination database `Destination`.
CREATE TABLE [dbo].[DestinationTable](
[RowNumber] [bigint] NOT NULL,
[CreatedOn] [datetime] NOT NULL,
[ModifiedOn] [datetime] NOT NULL
)
GO
CREATE TABLE [dbo].[StagingTable](
[RowNumber] [bigint] NOT NULL,
[CreatedOn] [datetime] NOT NULL,
[ModifiedOn] [datetime] NOT NULL
)
GO
I inserted about 1.4 million rows into the table `dbo.SourceTable`, with unique values in the `RowNumber` column. The tables `dbo.DestinationTable` and `dbo.StagingTable` were empty to begin with. All the rows in `dbo.SourceTable` have the flag `IsActive` set to false.
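The original post doesn't show how the test data was generated; a population script along these lines (a sketch of my own, using a cross join of system views purely as a row generator) would produce data of that shape:

```sql
-- Generate ~1.4 million rows with unique RowNumber values and IsActive = 0.
-- sys.all_objects is used only as a convenient row source; any numbers
-- table or tally technique would work equally well.
;WITH Numbers AS (
    SELECT TOP (1400000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
)
INSERT INTO dbo.SourceTable (RowNumber, CreatedOn, ModifiedOn, IsActive)
SELECT N, GETDATE(), GETDATE(), 0
FROM Numbers;
```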
I created an SSIS package with two OLE DB connection managers, connecting to the `Source` and `Destination` databases, and designed the Control Flow as shown below:
The first `Execute SQL Task` executes the statement `TRUNCATE TABLE dbo.StagingTable` against the destination database to truncate the staging table. The next section explains how the `Data Flow Task` is configured.
The second `Execute SQL Task` executes the SQL statement given below, which updates the data in `dbo.DestinationTable` using the data available in `dbo.StagingTable`, assuming there is a unique key that matches between those two tables. In this case, the unique key is the column `RowNumber`.
Script to update:
UPDATE D
SET D.CreatedOn = S.CreatedOn
, D.ModifiedOn = S.ModifiedOn
FROM dbo.DestinationTable D
INNER JOIN dbo.StagingTable S
ON D.RowNumber = S.RowNumber
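As a side note: SQL Server 2008 also supports `MERGE`, so the same set-based update against the staging table could be written as follows (a sketch assuming the same three-column schema; the original package uses the `UPDATE ... JOIN` above):

```sql
-- Equivalent batch update expressed with MERGE (SQL Server 2008+).
-- Only WHEN MATCHED is needed here, because new rows are inserted
-- directly by the data flow, not by this statement.
MERGE dbo.DestinationTable AS D
USING dbo.StagingTable AS S
    ON D.RowNumber = S.RowNumber
WHEN MATCHED THEN
    UPDATE SET D.CreatedOn  = S.CreatedOn,
               D.ModifiedOn = S.ModifiedOn;
```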
I have designed the Data Flow Task as shown below.
The `OLE DB Source` reads data from `dbo.SourceTable` using the SQL command `SELECT RowNumber, CreatedOn, ModifiedOn FROM Source.dbo.SourceTable WHERE IsActive = 1`.

A `Lookup transformation` is used to check whether the `RowNumber` value already exists in the table `dbo.DestinationTable`.

If the record does not exist, it is redirected to the `OLE DB Destination` named `Insert into destination table`, which inserts the row into `dbo.DestinationTable`.

If the record exists, it is redirected to the `OLE DB Destination` named `Insert into staging table`, which inserts the row into `dbo.StagingTable`. The data in the staging table is then used by the second `Execute SQL Task` to perform the batch update.
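The Lookup transformation's reference query isn't shown in the post; since only the join key is needed for the match, a minimal query such as the following (my assumption) keeps the lookup cache small:

```sql
-- Reference data set for the Lookup transformation.
-- Selecting only the key column reduces memory used by the lookup cache.
SELECT RowNumber
FROM dbo.DestinationTable;
```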
To activate some rows for the `OLE DB Source`, I ran the query below:
UPDATE dbo.SourceTable
SET IsActive = 1
WHERE (RowNumber % 9 = 1)
OR (RowNumber % 9 = 2)
The first execution of the package looked as shown below. All the rows were directed to the destination table because it was empty. The execution of the package on my machine took about 3 seconds. I ran the row count query again to find the row counts in all three tables.
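The row count query itself is not shown in the post; something along these lines (my own sketch, assuming the databases sit on the same server) would produce the counts:

```sql
-- Row counts across all three tables involved in the package.
SELECT 'SourceTable'      AS TableName, COUNT(*) AS [RowCount] FROM Source.dbo.SourceTable
UNION ALL
SELECT 'DestinationTable' AS TableName, COUNT(*) AS [RowCount] FROM Destination.dbo.DestinationTable
UNION ALL
SELECT 'StagingTable'     AS TableName, COUNT(*) AS [RowCount] FROM Destination.dbo.StagingTable;
```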
To activate a few more rows for the `OLE DB Source`, I ran the query below:
UPDATE dbo.SourceTable
SET IsActive = 1
WHERE (RowNumber % 9 = 3)
OR (RowNumber % 9 = 5)
OR (RowNumber % 9 = 6)
OR (RowNumber % 9 = 7)
The second execution of the package looked as shown below. The 314,268 rows that had been inserted during the first execution were redirected to the staging table, and 628,766 new rows were inserted directly into the destination table. The execution of the package on my machine took about 12 seconds. The 314,268 rows in the destination table were then updated by the second `Execute SQL Task` using the data in the staging table. I ran the row count query again to find the row counts in all three tables.
I hope that gives you an idea to implement your solution.
The two things I'd look at are your inserts (ensure you are using either "Table or View - fast load" or "Table name or view name variable - fast load") and your updates.
As you have correctly determined, the update logic is usually where performance falls down, and that is due to the OLE DB component firing singleton updates for each row flowing through it. The usual approach people take to overcome this is to write all the updates to a staging table, much as your insert logic does, then follow your `Data Flow Task` with an `Execute SQL Task` to perform a bulk update.
If you are open to acquiring third-party tools, PragmaticWorks offers an Upsert destination.