I have a problem - I wish to download customer data via SSIS (Microsoft's ETL tool). For each customer record I import, I have to check against the existing customer records in my target DB. If the record exists, I have to UPDATE it, and if it doesn't, I have to INSERT it. What is the best strategy for performance? We're talking 20-30 million rows.
I thought of triggers (a trigger on INSERT that checks for the existence of the record), lookups, cursors, but I'm sure there's a better way, probably using one of the Data Flow transformations in SSIS?
Or is it better to import this into a staging table and then do the lookup locally?
Here is one possible option you can apply to load data into a destination table whether each row requires an insert or an update. The example below loads a text file containing 1 million rows into a SQL table in under 3 minutes. If a record doesn't exist in the SQL table, it is inserted directly into the destination table; otherwise it is inserted into a staging table, which is then used to update the destination table with the help of a stored procedure. The stored procedure also deletes any records that are present in the SQL table but not present in the flat file. The example uses SSIS 2008 R2 with a SQL Server 2008 R2 backend.
Step-by-step process:
1. In the SQL Server database, create two tables named dbo.ItemInfo and dbo.Staging. The CREATE TABLE queries are available under the Scripts section, and the structure of these tables is shown in screenshot #1. ItemInfo will hold the actual data, and the Staging table will hold the staging data used to compare against and update the actual records. The Id column in both tables is an auto-generated identity column. The IsProcessed column in ItemInfo will be used to identify and delete records that are no longer valid.

2. Create an SSIS package and create 5 variables as shown in screenshot #2. I have used the .txt extension for the tab-delimited files, hence the value *.txt in the FileExtension variable. The FilePath variable will be assigned its value at run time. The FolderLocation variable denotes where the files are located. The SQLPostLoad and SQLPreLoad variables denote the stored procedures used during the pre-load and post-load operations; scripts for these stored procedures are provided under the Scripts section.

3. Create an OLE DB connection pointing to the SQL Server database. Create a flat file connection as shown in screenshots #3 and #4. The Flat File Connection Columns section contains the column-level information, and screenshot #5 shows a preview of the column data.
4. Configure the Control Flow as shown in screenshot #6. Configure the Pre Load, Post Load and Loop Files tasks as shown in screenshots #7 - #10. Pre Load will truncate the Staging table and set the IsProcessed flag to false for all rows in the ItemInfo table. Post Load will apply the changes and delete rows in the database that are not found in the file. Refer to the stored procedures used in those tasks to understand what is being done in these Execute SQL tasks.

5. Double-click on the Load Items data flow task and configure it as shown in screenshot #11. Read File is a flat file source configured to use the flat file connection. Row Count is a derived column transformation; its configuration is shown in screenshot #12. Check Exist is a lookup transformation; its configuration is shown in screenshots #13 - #15. The Lookup No Match Output is redirected to Destination Split, and the Lookup Match Output is redirected to Staging Split.

6. Destination Split and Staging Split have exactly the same configuration, shown in screenshot #16. The reason for 9 different destinations for both the destination and staging tables is to improve the performance of the package.

7. All the destination tasks 0 - 8 are configured to insert data into the table dbo.ItemInfo as shown in screenshot #17. All the staging tasks 0 - 8 are configured to insert data into dbo.Staging as shown in screenshot #18.

8. On the flat file connection manager, set the ConnectionString property to use the FilePath variable as shown in screenshot #19. This enables the package to use the value set in the variable as it loops through each file in the folder.
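The exact CREATE TABLE scripts are under the Scripts section; as a rough illustration of the structures described in step 1 (only Id and IsProcessed come from the description above, the Item* columns and their types are assumptions):

```sql
-- Illustrative sketch only; the real column list is in the Scripts section / screenshot #1.
CREATE TABLE dbo.ItemInfo
(
    Id          INT IDENTITY(1, 1) NOT NULL PRIMARY KEY, -- auto-generated identity
    ItemId      VARCHAR(50)  NOT NULL,  -- assumed business key coming from the flat file
    ItemName    VARCHAR(255) NULL,      -- assumed data column
    IsProcessed BIT          NULL       -- marks rows confirmed by the latest file load
);

CREATE TABLE dbo.Staging
(
    Id       INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    ItemId   VARCHAR(50)  NOT NULL,
    ItemName VARCHAR(255) NULL
);
```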
Test scenarios:

The package was executed on a 64-bit machine with a single-core 2.5 GHz Xeon CPU and 3.00 GB RAM.

Loaded a flat file with 1 million rows. The package executed in about 2 min 47 sec. Refer to screenshots #20 and #21.

Used the queries provided under the Test queries section to modify the data and simulate updates, deletes and creation of new records for the second run of the package.

Loaded the same file containing the 1 million rows after the following queries were executed against the database. The package executed in about 1 min 35 sec. Refer to screenshots #22 and #23. Please note the number of rows redirected to the destination and staging tables in screenshot #22.

The above answer is a copy of the answer provided to this Stack Overflow question as well.

Hope that helps.
Test queries:
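The actual test queries are not reproduced in this copy; to simulate updates, new records and deletes on the second run, they would be along these lines (ItemId and ItemName are assumed column names):

```sql
-- Simulate updates: the file still holds the original values, so the second
-- run routes these rows through the staging table and updates them back.
UPDATE dbo.ItemInfo SET ItemName = ItemName + '_changed' WHERE Id % 100 = 0;

-- Simulate new records: delete some rows, so the second run re-inserts them.
DELETE FROM dbo.ItemInfo WHERE Id % 97 = 0;

-- Simulate deletes: add a row that is not in the file, so the post-load
-- procedure removes it (IsProcessed stays 0 because no file row matches it).
INSERT INTO dbo.ItemInfo (ItemId, ItemName, IsProcessed)
VALUES ('NOT_IN_FILE_1', 'Orphan row', 0);
```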
Flat File Connection Columns:
Scripts: (to create both tables and the stored procedures)
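The original scripts are not reproduced here. A minimal sketch of the pre-load and post-load stored procedures, assuming ItemId is the business key matched by the lookup and that the data flow inserts new and staged rows with IsProcessed = 1:

```sql
CREATE PROCEDURE dbo.PreLoad
AS
BEGIN
    -- Empty the staging table and treat every existing row as stale
    -- until the incoming file proves otherwise.
    TRUNCATE TABLE dbo.Staging;
    UPDATE dbo.ItemInfo SET IsProcessed = 0;
END
GO

CREATE PROCEDURE dbo.PostLoad
AS
BEGIN
    -- Apply staged changes to the matching destination rows and mark them live.
    UPDATE d
    SET    d.ItemName    = s.ItemName,
           d.IsProcessed = 1
    FROM   dbo.ItemInfo AS d
           INNER JOIN dbo.Staging AS s ON s.ItemId = d.ItemId;

    -- Rows never touched by this load no longer exist in the flat file.
    DELETE FROM dbo.ItemInfo WHERE IsProcessed = 0;
END
GO
```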
Screenshots #1 - #23: (images not available in this copy of the answer)