Finding updated records in SSIS — to hash or not to hash?

Published 2019-01-20 07:01

Question:

I'm working on migrating data from a table in a DB2 database to our SQL Server database using SSIS. The table I am pulling data from contains a respectable amount of data: a little less than 100,000 records, but it also has 46 columns.

I only want to update the rows that NEED to be updated, so I came to the conclusion that I could either use a Lookup Transformation to check all 46 columns and redirect the "no match" rows to be updated in the SQL table, or hash each row in both datasets after reading the data in at the beginning of my Data Flow Task and then use the hash values as the comparison later on when determining whether the rows are equal.

My question would be: Which is the better route to take? I like hashing them, but I'm not sure if that is the best route to take. Does anyone have any pearls of wisdom they'd like to share?

Answer 1:

Why not both?

Generally speaking, there are two things we look for when doing an incremental load: does this row exist, and if it exists, has it changed? If there's a single column to check, it's trivial. When there are many columns to check, it becomes quite the pain, especially if you're using SSIS and have to map all those columns and/or worry about NULLs.

I solve the multicolumn problem by cheating: I create two columns in all my tables, HistoricalHashKey and ChangeHashKey. The historical hash key covers all the business keys; the change hash key covers the rest of the material columns (I'd exclude things like audit columns). We are not storing the concatenated values directly in our hash columns. Instead, "we're going Math the stuff out of it" and apply a hashing algorithm called SHA-1. This algorithm takes all the input columns and returns a 20-byte output.
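(A quick illustration of my own, not part of the original package: HASHBYTES with the SHA1 algorithm always returns 20 bytes, which is why the hash columns are typed binary(20).)

SELECT
    HASHBYTES('SHA1', 'AdventureWorks') AS HashValue
,   DATALENGTH(HASHBYTES('SHA1', 'AdventureWorks')) AS ByteCount; -- always 20 for SHA1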

There are three caveats to using this approach: you must concatenate the columns in the same order every time, the comparison is case sensitive, and trailing space is significant. That's it.
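To see those caveats in action, here is a small sketch of my own: the same text with different casing, or with a trailing space, hashes to a different value.

SELECT
    HASHBYTES('SHA1', 'abc')  AS LowerCase
,   HASHBYTES('SHA1', 'ABC')  AS UpperCase      -- differs from LowerCase
,   HASHBYTES('SHA1', 'abc ') AS TrailingSpace; -- differs from LowerCase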

In your tables, you would add those two columns as binary(20) NOT NULL.
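A minimal DDL sketch (my own; the table name matches the destination used later):

ALTER TABLE dbo.DimProduct
    ADD HistoricalHashKey binary(20) NOT NULL
    ,   ChangeHashKey     binary(20) NOT NULL;
-- Note: adding NOT NULL columns to an already-populated table requires
-- DEFAULT constraints or a backfill first.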

Set up

Your control flow would look something like this [control flow screenshot]

and your data flow something like this [data flow screenshot]

OLESRC Incremental Data

(Assume I'm sourced from AdventureWorks2014, Production.Product.) I'm going to use the CONCAT function from SQL Server 2012+ as it promotes all data types to string and is NULL safe.

SELECT
    P.ProductID
,   P.Name
,   P.ProductNumber
,   P.MakeFlag
,   P.FinishedGoodsFlag
,   P.Color
,   P.SafetyStockLevel
,   P.ReorderPoint
,   P.StandardCost
,   P.ListPrice
,   P.Size
,   P.SizeUnitMeasureCode
,   P.WeightUnitMeasureCode
,   P.Weight
,   P.DaysToManufacture
,   P.ProductLine
,   P.Class
,   P.Style
,   P.ProductSubcategoryID
,   P.ProductModelID
,   P.SellStartDate
,   P.SellEndDate
,   P.DiscontinuedDate

,   P.rowguid
,   P.ModifiedDate

-- Hash my business key(s)
,   CONVERT(binary(20), HASHBYTES('SHA1',
            CONCAT
            (
            -- Having an empty string as the first argument
            -- allows me to simplify building of column list
                ''
            ,   P.ProductID
            )
        )
    ) AS HistoricalHashKey

-- Hash the remaining columns
,   CONVERT(binary(20), HASHBYTES('SHA1',
            CONCAT
            (      
                ''
            ,   P.Name
            ,   P.ProductNumber
            ,   P.MakeFlag
            ,   P.FinishedGoodsFlag
            ,   P.Color
            ,   P.SafetyStockLevel
            ,   P.ReorderPoint
            ,   P.StandardCost
            ,   P.ListPrice
            ,   P.Size
            ,   P.SizeUnitMeasureCode
            ,   P.WeightUnitMeasureCode
            ,   P.Weight
            ,   P.DaysToManufacture
            ,   P.ProductLine
            ,   P.Class
            ,   P.Style
            ,   P.ProductSubcategoryID
            ,   P.ProductModelID
            ,   P.SellStartDate
            ,   P.SellEndDate
            ,   P.DiscontinuedDate
            )
        )
    ) AS ChangeHashKey
FROM
    Production.Product AS P;
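As an aside (my own illustration), this is why CONCAT matters here: it implicitly converts every argument to a string and treats NULL as an empty string, whereas the + operator would turn the whole expression NULL under the default CONCAT_NULL_YIELDS_NULL setting.

SELECT
    CONCAT('', NULL, 1, 'a') AS ConcatResult                 -- '1a'
,   '' + NULL + CONVERT(varchar(10), 1) + 'a' AS PlusResult; -- NULL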

LKP Check Existence

This query will pull back the stored HistoricalHashKey and ChangeHashKey from our reference table.

SELECT
    DP.HistoricalHashKey
,   DP.ChangeHashKey
FROM
    dbo.DimProduct AS DP;

At this point, it's a simple matter to compare the HistoricalHashKeys to determine whether the row exists. If we match, we want to pull back the ChangeHashKey into our Data Flow. By convention, I name this lkp_ChangeHashKey to differentiate from the source ChangeHashKey.
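Relationally, the Lookup with its no-match output redirected behaves like a left outer join on the business-key hash. Here is a sketch of the equivalent set logic (my own illustration, not something the package executes; #IncrementalData stands in for the rows flowing down the pipeline):

SELECT
    SRC.*
,   DP.ChangeHashKey AS lkp_ChangeHashKey -- NULL here means the row does not exist yet
FROM
    #IncrementalData AS SRC
    LEFT OUTER JOIN
        dbo.DimProduct AS DP
        ON DP.HistoricalHashKey = SRC.HistoricalHashKey;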

CSPL Change Detection

The conditional split is also simplified. Either the two Change Hash keys match (no change) or they don’t (changed). That expression would be

ChangeHashKey == lkp_ChangeHashKey

OLE_DST StagedUpdates

Rather than use the OLE DB Command, create a dedicated table for holding the rows that need to be updated. The OLE DB Command does not scale well because, behind the scenes, it issues a singleton UPDATE for every row that passes through it.
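What could that staging table look like? A sketch under my own assumptions: mirror the material columns of the destination and carry the two hash keys (only two columns shown; types follow AdventureWorks).

CREATE TABLE Stage.DimProduct
(
    Name nvarchar(50) NOT NULL
,   ProductNumber nvarchar(25) NOT NULL
    -- ...the remaining material columns from the source query...
,   HistoricalHashKey binary(20) NOT NULL
,   ChangeHashKey binary(20) NOT NULL
);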

SQL Perform Set Based Updates

After the data flow is complete, all the data that needs updating will be in our staging table. This Execute SQL Task simply updates the existing data matching on our business keys.

UPDATE
    TGT
SET
    Name = SRC.Name
,   ProductNumber = SRC.ProductNumber
    -- ...repeat for each remaining material column...
,   ChangeHashKey = SRC.ChangeHashKey -- keep the stored hash in sync for the next run
FROM
    dbo.DimProduct AS TGT
    INNER JOIN
        Stage.DimProduct AS SRC
        ON SRC.HistoricalHashKey = TGT.HistoricalHashKey;
        -- If clustered on a single column and table is large, this will yield better performance
        -- ON SRC.DimProductSK = TGT.DimProductSK;
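One housekeeping step belongs in the control flow before the data flow runs (my assumption, consistent with the staging pattern above): empty the staging table so each run applies only the current batch of changes.

TRUNCATE TABLE Stage.DimProduct;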

From the comments

Why do I use dedicated INSERT and UPDATE statements when we have the shiny MERGE? Besides not remembering the syntax as easily, the SQL Server implementation can have some ... unintended consequences. They may be cornerish cases, but I'd rather not run into them with the solutions I deliver. Explicit INSERT and UPDATE statements give me the fine-grained control I want and need in my solutions. I love SQL Server and think it's a fantastic product, but the weird syntax coupled with known bugs keeps me from using MERGE anywhere but a certification exam.



Tags: ssis