SSIS - Script Component, Split single row to multi

2019-05-01 23:46发布

问题:

Thanks in advance for your help. I'm in need of help on writing SSIS script component to delimit single row to multiple rows. There were many helpful blog and post I looked at below:

http://beyondrelational.com/ask/public/questions/1324/ssis-script-component-split-single-row-to-multiple-rows-parent-child-variation.aspx

http://bi-polar23.blogspot.com/2008/06/splitting-delimited-column-in-ssis.html

However, I need a little extra help on coding to complete the project. Basically here's what I want to do.

Input data


    ID               Item Name
    1                Apple01,02,Banana01,02,03
    2                Spoon1,2,Fork1,2,3,4

Output data


    ParentID      ChildID          Item Name
    1             1                Apple01
    1             2                Apple02
    1             3                Banana01
    1             4                Banana02
    1             5                Banana03
    2             1                Spoon1
    2             2                Spoon2
    2             3                Fork1
    2             4                Fork2
    2             5                Fork3
    2             6                Fork4

Below is my attempt to code, but feel free to revise whole if it's illogic. SSIS Asynchronous output is set.

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    Dim posID As Integer, childID As Integer
    Dim delimiter As String = ","
    Dim txtHolder As String, suffixHolder As String
    Dim itemName As String = Row.ItemName
    Dim keyField As Integer = Row.ID

    If Not (String.IsNullOrEmpty(itemList)) Then

        Dim inputListArray() As String = _
        itemList.Split(New String() {delimiter}, _
        StringSplitOptions.RemoveEmptyEntries)

        For Each item As String In inputListArray
            Output0Buffer.AddRow()
            Output0Buffer.ParentID = keyField

            If item.Length >= 3 Then
                txtHolder = Trim(item)
                Output0Buffer.ItemName = txtHolder

                'when item length is less than 3, it's suffix
            Else
                suffixHolder = Trim(item)
                txtHolder = Left(txtHolder.ToString(), Len(txtHolder) _
                    - Len(suffixHolder)) & suffixHolder.ToString()
                Output0Buffer.ItemName = txtHolder
            End If
        Next
    End If
End Sub

The current code produces the following output

ID           Item Name
1            Apple01
1            02
1            Banana01
1            02
1            03
2            Spoon1
2            2
2            Fork1
2            2
2            3
2            4

回答1:

If I come across as pedantic in this response, it is not my intention. Based on the comment "I'm new at coding and having a problem troubleshooting" I wanted to walk through my observations and how I came to them.

Problem analysis

The desire is to split a single row into multiple output rows based on a delimited field associated to the row.

The code as it stands now is generating the appropriate number of rows so you do have the asynchronous part (split) of the script working so that's a plus. What needs to happen is we need to 1) Populate the Child ID column 2) Apply the item prefix to all subsequent row when generating the child items.

I treat most every problem like that. What am I trying to accomplish? What is working? What isn't working? What needs to be done to make it work. Decomposing problems into smaller and smaller problems will eventually result in something you can do.

Code observations

Pasting in the supplied code resulted in an error that itemList was not declared. Based on usage, it seems that it was intended to be itemName.

After fixing that, you should notice the IDE indicating you have 2 unused variables (posID, childID) and that the variable txHolder is used before it's been assigned a value. A null reference exception could result at runtime. My coworker often remarks warnings are errors that haven't grown up yet so my advice to you as a fledgling developer is to pay attention to warnings unless you explicitly expect the compiler to warn you about said scenario.

Getting started

With a choice between solving the Child ID situation versus the name prefix/suffix stuff, I'd start with an easy one, the child id

Generating a surrogate key

That's the fancy title phrase that if you searched on you'd have plenty of hits to ssistalk or sqlis or any of a number of fabulously smart bloggers. Devil of course is knowing what to search on. No where do you ever compute or assign the child id value to the stream which of course is why it isn't showing up there.

We simply need to generate a monotonically increasing number which resets each time the source id changes. I am making an assumption that the inbound ID is unique in the incoming data like a sales invoice number would be unique and we are splitting out the items purchased. However if those IDs were repeated in the dataset, perhaps instead of representing invoice numbers they are salesperson id. Sales Person 1 could have another row in the batch selling vegetables. That's a more complex scenario and we can revisit if that better describes your source data.

There are two parts to generating our surrogate key (again, break problems down into smaller pieces). The first thing to do is make a thing that counts up from 1 to N. You have defined a childId variable to serve this. Initialize this variable (1) and then increment it inside your foreach loop.

Now that we counting, we need to push that value onto the output stream. Putting those two steps together would look like

        childID = 1
        For Each item As String In inputListArray
            Output0Buffer.AddRow()
            Output0Buffer.ParentId = keyField
            Output0Buffer.ChildId = childID
            ' There might be VB shorthand for ++
            childID = childID + 1

Run the package and success! Scratch the generate surrogate key off the list.

String mashing

I don't know of a fancy term for what needs to be done in the other half of the problem but I needed some title for this section. Given the source data, this one might be harder to get right. You've supplied value of Apple01, Banana01, Spoon1, Fork1. It looks like there's a pattern there (name concatenated with a code) but what it is it? Your code indicates that if it's less than 3, it's a suffix but how do you know what the base is? The first row uses a leading 0 and is two digits long while the second row does not use a leading zero. This is where you need to understand your data. What is the rule for identifying the "code" part of the first row? Some possible algorithms

  • Force your upstream data providers to provide consistent length codes (I think this has worked once in my 13 years but it never hurts to push back against the source)
  • Assuming code is always digits, evaluate each character in reverse order testing whether it can be cast to an integer (Handles variable length codes)
  • Assume the second element in the split array will provide the length of the code. This is the approach you are taking with your code and it actually works.

I made no changes to make the generated item name work beyond fixing the local variables ItemName/itemList. Final code eliminates the warnings by removing PosID and initializing txtHolder to an empty string.

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    Dim childID As Integer
    Dim delimiter As String = ","
    Dim txtHolder As String = String.Empty, suffixHolder As String
    Dim itemName As String = Row.ItemName
    Dim keyField As Integer = Row.ID

    If Not (String.IsNullOrEmpty(itemName)) Then

        Dim inputListArray() As String = _
        itemName.Split(New String() {delimiter}, _
        StringSplitOptions.RemoveEmptyEntries)

        ' The inputListArray (our split out field)
        ' needs to generate values from 1 to N
        childID = 1
        For Each item As String In inputListArray
            Output0Buffer.AddRow()
            Output0Buffer.ParentId = keyField
            Output0Buffer.ChildId = childID
            ' There might be VB shorthand for ++
            childID = childID + 1

            If item.Length >= 3 Then
                txtHolder = Trim(item)
                Output0Buffer.ItemName = txtHolder
            Else
                'when item length is less than 3, it's suffix
                suffixHolder = Trim(item)
                txtHolder = Left(txtHolder.ToString(), Len(txtHolder) _
                    - Len(suffixHolder)) & suffixHolder.ToString()
                Output0Buffer.ItemName = txtHolder
            End If
        Next
    End If
End Sub