How to create an SSIS import raw file package?

2019-09-18 16:01发布

问题:

I am very new to SSIS. I am using SSIS 2008. I see that there are many tools which perform same function as some SQL operators. When should I use the SSIS tools vs. TSQL operators? Also, any advice on a more efficient solution here?

Below is the tsql query that I selected from the SSIS Import/Export wizard. So my current solution does not use any SSIS tools other than one Data flow source and one data flow destination.

SELECT
   en.uniqueid_c AS enrollment_id,
   CONVERT(nvarchar (20),c.clientcode_c) AS client_id, --Legacy CDT# (note this has to be the same value as client_id on the other tables)
   CONVERT(nvarchar (20),CASE --Program codes included here for enrollment data (Excludes enrollments with modifiers)
                         WHEN en.agency_c = 'ADO' THEN 'ADO'
                         WHEN en.agency_c = 'ADOT' THEN 'ADO'
                         WHEN en.agency_c in ('MRDD/IHS','MRDD/PSH','MRDD/REP','MRDD/RTC') THEN 'CHOICES'
                         WHEN en.agency_c = 'FPP' THEN 'CM' 
                         WHEN en.agency_c = 'CMGT' THEN 'CM'
                         WHEN en.agency_c = 'EDU' THEN 'EDU'
                         WHEN en.agency_c = 'COM' THEN 'GH'
                         WHEN en.agency_c = 'CSUP' THEN 'INT'
                         WHEN en.agency_c = 'IHS' THEN 'INT'
                         WHEN en.agency_c = 'IHST' THEN 'INT'
                         WHEN en.agency_c = 'MST' THEN 'MST'
                          WHEN en.agency_c = 'OMHS' THEN 'OMHS'
                        WHEN en.agency_c = 'ORTC' THEN 'RESA' 
                         WHEN en.agency_c = 'MRTC' THEN 'RESA' 
                         WHEN en.agency_c = 'RTC' THEN 'RESA'
                         WHEN en.agency_c = 'RFC' THEN 'RFC'
                         WHEN en.agency_c in ('SCCR','SCMN','SCRP') THEN 'SCS'
                         WHEN en.agency_c = 'SUB' THEN  'SUB' --uncertain about this one KMH - 06/23/10
                         WHEN en.agency_c = 'STFC' THEN 'TFC'
                         WHEN en.agency_c = 'MTFC' THEN 'TFC' 
                         WHEN en.agency_c = 'TFC' THEN 'TFC'
                         WHEN en.agency_c = 'TL' THEN 'TL'
                         WHEN en.agency_c = 'TLT' THEN 'TL'
                         ELSE en.agency_c
                         END) AS program_code,
-------------------------------------------------------------------------------------------------------------------------------------------                      
   --2nd Program_code entry handles program_modifier_code.
   --The codes need to be grouped and cased out to match the Evolv codes. --This was fixed.
   -- NOTE!!!! The codes below will need to be replaced with the finance modifiers just for TN. --per deneen and diane.
   UPPER(CONVERT(nvarchar(20), CASE --PROGRAM MODIFIERS --This is pulled into the program_code in the 2nd run. These should exclude non-modifiers.
                         WHEN en.agency_c in ('ADO','ADOT','IHST','TLT') THEN 'TRANS'
                         WHEN en.agency_c in ('CHOM','CGMT','COM','CSUP','EDU','IHS','LHS','MRTC','MST','MTFC','RTC', --use this to exclude recs
                                              'SCCR','SCRP','SUB','TFC','TL','ZADMIN','ZAWOL','ZDET','ZHOSP') THEN en.agency_c
                         WHEN en.agency_c = 'FPP' THEN 'FPP' 
                         WHEN en.agency_c = 'RFC' THEN 'RFC'
                         WHEN en.agency_c = 'TFC' THEN 'TFC'                    
                         WHEN en.agency_c = 'MRDD/IHS' THEN 'MRIHS'
                         WHEN en.agency_c = 'MRDD/PSH' THEN 'MRPSH'
                         WHEN en.agency_c = 'MRDD/REP' THEN 'MRREP'
                         WHEN en.agency_c = 'MRDD/RTC' THEN 'MRRTC'
                         WHEN en.agency_c = 'PD40' THEN 'PD40'
                         WHEN en.agency_c = 'PDET' THEN 'PDET'
                         WHEN en.agency_c = 'PINT' THEN 'PINT'
                         WHEN en.agency_c = 'PLV4' THEN 'PLV4'
                         WHEN en.agency_c = 'PWIL' THEN 'PWIL'
                         WHEN en.agency_c = 'PYDC' THEN 'PYDC'
                         WHEN en.agency_c = 'SCMN' THEN 'SCMN'
                         WHEN en.agency_c = 'STFC' THEN 'STFC'
                         ELSE en.agency_c                      
                         END)) AS program_modifier_code,
 -------------------------------------------------------------------------------------------------------------------------------------------  
 /*
 Group Homes and Inner Harbour locations were added on 7/26/10 - KMH
 */                   
   CONVERT(nvarchar(20),
   CASE
       WHEN en.location_c = 'ANNI' THEN 'AL-ANNI' 
       WHEN en.location_c = 'ASHE' THEN 'NC-ASHE'
       WHEN en.location_c = 'ATL' THEN 'GA-ATL'
       WHEN en.location_c = 'RMBT' THEN 'RTC-TN-BC'       
       WHEN en.location_c = 'BIL' THEN 'MS-BIL'
       WHEN en.location_c = 'BIRM' THEN 'AL-BIRM'
       WHEN en.location_c = 'BOST' THEN 'MA-BOST'
       WHEN en.location_c = 'CIRT' THEN 'RTC-TN-CIRT'
       WHEN en.location_c = 'CHAR' THEN 'NC-CHAR'
       WHEN en.location_c = 'CHAT' THEN 'TN-CHAT'
       WHEN en.location_c = 'CLAR' THEN 'TN-CHAR'
       WHEN en.location_c = 'COL' THEN 'TN-COL'
       WHEN en.location_c = 'CMS' THEN 'MS-COL'
       WHEN en.location_c = 'CCRD' THEN 'NC-CCRD'
       WHEN en.location_c = 'COOK' THEN 'TN-COOK'
       WHEN en.location_c = 'DAL' THEN 'TX-DAL'
       WHEN en.location_c = 'RDV' THEN 'RTC-TN-DV'
       WHEN en.location_c = 'DKSN' THEN 'TN-DKSN'
       WHEN en.location_c = 'RDW' THEN 'RTC-TN-DW'
       WHEN en.location_c = 'DOTH' THEN 'AL-DOTH'
       WHEN en.location_c = 'DUR' THEN 'NC-DURH'
       WHEN en.location_c = 'DYER' THEN 'TN-DYER'
       WHEN en.location_c = 'FAYE' THEN 'NC-FAYE'
       WHEN en.location_c = 'GCRT' THEN 'RTC-TN-GCRT'
       WHEN en.location_c = 'GRNB' THEN 'NC-GRNB'
       WHEN en.location_c = 'GRNV' THEN 'NC-GRNV'
       WHEN en.location_c = 'HMS' THEN 'MS-HMS'
       WHEN en.location_c = 'DMS' THEN 'MD-DMS'
       WHEN en.location_c = 'HICK' THEN 'NC-HICK'
       WHEN en.location_c = 'HILL' THEN 'NC-HILL'
       WHEN en.location_c = 'HUNT' THEN 'AL-HUNT'
       WHEN en.location_c = 'INNH' THEN 'RTC-GA-INNH'
       WHEN en.location_c = 'JMS' THEN 'MS-JMS'
       WHEN en.location_c = 'JTN' THEN 'TN-JTN'
       WHEN en.location_c = 'JCTN' THEN 'TN-JCTN'
       WHEN en.location_c = 'KNOX' THEN 'TN-KNOX'
       WHEN en.location_c = 'LAKE' THEN 'FL-LAKE'
       WHEN en.location_c = 'LAWR' THEN 'MA-LAWR'
       WHEN en.location_c = 'MANC' THEN 'NH-MANC'
       WHEN en.location_c = 'MCB' THEN 'MS-MCC'
       WHEN en.location_c = 'MEM' THEN 'TN-MEM'
       WHEN en.location_c = 'MMS' THEN 'MS-MMS'
       WHEN en.location_c = 'MIAM' THEN 'FL-MIAM'
       WHEN en.location_c = 'MIDM' THEN 'TN-MIDM'
       WHEN en.location_c = 'MOBI' THEN 'AL-MOBI'
       WHEN en.location_c = 'MONT' THEN 'AL-MONT'
       WHEN en.location_c = 'MRSN' THEN 'TN-MRSN'
       WHEN en.location_c = 'NASH' THEN 'TN-NASH'
       WHEN en.location_c = 'OCAL' THEN 'FL-OCAL'
       WHEN en.location_c = 'PAR' THEN 'TN-PAR'
       WHEN en.location_c = 'PINE' THEN 'NC-PINE'
       WHEN en.location_c = 'ROAN' THEN 'VA-ROAN'
       WHEN en.location_c = 'SPRG' THEN 'MA-SPRI/HOLY'
       WHEN en.location_c = 'PETE' THEN 'FL-STPET'
       WHEN en.location_c = 'TAMP' THEN 'FL-TAMP'
       WHEN en.location_c = 'TUP' THEN 'MS-TUP'
       WHEN en.location_c = 'WDC' THEN 'DC-WDC'
       WHEN en.location_c = 'WILM' THEN 'NC-WILM'
       WHEN en.location_c = 'WBRN' THEN 'MA-WBRN'
       WHEN en.location_c = 'WORC' THEN 'MA-WORC'
       WHEN en.location_c = 'GM' THEN 'TN-GM'
       WHEN en.location_c = 'GN' THEN 'TN-GN'
   ELSE en.location_c
   END)
       as service_facility_code,
   en.startdate_d AS start_date,
   en.enddate_d AS end_date,
   c.refdate_d AS referral_date,
   ep.enddate_d AS overall_discharge_date, --Episode end date
   CONVERT(nvarchar(20),c.altclientcode_vc) AS org_id,-- TNKIDS#
   UPPER(CONVERT(nvarchar(50), CASE
                         WHEN en.enddate_d = ep.enddate_d THEN ep.accountnumber_vc
                         WHEN en.enddate_d < ep.enddate_d THEN 'TWA'
                         END)) AS discharged_to_type,
   UPPER(CONVERT (nvarchar(20), CASE
                         WHEN ep.accountnumber_vc in ('DORM','INDEP/SUP','INDEP/SELF','INDEP/NR','INDEP/FR') THEN 07
                         WHEN ep.accountnumber_vc in ('JAIL','DET') THEN 01
                         WHEN ep.accountnumber_vc in ('BIOL') THEN 02
                         WHEN ep.accountnumber_vc in ('ADOPT/DCS','ADOPT/PAR','ADOPT/YV') THEN 06
                         WHEN ep.accountnumber_vc in ('REL') THEN 03
                         WHEN ep.accountnumber_vc in ('PSYCH','EMER','RTC') THEN 04
                         ELSE 99
                         END)) AS discharged_to_type_code,
   CONVERT(nvarchar(300),'cd.enrollments') AS original_table_name,
   CONVERT(nvarchar (400), en.alerts_vc) AS remarks,
   CONVERT(varchar(50),  CASE 
                         WHEN en.disreason_c = 'ADMI' THEN 'Administrative'
                         WHEN en.disreason_c = 'AMA' THEN 'Against Medical Advice'
                         WHEN en.disreason_c = 'AWOL' THEN 'Absent Without Leave'
                         WHEN en.disreason_c = 'DCSD' THEN 'Deceased'
                         WHEN en.disreason_c = 'JC' THEN 'Juvenille Court'
                         WHEN en.disreason_c = 'NP' THEN 'No Progress'
                         WHEN en.disreason_c = 'TMED' THEN 'Transfer to Medical Treatment Facility'
                         WHEN en.disreason_c = 'TPSY' THEN 'Transfer to Inpatient Psychiatric Facility'
                         WHEN en.disreason_c = 'TW' THEN 'Transfer within Agency'
                         WHEN en.disreason_c = 'WMA' THEN 'With Medical Advice'
                         ELSE 'Other'
                         END)AS outcome,
   CONVERT(varchar(5),   CASE
                         WHEN en.disreason_c in ('ADMI','AMA','AWOL','NP') THEN 'CBT'
                         WHEN en.disreason_c in ('DCSD','WMA') THEN 'DLR'
                         WHEN en.disreason_c in ('JC') THEN 'RSF'
                         WHEN en.disreason_c in ('TMED','TPSY') THEN 'DMR'
                         WHEN en.disreason_c in ('TW') THEN 'RPA'
                         ELSE 'CBT' 
                         END) AS outcome_code,
--Populate service_facility_unit table and add case statement for loading CDT program_c into client_enrollment room_number 7/27/10 KMH
   UPPER(CONVERT(varchar(10),  CASE
                         WHEN en.program_c = 'BT1L' THEN 'BC1L'
                         WHEN en.program_c = 'BT1R' THEN 'BC1R'
                         WHEN en.program_c = 'BT2L' THEN 'BC2L'
                         WHEN en.program_c = 'BT2R' THEN 'BC2R'
                         WHEN en.program_c = 'BT3' THEN 'BC3'
                         WHEN en.program_c = 'BT3L' THEN 'BC3L'
                         WHEN en.program_c = 'BT3R' THEN 'BC3R'
                         WHEN en.program_c = 'BT4L' THEN 'BC4L'
                         WHEN en.program_c = 'BT4R' THEN 'BC4R'
                         WHEN en.program_c = 'BT5' THEN 'BC5'
                         WHEN en.program_c = 'BT6' THEN 'BC6'
                         WHEN en.program_c = 'CRT1' and en.location_c = 'CIRT' THEN 'BCRT1'
                         WHEN en.program_c = 'CRT2' and en.location_c = 'CIRT' THEN 'BCRT2'
                         WHEN en.program_c = 'CRT3' and en.location_c = 'CIRT' THEN 'BCRT3'
                         WHEN en.program_c = 'CRT4' and en.location_c = 'CIRT' THEN 'BCRT4'
                         WHEN en.program_c = 'CRT1' and en.location_c = 'GCRT' THEN 'GCRT1'
                         WHEN en.program_c = 'CRT2' and en.location_c = 'GCRT' THEN 'GCRT2'
                         WHEN en.program_c = 'CRT3' and en.location_c = 'GCRT' THEN 'GCRT3'
                         WHEN en.program_c = 'CRT4' and en.location_c = 'GCRT' THEN 'GCRT4'
                         WHEN en.program_c = 'DVC' THEN 'DV1'
                         WHEN en.program_c = 'DVM' THEN 'DV2'
                         WHEN en.program_c = 'DVN' THEN 'DV3'
                         WHEN en.program_c = 'DVP' THEN 'DV4'
                         WHEN en.program_c in ('DW1','DW2','DW3','DW4','DW5','DW6','DW7','DW8') and en.location_c = 'RDW' THEN en.program_c
                         WHEN en.program_c in ('IH01','IH02','IH03','IH04','IH05','IH06','IH07') and en.location_c = 'INNH' THEN 'IH3'
                         WHEN en.program_c = 'IH08' and en.location_c = 'INNH' THEN 'IH1'
                         WHEN en.program_c = 'IH09' and en.location_c = 'INNH' THEN 'IH2'
                         ELSE 'NA'                       
                         END)) as room_number

FROM
   ar.client c 
   INNER JOIN cd.enrollments en ON (c.uniqueid_c = en.clientid_c)
   INNER JOIN cd.episode ep ON (ep.uniqueid_c = en.episodeid_c and ep.clientid_c = c.uniqueid_c)

WHERE
   (ep.enddate_d is NULL OR ep.enddate_d >= getdate()-730) and
   en.location_c in (select code from dbo.yv_LKUP_OfficeLocation where state in ('TX', 'FL'))
order by 2

回答1:

General purpose SSIS advice while I wait for clarification of needs.

When should I use the SSIS tools vs. TSQL operators

People are often tempted to use the out of the box transformations as it looks like the right thing to do. Select the table in the drop down list, add a sort, add another data source, sort that too, merge join, maybe an aggregate.

When the problem domain is small, say tens to hundreds of thousands, the difference in processing is negligible. If a package runs in 2 minutes instead of 1 or consumes 80% of server memory vs 40% during processing, people may not notice.

When the data volume hits the tipping point though, poor package design decisions are going to eat your lunch.

Sorts

When your source RDBMS has a request for sorted data, there may be a clustered index or something in the database that can save the time of actually sorting the data. When SSIS gets a request for sorted data, you're going to pay many times over for that operations.

A sort in SSIS is a fully-blocking, asynchronous operation. That means all the data flowing through that point must arrive at that transformation, be operated on before it can be sent downstream. Have a bajillion rows coming through or a really slow source, you'll really notice it when it hits one of these operations. Maybe you say, I can wait because I really need the data sorted but time's not the only resource you're spending. You are also doubling your memory requirement as asynchronous transformations require copying data from one buffer to another.

Maybe you still accept the costs time and memory usage for the ease of using an OOB item but you may not be finished paying. Your server has 32GB of memory and SSIS gets to use them all. Each row costs a thousand bytes and you have 16M rows of data flowing through your data flow. It hits the sort and data starts piling up. Once the last row arrives, you've consumed 16GB of memory for the original data. The sort operations starts to sort and it copies 16GB to another 16GB of memory and oops, SSIS is out of memory. You now pay the third price of temporary file storage. When the execution engine is under memory pressure, it will eventually start paging to disk. Once that happens, the game is over if you care about performance but your suffering may not be so. If you haven't set the BlobTempStoragePath value for each data flow, that file's going to be written to the default temporary storage location which is probably C:\something or other. Your systems admins sliced out a very lean C partition as only the OS goes on there so a 16GB swap file suddenly being written to that drive consumes all the available space and then the OS is going to get unhappy, the package fails and the finger pointing begins. Not that I've ever been there

Moral of the story

As much as possible, do everything in the source system. The above scenario is for Sorts but the lesson is applicable across all the "shared" operators.

Any advice on a more efficient solution here?

As to how you can clean up your query, those mappings would drive me insane. Any chance you can create N lookup tables (or inline table valued function) to provide a mapping between the stored value and the presented value? You could then abstract away all the case logic.

References

  • Blocking, Semi-blocking and non-blocking components
  • Kama Sutra of SSIS: A guide to loving ETL <- my talk on SSIS

Finally, the numbers in this post are amazingly dependent on hardware and workloads



标签: tsql ssis