Performance impact of using NTEXT field in SSIS

Posted 2019-02-25 04:33

We are bringing in data from a source system to our data warehouse.

They are changing some of their columns from nvarchar(255) to nvarchar(max). Looking at their data, it seems like 95% of the values are under 255 characters and 99% (probably 100%) are under 4,000 characters. We're having to change our SSIS package metadata to use NTEXT to bring in these columns (they're converting 8 columns across 4 tables).
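If you want to put hard numbers on that distribution, a quick profiling query against the source will do it. A minimal sketch, where dbo.SourceTable and WideColumn are placeholder names standing in for one of the affected tables and columns:

    -- Rough length profile for one of the widened columns.
    -- dbo.SourceTable and WideColumn are placeholder names, not the real source objects.
    SELECT
        COUNT(*)                                                 AS total_rows,
        SUM(CASE WHEN LEN(WideColumn) <= 255  THEN 1 ELSE 0 END) AS rows_up_to_255,
        SUM(CASE WHEN LEN(WideColumn) <= 4000 THEN 1 ELSE 0 END) AS rows_up_to_4000,
        MAX(LEN(WideColumn))                                     AS max_length
    FROM dbo.SourceTable;

If max_length never gets near 4,000, the data itself would still fit comfortably in a non-max type, which is worth showing the source system's owners.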

I'm not familiar with how SSIS handles nvarchar(max) (in SQL Server) / NTEXT (in SSIS) and what performance impact there will be, if any. Does SSIS handle these data types differently and possibly more slowly?

We basically do a straight read and dump into our staging environment and then proceed from there. I know the limitations on the SQL side, but not what to expect during the SSIS phase.

SQL Server 2012 / SSIS 2012

1 Answer
做个烂人
Answered 2019-02-25 04:55

Stop the owners of the source system now, before they have a disaster on their hands. See also the existing articles and answers on blindly switching columns to nvarchar(max); they cover some of the reasons this change is going to hurt your database.

Let's talk about SSIS and the Data Flow Task. The data flow task works on the concept of buffers. A buffer is a pre-sized unit of memory that N rows of data will fit into. This is one of the fundamental reasons SSIS is so persnickety about data types and is always checking them. Say we can fit 1,000 rows of 100 bytes each into a buffer. Then you go and double the size of all the columns in the source, and suddenly only 500 rows fit into memory at one time. That likely slows down the total throughput.
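To put rough numbers on that, assume the SSIS default of a 10 MB (10,485,760 byte) DefaultBufferSize; the row widths below are made up purely for illustration:

    -- Back-of-the-envelope: how many rows fit in one data flow buffer.
    -- 10,485,760 bytes is the SSIS 2012 DefaultBufferSize default;
    -- the row widths are illustrative, not measured from any real package.
    SELECT
        10485760 / 5000  AS rows_per_buffer_5kb_row,   -- ~2,097 rows per buffer
        10485760 / 10000 AS rows_per_buffer_10kb_row;  -- ~1,048 rows: double the row, half the rows

Fewer rows per buffer means more buffers, and more passes through the pipeline, to move the same number of rows.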

Binary/large object (LOB)/stream data is different. SSIS doesn't, and can't, know whether that data will fit into memory. You might dodge a bullet: the execution engine may be able to fit your strings in the buffer, and while you could see some slowdown, it may not be appreciable. But what if the engine determines this really is LOB data? Then you're humped. You can watch your performance slow to a crawl as SSIS creates all these lovely little files in your BLOBTempStoragePath location, and possibly BufferTempStoragePath as well. Instead of having the data in the buffer (memory = fast), you have a pointer in the buffer to a physical path on disk (disk = slow).

And you get to pay that performance penalty over and over. Think about it. You pull all the data out of dbo.BadDecision for your ETL process. That data is likely not all in the buffer cache, so SQL Server, or whatever host database you are using, has to read all of it from disk. Disk is slow. Maybe you stream that across the network to an ETL server. Those fat types can't fit into memory, so they get written back out to disk, possibly not even fast disk because you didn't set the buffer paths. Assuming you don't fill up the C: drive and crash the server, you pay to write the data you just read off disk into individual little files while it moves through the pipeline. That batch of records is now ready to be written to the target tables - guess what? We need to read that data back out of those files to actually store it. And of course, the target system then rewrites that data to disk.
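If you want a feel for how much data would be making those extra trips to and from disk, total up the bytes in the widened columns at the source. A sketch against the dbo.BadDecision table above, with WideColumn again standing in for one of the nvarchar(max) columns:

    -- Approximate volume of LOB data the package would carry per full load.
    -- WideColumn is a placeholder for one of the widened columns.
    SELECT
        COUNT(*)                                                 AS total_rows,
        SUM(CAST(DATALENGTH(WideColumn) AS bigint))              AS total_bytes,
        SUM(CAST(DATALENGTH(WideColumn) AS bigint)) / 1048576.0  AS total_mb
    FROM dbo.BadDecision;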

Sounds like fun, doesn't it?

So fight, fight this with everything you have. It is a poor, lazy decision that is going to bite the company, not just you and your team, hard in the ass.

I was unclear on the exact mechanics of how the engine spills LOBs to disk. John Welch's article, LOBs in the SSIS Dataflow, clarifies:

Because these data types have the potential to hold so much data, SSIS handles them a little differently than the standard data types. They are allocated separately from regular data in the buffers. When there is memory pressure, SSIS spools buffers to disk. The potential size of LOB data makes it very likely to be spooled, which can be a pretty major performance bottleneck.
