What is the largest XML file SSIS can extract data

Posted 2019-05-07 02:36

Question:

We have an architecture where we use SSIS to extract data from XML batch files into a staging database for validation, prior to exporting it into production.

We control the XML format to some extent, and I've been asked to determine what the maximum number of records the XML batch file should contain. Based on the XML schema and some sample data, I can estimate the average record size and do some projections from there.

However, coming at it from the other angle, I'd like to get an indication of the technical limitations of SSIS when dealing with large XML files.

I'm aware that SSIS flattens and transforms the XML document into its own tabular, in-memory representation, so RAM becomes an obvious limiting factor, but in what proportion?

Can you say, for example, that SSIS requires at least 2.5 times the size of the file you're trying to load in available memory? Assuming that I have a 32GB box dedicated to this data-loading function, how large can my XML files be?

I'm aware that there might be other factors involved, such as the complexity of the schema, the number of nested elements, etc., but it would be nice to have a starting point.

Answer 1:

The XML Source does not load the whole document into memory; it streams data out as it reads the XML file. So if you are reading the XML and writing it to, say, text files without complex transformations, you need relatively little memory. Also, beyond a certain threshold the amount of memory you need stops growing as the XML file grows, so you can handle XML files of practically unlimited size.
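To make the streaming idea concrete, here is a minimal C# sketch of the forward-only reading pattern described above (it is not the XML Source's actual internals). The file name "records.xml" and the element name "Record" are placeholders; the point is that only the current node is ever held, so memory stays roughly flat regardless of file size.

```csharp
using System;
using System.Xml;

class StreamingReadDemo
{
    static void Main()
    {
        // "records.xml" is a placeholder path for this illustration.
        using (var reader = XmlReader.Create("records.xml"))
        {
            int recordCount = 0;
            while (reader.Read())
            {
                // Count each <Record> element as it streams past; nothing is buffered.
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Record")
                    recordCount++;
            }
            Console.WriteLine($"Records seen: {recordCount}");
        }
    }
}
```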

For example, this author imported the entire Wikipedia content (a 20 GB XML file): http://www.ideaexcursion.com/2009/01/26/import-wikipedia-articles-into-sql-server-with-ssis/

Of course, you will probably do something with that data, e.g. join the multiple streams coming out of the XML Source. Depending on what you need, that may require a lot of memory, because some transforms keep the whole dataset in memory, or perform much better when there is enough memory for the whole dataset.



Answer 2:

It's not that simple.

First of all, be aware that SSIS "flattens" the XML such that there is one output from the XML Source for each path through the XML. A trivial example is that

<Parent><Child><Grandchild/></Child></Parent>

will produce three outputs and three error outputs. It gets worse:

<Parent><Child><Grandchild><Notes/></Grandchild><Notes/></Child><Notes/></Parent>

This will produce Parent, Child, Grandchild, Parent-Child-Grandchild-Notes, Parent-Child-Notes, and Parent-Notes outputs, each with both a normal and an error output.

A project I worked on started out having about 203 outputs. I was able to flatten the XML schema and produce only 19 or so. That's still a lot, considering that each output needs to have its own processing done.

Also, the XML Task cannot handle 1GB or more of XML. It does, indeed, load the entire document into memory. Try doing an XmlDocument.Load of such a file and watch what happens - that's what happens to SSIS.
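For anyone who wants to see the difference, here is a minimal C# illustration of the contrast described above ("huge.xml" is a placeholder path). XmlDocument.Load materializes the entire DOM in memory, so a multi-gigabyte file will consume several times its size in RAM or fail outright, while XmlReader walks the same file with near-constant memory.

```csharp
using System;
using System.Xml;

class LoadVsStream
{
    static void LoadWholeDocument(string path)
    {
        var doc = new XmlDocument();
        doc.Load(path);                       // entire document materialized in memory
        Console.WriteLine(doc.DocumentElement.Name);
    }

    static void StreamDocument(string path)
    {
        using (var reader = XmlReader.Create(path))
        {
            while (reader.Read()) { }         // visits every node, buffers almost nothing
        }
        Console.WriteLine("Streamed without building a DOM.");
    }
}
```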

I had to create an "XML Element Source" of my own that processed the child elements of the root element one at a time. This enabled me to flatten the XML as well as handle large documents (a 10 GB test document worked).
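A hedged sketch of that general approach (not the author's actual component) looks like the following in C#: read forward to the root, then hand each child element, with its descendants, to downstream processing one subtree at a time. The element name "BatchItem" and the ProcessBatchItem method are illustrative placeholders.

```csharp
using System.Xml;
using System.Xml.Linq;

class XmlElementSourceSketch
{
    public static void Run(string path)
    {
        using (var reader = XmlReader.Create(path))
        {
            reader.MoveToContent();           // position on the root element
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "BatchItem")
                {
                    // ReadSubtree scopes a nested reader to just this child element,
                    // so only one child subtree is ever in memory at a time.
                    using (var subtree = reader.ReadSubtree())
                    {
                        var element = XElement.Load(subtree);
                        ProcessBatchItem(element);
                    }
                }
            }
        }
    }

    static void ProcessBatchItem(XElement item)
    {
        // Flatten this child element into rows for the data flow outputs here.
    }
}
```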

There's more fun depending on what you want to do with the resulting data. In my case, we had to send each of the outputs to staging tables. This is not a bad thing, but you have to realize that the data in the outputs are asynchronous. A single child element (with descendants) will reach the ends of the output paths a little at a time, and you'll never know when all the descendants have finished being processed. This makes it impossible to handle the processing on a transactional basis one element at a time.

Instead, SSIS adds a surrogate key (I think that's what it's called) to each child element. There would be a ParentID added to the parent, a ChildID to the child, and a ChildParentID also added to the child, to refer to the parent of the child. These can be used to "put the element back together again", but only after all data have finished being written to the staging tables. That's the only time you can be sure that any given element has been processed in full - when they all have!
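To illustrate the rejoin-key idea (SSIS generates these columns itself on the XML Source outputs; this is only a conceptual sketch, and ParentRow, ChildRow, and the counter-based keys are invented for the example), the flattened rows might carry keys like this:

```csharp
using System.Collections.Generic;

record ParentRow(int ParentId, string Name);
record ChildRow(int ChildId, int ChildParentId, string Name);

class FlattenWithKeys
{
    static int _parentCounter;
    static int _childCounter;

    public static (ParentRow, List<ChildRow>) Flatten(string parentName, IEnumerable<string> childNames)
    {
        var parent = new ParentRow(++_parentCounter, parentName);
        var children = new List<ChildRow>();
        foreach (var name in childNames)
        {
            // ChildParentId points back at the parent's key, so the two outputs can be
            // joined again once both have finished landing in their staging tables.
            children.Add(new ChildRow(++_childCounter, parent.ParentId, name));
        }
        return (parent, children);
    }
}
```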



Tags: xml ssis