Importing and parsing a large XML file in SQL Serv

2019-09-18 00:20发布

I have a large XML file that I need to import and parse into tabular structure ("flatten") in SQL Server. By "large" I mean a file that is around 450 MB, contains up to 6-7 nested levels and lots of elements, ~300.

I tried parsing the file both with OPENXML and Xml.Nodes. Both of the methods are slow. A partial query which reads a parent element and it's nested grandchildren takes several minutes if not dozens to run.

I tried using the SQLXML Bulk Load method. Unfortunately I couldn't - because the file isn't structured properly. There is an element which is logically a parent element which isn't nested as a parent physically.

Do you think the only posiblle solution left is to use .NET or Java? Is there something I'm missing?

I would strongly prefer a dynamic solution, to some degree. I don't want the SQL Server developers to relay on a procedural, compiled, code that they have no control/knowledge about - in the event that some changes will occur (in the XML structure).

Thank you very much.

3条回答
Emotional °昔
2楼-- · 2019-09-18 00:34

Since you want a tabular structure, you could convert the XML to a CSV file (using this java or this .NET tool , or even an XSLT tranformation) and then perform a bulk insert.

Of course, all that depends on your XML being properly formed.

查看更多
一夜七次
3楼-- · 2019-09-18 00:48

Well, first of all I don't really understand why you would use OpenXml to load the file. I am pretty sure that doing that will internally trigger a whole bunch of tests for validity according to OpenXml ISO Standard.

But - Xml.Nodes() (I assume that means the DOM way of loading data) - is by far the slowest way to load and parse Xml data. Consider instead a SAX approach using XmlReader or similar. I do realize that the article is from 2004 - but it still explains the stuff pretty well.

查看更多
来,给爷笑一个
4楼-- · 2019-09-18 00:58

OK. I created an XML Index on the XML data column. (Just a primary one for now).

A query that took ~4:30 minutes before takes ~9 seconds now! Seems that a table that stores the XML with a proper XML Index and the parsing the data with the xml.nodes() function are a feasible solution.

Thank you all.

查看更多
登录 后发表回答