-->

Parsing a large XML file to multiple output xmls,

2019-09-20 04:25发布

问题:

I need to take a very large XML file and create multiple output xml files from what could be thousands of repeating nodes of the input file. There is no whitespace in the source file "AnimalBatch.xml" which looks like this:

<?xml version="1.0" encoding="utf-8" ?><Animals><Animal id="1001"><Quantity>One</Quantity><Adjective>Red</Adjective><Name>Rooster</Name></Animal><Animal id="1002"><Quantity>Two</Quantity><Adjective>Stubborn</Adjective><Name>Donkeys</Name></Animal><Animal id="1003"><Quantity>Three</Quantity><Adjective>Blind</Adjective><Name>Mice</Name></Animal><Animal id="1004"><Quantity>Four</Quantity><Adjective>Purple</Adjective><Name>Horses</Name></Animal><Animal id="1005"><Quantity>Five</Quantity><Adjective>Long</Adjective><Name>Centipedes</Name></Animal><Animal id="1006"><Quantity>Six</Quantity><Adjective>Dark</Adjective><Name>Owls</Name></Animal></Animals>

The program needs to split the repeating "Animal" and produce the appropriate number of files named: Animal_1001.xml, Animal_1002.xml, Animal_1003.xml, etc.


Animal_1001.xml:
<?xml version="1.0" encoding="utf-8"?>
<Animal>
<Quantity>One</Quantity>
<Adjective>Red</Adjective>
<Name>Rooster</Name>
</Animal>


Animal_1002.xml
<?xml version="1.0" encoding="utf-8"?>
<Animal>
<Quantity>Two</Quantity>
<Adjective>Stubborn</Adjective>
<Name>Donkeys</Name>
</Animal>


Animal_1003.xml>
<?xml version="1.0" encoding="utf-8"?>
<Animal>
<Quantity>Three</Quantity>
<Adjective>Blind</Adjective>
<Name>Mice</Name>
</Animal>

The code below works, but only if the input file has CR/LF after the <Animal id="xxxx"> elements. If it has no "whitespace" (I don't, and can't get it like that), I get every other one (the odd numbered animals)

    static void SplitXMLReader()
    {
        string strFileName;
        string strSeq = "";

        XmlReader doc = XmlReader.Create("C:\\AnimalBatch.xml");

        while (doc.Read())
        {
            if ( doc.Name == "Animal"  && doc.NodeType == XmlNodeType.Element )
            {
                strSeq = doc.GetAttribute("id"); 

                XmlDocument outdoc = new XmlDocument();
                XmlDeclaration xmlDeclaration = outdoc.CreateXmlDeclaration("1.0", "utf-8", null);                     
                XmlElement rootNode = outdoc.CreateElement(doc.Name);

                rootNode.InnerXml = doc.ReadInnerXml();  
                // This seems to be advancing the cursor in doc too far.

                outdoc.InsertBefore(xmlDeclaration, outdoc.DocumentElement);
                outdoc.AppendChild(rootNode);

                strFileName = "Animal_" + strSeq + ".xml";
                outdoc.Save("C:\\" + strFileName);                    
            }
        }
    }

My understanding is that "whitespace" or formatting in XML should make no difference to XmlReader - but I've tried this both ways, with and without CR/LF's after the <Animal id="xxxx">, and can confirm there is a difference. If it has CR/LFs (possibly even just a space, which I'll try next) - it gets each <Animal> node processed fully, and saved under the right filename that comes from the id attribute.

Can someone let me know what's going on here - and a possible workaround?

回答1:

yes, when using the doc.readInnerXml() white space is important.

From the documentation of the function. This returns a string. so of course white space will matter. If you want the inner text as a xmlNode you should use something like this



回答2:

Thanks for the guidance on using the ReadSubTree() method:

This code works for the XML input file with no linefeeds:

    static void SplitXMLReaderSubTree()
    {
        string strFileName;
        string strSeq = "";
        XmlReader doc = XmlReader.Create("C:\\AnimalBatch.xml");

        while (!doc.EOF)
        {
            if ( doc.Name == "Animal"  && doc.NodeType == XmlNodeType.Element )
            {
                strSeq = doc.GetAttribute("id");
                XmlReader inner = doc.ReadSubtree();
                inner.Read();
                XmlDocument outdoc = new XmlDocument();
                XmlDeclaration xmlDeclaration = outdoc.CreateXmlDeclaration("1.0", "utf-8", null);
                XmlElement myElement;
                myElement = outdoc.CreateElement(doc.Name);
                myElement.InnerXml = inner.ReadInnerXml();
                inner.Close();
                myElement.Attributes.RemoveAll();
                outdoc.InsertBefore(xmlDeclaration, outdoc.DocumentElement);
                outdoc.ImportNode(myElement, true);
                outdoc.AppendChild(myElement);
                strFileName = "Animal_" + strSeq + ".xml";
                outdoc.Save("C:\\" + strFileName);                    
            }
            else
            {
                doc.Read();
            }
        }