XDocument.Save() removes my entities

Posted 2019-04-28 05:00

Question:

I wrote a tool to repair some XML files (i.e., insert some attributes/values that were missing) using C# and Linq-to-XML. The tool loads an existing XML file into an XDocument object. Then, it parses down through the node to insert the missing data. After that, it calls XDocument.Save() to save the changes out to another directory.

All of that is just fine except for one thing: any &#xA; entities that are in the text in the XML file are replaced with a new line character. The entity represents a new line, of course, but I need to preserve the entity in the XML because another consumer needs it in there.

Is there any way to save the modified XDocument without losing the &#xA; entities?

Thank you.

Answer 1:

The &#xA; entities are technically called “numeric character references” in XML, and they are resolved when the original document is loaded into the XDocument. This makes your issue problematic to solve, since there is no way of distinguishing resolved whitespace entities from insignificant whitespace (typically used for formatting XML documents for plain-text viewers) after the XDocument has been loaded. Thus, the below only applies if your document does not have any insignificant whitespace.
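To see the resolution happen, here is a small sketch. The class name EntityResolutionDemo and the sample document are made up for illustration; the point is only that, once the document is parsed, the text holds a real line feed and nothing records that it was originally written as &#xA;.

```csharp
using System;
using System.Xml.Linq;

public static class EntityResolutionDemo
{
    // Parses a tiny sample document and returns both the text value of the
    // root element and the XML produced by serializing the document again.
    public static (string Value, string Xml) Run()
    {
        XDocument doc = XDocument.Parse("<note>line1&#xA;line2</note>");

        // The character reference has already been resolved to '\n' here.
        string value = doc.Root?.Value ?? string.Empty;

        // Serializing writes a literal line break, not the &#xA; reference.
        string xml = doc.ToString();

        return (value, xml);
    }
}
```

Calling EntityResolutionDemo.Run() should give a Value of "line1\nline2" and an Xml string that no longer contains &#xA;.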

The System.Xml library allows one to preserve whitespace entities by setting the NewLineHandling property of the XmlWriterSettings class to Entitize. However, within text nodes, this would only entitize \r to &#xD;, and not \n to &#xA;.
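A minimal sketch of that setting in action, assuming a hypothetical NewLineHandlingDemo class and a made-up sample string containing both \r and \n:

```csharp
using System.Text;
using System.Xml;

public static class NewLineHandlingDemo
{
    // Writes <a>x\r\ny</a> with NewLineHandling.Entitize and returns the markup.
    public static string Run()
    {
        var settings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,
            NewLineHandling = NewLineHandling.Entitize
        };

        var sb = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("a");
            // Text node content: \r should come out as &#xD;, while \n stays
            // a literal line feed rather than becoming &#xA;.
            writer.WriteString("x\r\ny");
            writer.WriteEndElement();
        }
        return sb.ToString();
    }
}
```

This is why the built-in setting alone does not help the asker: the line feeds that need to become &#xA; are left as they are.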

The easiest solution is to derive from the XmlWriter class and override its WriteString method to manually replace the whitespace characters with their numeric character references. The WriteString method is also where .NET escapes the markup characters &, <, and >, writing them as &amp;, &lt;, and &gt; respectively.

Since XmlWriter is abstract, we shall derive from XmlTextWriter in order to avoid having to implement all the abstract methods of the former class. Here is a quick-and-dirty implementation:

using System.IO;
using System.Xml;

public class EntitizingXmlWriter : XmlTextWriter
{
    public EntitizingXmlWriter(TextWriter writer) :
        base(writer)
    { }

    public override void WriteString(string text)
    {
        foreach (char c in text)
        {
            switch (c)
            {
                case '\r':
                case '\n':
                case '\t':
                    base.WriteCharEntity(c);
                    break;
                default:
                    base.WriteString(c.ToString());
                    break;
            }
        }
    }
}

If intended for use in a production environment, you’d want to do away with the c.ToString() part, since it’s very inefficient. You can optimize the code by batching substrings of the original text that do not contain any of the characters you want to entitize, and feeding them together into a single base.WriteString call.

A word of warning: The following naive implementation will not work, since the base WriteString method would replace any & characters with &amp;, thereby causing \n, for example, to be written as &amp;#xA; instead of &#xA;.

    public override void WriteString(string text)
    {
        text = text.Replace("\r", "&#xD;");
        text = text.Replace("\n", "&#xA;");
        text = text.Replace("\t", "&#x9;");
        base.WriteString(text);
    }
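The double escaping is easy to reproduce in isolation. The sketch below (DoubleEscapeDemo is a hypothetical name) passes text that already contains the characters "&#xA;" to a plain XmlTextWriter, which is what the naive override ends up doing:

```csharp
using System.IO;
using System.Xml;

public static class DoubleEscapeDemo
{
    // Writes a string that already contains "&#xA;" through the stock
    // XmlTextWriter.WriteString and returns the resulting markup.
    public static string Run()
    {
        var sw = new StringWriter();
        using (var writer = new XmlTextWriter(sw))
        {
            writer.WriteStartElement("a");
            writer.WriteString("x&#xA;y"); // the '&' is escaped like any other ampersand
            writer.WriteEndElement();
        }
        return sw.ToString(); // <a>x&amp;#xA;y</a>
    }
}
```

A consumer reading that output sees the literal text "&#xA;" rather than a line feed, which is why the working version calls WriteCharEntity instead.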

Finally, to save your XDocument into a destination file or stream, just use the following snippet:

using (var textWriter = new StreamWriter(destination))
using (var xmlWriter = new EntitizingXmlWriter(textWriter))
    document.Save(xmlWriter);

Hope this helps!

Edit: For reference, here is an optimized version of the overridden WriteString method:

public override void WriteString(string text)
{
    // The start index of the next substring containing only non-entitized characters.
    int start = 0;

    // The index of the current character being checked.
    for (int curr = 0; curr < text.Length; ++curr)
    {
        // Check whether the current character should be entitized.
        char chr = text[curr];
        if (chr == '\r' || chr == '\n' || chr == '\t')
        {
            // Write the previous substring of non-entitized characters.
            if (start < curr)
                base.WriteString(text.Substring(start, curr - start));

            // Write current character, entitized.
            base.WriteCharEntity(chr);

            // Next substring of non-entitized characters tentatively starts
            // immediately beyond current character.
            start = curr + 1;
        }
    }

    // Write the trailing substring of non-entitized characters.
    if (start < text.Length)
        base.WriteString(text.Substring(start, text.Length - start));
}
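As a quick end-to-end check of this approach, the following sketch loads a sample document, makes a small change, and saves it through the custom writer into a string. The writer class is repeated so that the snippet compiles on its own; RoundTripDemo, the sample XML, and the "fixed" attribute are made up for illustration, and the null check is an extra precaution that is not part of the code above.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Same idea as the EntitizingXmlWriter above, using the batching WriteString.
public class EntitizingXmlWriter : XmlTextWriter
{
    public EntitizingXmlWriter(TextWriter writer) : base(writer) { }

    public override void WriteString(string text)
    {
        if (string.IsNullOrEmpty(text))
            return;

        int start = 0; // start of the pending run of ordinary characters
        for (int curr = 0; curr < text.Length; ++curr)
        {
            char chr = text[curr];
            if (chr == '\r' || chr == '\n' || chr == '\t')
            {
                if (start < curr)
                    base.WriteString(text.Substring(start, curr - start));
                base.WriteCharEntity(chr); // e.g. '\n' is written as &#xA;
                start = curr + 1;
            }
        }
        if (start < text.Length)
            base.WriteString(text.Substring(start));
    }
}

public static class RoundTripDemo
{
    // Loads a sample document, modifies it, and saves it via the custom writer.
    public static string Run()
    {
        XDocument document = XDocument.Parse("<note>line1&#xA;line2</note>");
        document.Root?.SetAttributeValue("fixed", "true"); // stand-in for the repair step

        var textWriter = new StringWriter();
        using (var xmlWriter = new EntitizingXmlWriter(textWriter))
            document.Save(xmlWriter);
        return textWriter.ToString();
    }
}
```

The returned string should contain line1&#xA;line2 again, alongside the newly added attribute.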


Answer 2:

If your document contains insignificant whitespace which you want to distinguish from your &#xA; entities, you can use the following (much simpler) solution: Convert the &#xA; character references temporarily to another character (that is not already present in your document), perform your XML processing, and then convert the character back in the output result. In the example below, we shall use the private-use character U+E800.

static string ProcessXml(string input)
{
    input = input.Replace("&#xA;", "&#xE800;");
    XDocument document = XDocument.Parse(input);
    // TODO: Perform XML processing here.
    string output = document.ToString();
    return output.Replace("\uE800", "&#xA;");
}

Note that, since XDocument resolves numeric character references to their corresponding Unicode characters, the "&#xE800;" entities would have been resolved to '\uE800' in the output.

Typically, you can safely use any code point from Unicode’s “Private Use Area” (U+E000 to U+F8FF). If you want to be extra safe, perform a check that the character is not already present in the document; if it is, pick another character from the said range. Since you’ll only be using the character temporarily and internally, it does not matter which one you use. In the very unlikely scenario that all private use characters are already present in the document, throw an exception; however, I doubt that that will ever happen in practice.
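The "extra safe" variant could look roughly like the sketch below. PlaceholderRoundTrip and FindUnusedPrivateUseChar are hypothetical names, and the check is deliberately simple: it looks only for the literal character and its hexadecimal reference, so a document that spells the same code point as a decimal reference would slip through. Like the example above, it only handles line feeds written exactly as &#xA;.

```csharp
using System;
using System.Xml.Linq;

public static class PlaceholderRoundTrip
{
    // Returns the first Private Use Area character (U+E000 to U+F8FF) that does
    // not occur in the input, either literally or as a hexadecimal reference.
    public static char FindUnusedPrivateUseChar(string input)
    {
        for (int code = 0xE000; code <= 0xF8FF; code++)
        {
            string reference = "&#x" + code.ToString("X") + ";";
            if (input.IndexOf((char)code) < 0 &&
                input.IndexOf(reference, StringComparison.OrdinalIgnoreCase) < 0)
            {
                return (char)code;
            }
        }
        throw new InvalidOperationException("No unused private-use character is available.");
    }

    public static string ProcessXml(string input)
    {
        char placeholder = FindUnusedPrivateUseChar(input);
        string placeholderRef = "&#x" + ((int)placeholder).ToString("X") + ";";

        // Swap the line-feed references for the placeholder before parsing.
        XDocument document = XDocument.Parse(input.Replace("&#xA;", placeholderRef));

        // (XML processing would go here.)

        // The parser resolved the placeholder reference to the character itself,
        // so search for the character when converting back.
        string output = document.ToString();
        return output.Replace(placeholder.ToString(), "&#xA;");
    }
}
```

For instance, PlaceholderRoundTrip.ProcessXml("<note>line1&#xA;line2</note>") should hand back markup that still contains line1&#xA;line2.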