How do I XmlDocument.Save() to encoding=“us-ascii”

2019-02-20 02:11发布

问题:

My goal is to get a binary buffer (MemoryStream.ToArray() would yield byte[] in this case) of XML without losing the Unicode characters. I would expect the XML serializer to use numeric character references to represent anything that would be invalid in ASCII. So far, I have:

using System;
using System.IO;
using System.Text;
using System.Xml;

class Program
{
    static void Main(string[] args)
    {
        var doc = new XmlDocument();
        doc.LoadXml("<x>“∞π”</x>");
        using (var buf = new MemoryStream())
        {
            using (var writer = new StreamWriter(buf, Encoding.ASCII))
                doc.Save(writer);
            Console.Write(Encoding.ASCII.GetString(buf.ToArray()));
        }
    }
}

The above program produces the following output:

$ ./ConsoleApplication2.exe
<?xml version="1.0" encoding="us-ascii"?>
<x>????</x>

I figured out how to tell XmlDocument.Save() to use encoding="us-ascii"—by handing it a TextStream with TextStream.Encoding set to Encoding.ASCII. The documentation says The encoding on the TextWriter determines the encoding that is written out. But how can I tell it that I want it to use numeric character entities instead of its default lossy behavior? I have tested that doc.Save(Console.OpenStandardOutput()) writes the expected data (without an XML declaration) as UTF-8 with all of the correct characters, so I know that doc contains the information I wish to serialize. It’s just a matter of figuring out the right way to tell the XML serializer that I want encoding="us-ascii" with character entities…

I understand that it may be non-trivial to write XML documents that are both encoding="us-ascii" and supportive of constructs like <π/> (I think this one might only be doable with external document type definitions. Yes, I have tried just for fun.). But I thought it was quite common to output entities for non-ASCII characters in an ASCII XML document to support preservation of content and attribute value character data in Unicode-unfriendly environments. I thought that numeric character references representing Unicode characters was analogous to using base64 to protect a blob while keeping the content more readable. How do I do this with .NET?

回答1:

You can use XmlWriter instead:

  var doc = new XmlDocument();
    doc.LoadXml("<x>“∞π”</x>");
    using (var buf = new MemoryStream())
    {
        using (var writer =  XmlWriter.Create(buf, 
              new XmlWriterSettings{Encoding= Encoding.ASCII}))
        {
            doc.Save(writer);
        }
        Console.Write(Encoding.ASCII.GetString(buf.ToArray()));
    }

Outputs:

<?xml version="1.0" encoding="us-ascii"?><x>&#x201C;&#x221E;&#x3C0;&#x201D;</x>