My goal is to get a binary buffer (MemoryStream.ToArray() would yield byte[] in this case) of XML without losing the Unicode characters. I would expect the XML serializer to use numeric character references to represent anything that cannot be represented in ASCII. So far, I have:
using System;
using System.IO;
using System.Text;
using System.Xml;

class Program
{
    static void Main(string[] args)
    {
        var doc = new XmlDocument();
        doc.LoadXml("<x>“∞π”</x>");
        using (var buf = new MemoryStream())
        {
            using (var writer = new StreamWriter(buf, Encoding.ASCII))
                doc.Save(writer);
            Console.Write(Encoding.ASCII.GetString(buf.ToArray()));
        }
    }
}
The above program produces the following output:
$ ./ConsoleApplication2.exe
<?xml version="1.0" encoding="us-ascii"?>
<x>????</x>
I figured out how to tell XmlDocument.Save() to use encoding="us-ascii": by handing it a TextWriter with TextWriter.Encoding set to Encoding.ASCII. The documentation says "The encoding on the TextWriter determines the encoding that is written out". But how can I tell it that I want it to use numeric character references instead of its default lossy behavior? I have tested that doc.Save(Console.OpenStandardOutput()) writes the expected data (without an XML declaration) as UTF-8 with all of the correct characters, so I know that doc contains the information I wish to serialize. It's just a matter of figuring out the right way to tell the XML serializer that I want encoding="us-ascii" with numeric character references.
I understand that it may be non-trivial to write XML documents that are both encoding="us-ascii" and supportive of constructs like <π/> (I think this one might only be doable with external document type definitions. Yes, I have tried, just for fun). But I thought it was quite common to output numeric character references for non-ASCII characters in an ASCII XML document, to preserve content and attribute value character data in Unicode-unfriendly environments. I think of using numeric character references for Unicode characters as analogous to using base64 to protect a blob, while keeping the content more readable. How do I do this with .NET?
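One direction that may be worth trying is sketched below. It rests on an assumption: that an XmlWriter created over the Stream itself, with XmlWriterSettings.Encoding set to Encoding.ASCII, handles the encoding step and can therefore escape characters the target encoding cannot represent, whereas a StreamWriter has already lost that information by the time it substitutes '?'. The class name AsciiXmlSketch is hypothetical, and the behavior should be checked against the framework version in use.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

class AsciiXmlSketch // hypothetical name, for illustration only
{
    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<x>“∞π”</x>");

        // Assumption: handing the Stream (not a pre-built TextWriter) to
        // XmlWriter.Create lets the XmlWriter perform the ASCII encoding itself.
        var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };

        using (var buf = new MemoryStream())
        {
            // Dispose the writer first so everything is flushed into buf.
            using (var writer = XmlWriter.Create(buf, settings))
                doc.Save(writer);

            byte[] bytes = buf.ToArray();
            Console.Write(Encoding.ASCII.GetString(bytes));
        }
    }
}
```

If the assumption holds, the byte[] from buf.ToArray() contains only ASCII bytes and the four non-ASCII characters appear as numeric character references instead of question marks.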