-->

How to best detect encoding in XML file?

2019-01-26 10:20发布

问题:

To load XML files with arbitrary encoding I have the following code:

Encoding encoding;
using (var reader = new XmlTextReader(filepath))
{
    reader.MoveToContent();
encoding = reader.Encoding;
}

var settings = new XmlReaderSettings { NameTable = new NameTable() };
var xmlns = new XmlNamespaceManager(settings.NameTable);
var context = new XmlParserContext(null, xmlns, "", XmlSpace.Default, 
    encoding);
using (var reader = XmlReader.Create(filepath, settings, context))
{
    return XElement.Load(reader);
}

This works, but it seems a bit inefficient to open the file twice. Is there a better way to detect the encoding such that I can do:

 1. Open file
 2. Detect encoding
 3. Read XML into an XElement
 4. Close file

回答1:

Ok, I should have thought of this earlier. Both XmlTextReader (which gives us the Encoding) and XmlReader.Create (which allows us to specify encoding) accepts a Stream. So how about first opening a FileStream and then use this with both XmlTextReader and XmlReader, like this:

using (var txtreader = new FileStream(filepath, FileMode.Open))
{
    using (var xmlreader = new XmlTextReader(txtreader))
    {
        // Read in the encoding info
        xmlreader.MoveToContent();
        var encoding = xmlreader.Encoding;

        // Rewind to the beginning
        txtreader.Seek(0, SeekOrigin.Begin);

        var settings = new XmlReaderSettings { NameTable = new NameTable() };
        var xmlns = new XmlNamespaceManager(settings.NameTable);
        var context = new XmlParserContext(null, xmlns, "", XmlSpace.Default,
                 encoding);

        using (var reader = XmlReader.Create(txtreader, settings, context))
        {
            return XElement.Load(reader);
        }
    }
}

This works like a charm. Reading XML files in an encoding independent way should have been more elegant but at least I'm getting away with only one file open.



回答2:

Another option, quite simple, is to use Linq to XML. The Load method automatically reads the encoding from the xml file. You can then get the encoder value by using the XDeclaration.Encoding property. An example from MSDN:

// Create the document
XDocument encodedDoc16 = new XDocument(
new XDeclaration("1.0", "utf-16", "yes"),
new XElement("Root", "Content")
);
encodedDoc16.Save("EncodedUtf16.xml");
Console.WriteLine("Encoding is:{0}", encodedDoc16.Declaration.Encoding);
Console.WriteLine();

// Read the document
XDocument newDoc16 = XDocument.Load("EncodedUtf16.xml");
Console.WriteLine("Encoded document:");
Console.WriteLine(File.ReadAllText("EncodedUtf16.xml"));
Console.WriteLine();
Console.WriteLine("Encoding of loaded document is:{0}", newDoc16.Declaration.Encoding);

While this may not server the original poster, as he would have to refactor a lot of code, it is useful for someone who has to write new code for their project, or if they think that refactoring is worth it.