C# Issue with reading XML with chars of different

2019-03-03 14:05发布

I faced a problem with reading the XML. The solution was found, but there are still some questions. The incorrect XML file is in encoded in UTF-8 and has appropriate mark in its header. But it also includes a char encoded in UTF-16 - 'é'. This code was used to read XML file for validating its content:

var xDoc = XDocument.Load(taxFile);

It raises exception for specified incorrect XML file: "Invalid character in the given encoding. Line 59, position 104." The quick fix is as follows:

XDocument xDoc = null;
using (var oReader = new StreamReader(taxFile, Encoding.UTF8))
{
    xDoc = XDocument.Load(oReader);
}

This code doesn't raise exception for the incorrect file. But the 'é' character is loaded as �. My first question is "why does it work?".

Another point is using XmlReader doesn't raise exception until the node with 'é' is loaded.

XmlReader xmlTax = XmlReader.Create(filePath);

And again the workout with StreamReader helps. The same question. It seems like the fix solution is not good enough, cause one day :) XML encoded in another format may appear and it could be proceed in the wrong way. BUT I've tried to process UTF-16 formatted XML file and it worked fine (configured to UTF-8).

The final question is if there are any options to be provided for XDocument/XmlReader to ignore characters encoding or smth like this.

Looking forward for your replies. Thanks in advance

标签: c# xml encoding
1条回答
啃猪蹄的小仙女
2楼-- · 2019-03-03 14:19

The first thing to note is that the XML file is in fact flawed - mixing text encodings in the same file like this should not be done. The error is even more obvious when the file actually has an explicit encoding embedded.

As for why it can be read without exception with StreamReader, it's because Encoding contains settings to control what happens when incompatible data is encountered

Encoding.UTF8 is documented to use fallback characters. From http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8.aspx:

The UTF8Encoding object that is returned by this property may not have the appropriate behavior for your application. It uses replacement fallback to replace each string that it cannot encode and each byte that it cannot decode with a question mark ("?") character.

You can instantiate the encoding yourself to get different settings. This is most probably what XDocument.Load() does, as it would generally be bad to hide errors by default. http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.aspx

If you are being sent such broken XML files step 1 is to complain (loudly) about it. There is no valid reason for such behavior. If you then absolutely must process them anyway, I suggest having a look at the UTF8Encoding class and its DecoderFallbackProperty. It seems you should be able to implement a custom DecoderFallback and DecoderFallbackBuffer to add logic that will understand the UTF-16 byte sequence.

查看更多
登录 后发表回答