C# Issue with reading XML with chars of different

I ran into a problem reading an XML file. I found a workaround, but I still have some questions. The faulty XML file is encoded in UTF-8 and declares that encoding in its header, but it also contains one character, 'é', that is encoded in UTF-16. This code was used to read the XML file in order to validate its content:

var xDoc = XDocument.Load(taxFile);

For the faulty XML file it throws an exception: "Invalid character in the given encoding. Line 59, position 104." The quick fix is as follows:

XDocument xDoc = null;
using (var oReader = new StreamReader(taxFile, Encoding.UTF8))
{
    xDoc = XDocument.Load(oReader);
}

This code doesn't throw an exception for the faulty file, but the 'é' character is loaded as �. My first question is: why does it work?

Another point is that XmlReader doesn't throw an exception until the node containing 'é' is read.

XmlReader xmlTax = XmlReader.Create(filePath);

Again, the StreamReader workaround helps, and I have the same question. The fix doesn't seem good enough, because one day :) an XML file in another encoding may appear and be processed incorrectly. BUT I tried processing a UTF-16 encoded XML file with the StreamReader configured for UTF-8, and it worked fine.

The final question is whether XDocument/XmlReader can be given any option to ignore character encoding errors or something like that.

Looking forward to your replies. Thanks in advance.

Tags: c# xml encoding

1 answer

啃猪蹄的小仙女

#2 · 2019-03-03 14:19

The first thing to note is that the XML file is in fact flawed: text encodings should not be mixed within one file like this. The error is even more obvious when the file carries an explicit encoding declaration.

As for why it can be read without an exception through StreamReader, it's because an Encoding object carries settings that control what happens when incompatible data is encountered.

Encoding.UTF8 is documented to use fallback characters. From http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8.aspx:

The UTF8Encoding object that is returned by this property may not have the appropriate behavior for your application. It uses replacement fallback to replace each string that it cannot encode and each byte that it cannot decode with a question mark ("?") character.
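To illustrate the replacement fallback described in that quote, here is a minimal sketch. It assumes the stray 'é' is stored as the single byte 0xE9, which is not a valid UTF-8 sequence on its own; the question does not show the actual bytes, so that value is only an illustration. The class and method names are made up for this example.

```csharp
using System;
using System.Text;

static class ReplacementFallbackDemo
{
    // Decodes with the shared Encoding.UTF8 instance, which substitutes
    // a replacement character for bytes it cannot decode instead of throwing.
    public static string DecodeLeniently(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    static void Main()
    {
        // "caf", then 0xE9 (assumed stray byte), then "!".
        byte[] bytes = { 0x63, 0x61, 0x66, 0xE9, 0x21 };
        string text = DecodeLeniently(bytes);

        // No exception is thrown; the bad byte surfaces as U+FFFD,
        // the same character the question shows in place of 'é'.
        Console.WriteLine(text.IndexOf('\uFFFD') >= 0);
    }
}
```

This is the same leniency that the StreamReader workaround in the question inherits, because it is constructed with Encoding.UTF8.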

You can instantiate the encoding yourself to get different settings. This is most probably what XDocument.Load() does, as it would generally be bad to hide errors by default. http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.aspx
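As a sketch of instantiating the encoding yourself, the UTF8Encoding constructor accepts a throwOnInvalidBytes flag that makes decoding fail instead of substituting characters. Whether XDocument.Load() does exactly this internally is not confirmed here; the example only shows the strict setting. The helper name IsValidUtf8 and the sample bytes are assumptions for illustration.

```csharp
using System;
using System.Text;

static class StrictUtf8Demo
{
    // Returns true when the bytes are well-formed UTF-8, false otherwise.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // throwOnInvalidBytes: true selects the exception fallback.
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        // "café" with 'é' correctly encoded as the UTF-8 pair 0xC3 0xA9.
        Console.WriteLine(IsValidUtf8(new byte[] { 0x63, 0x61, 0x66, 0xC3, 0xA9 }));

        // "caf" followed by a lone 0xE9 (assumed stray byte) and "!".
        Console.WriteLine(IsValidUtf8(new byte[] { 0x63, 0x61, 0x66, 0xE9, 0x21 }));
    }
}
```

Passing such a strict instance to the StreamReader from the question, instead of Encoding.UTF8, should bring the error back rather than hide it.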

If you are being sent such broken XML files, step 1 is to complain (loudly) about it. There is no valid reason for such behavior. If you then absolutely must process them anyway, I suggest having a look at the UTF8Encoding class and its DecoderFallback property. It seems you should be able to implement a custom DecoderFallback and DecoderFallbackBuffer to add logic that will understand the UTF-16 byte sequence.
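A rough sketch of such a custom fallback follows. It makes a loud assumption: each undecodable byte is treated as an ISO-8859-1 (Latin-1) code point, so a stray 0xE9 becomes 'é'. The real files may need different logic. The names Latin1DecoderFallback, Latin1FallbackBuffer, and Latin1FallbackDemo are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical fallback: maps each undecodable byte to the Latin-1
// character with the same code point (an assumption about the data).
sealed class Latin1DecoderFallback : DecoderFallback
{
    // Generous upper bound on the characters returned per fallback call.
    public override int MaxCharCount => 4;

    public override DecoderFallbackBuffer CreateFallbackBuffer() =>
        new Latin1FallbackBuffer();
}

sealed class Latin1FallbackBuffer : DecoderFallbackBuffer
{
    private char[] _chars = new char[0];
    private int _position;

    // Called by the decoder with the bytes it could not decode.
    public override bool Fallback(byte[] bytesUnknown, int index)
    {
        _chars = new char[bytesUnknown.Length];
        for (int i = 0; i < bytesUnknown.Length; i++)
            _chars[i] = (char)bytesUnknown[i];
        _position = 0;
        return _chars.Length > 0;
    }

    // Hands the substitute characters back one at a time; '\0' means done.
    public override char GetNextChar() =>
        _position < _chars.Length ? _chars[_position++] : '\0';

    public override bool MovePrevious()
    {
        if (_position == 0) return false;
        _position--;
        return true;
    }

    public override int Remaining => _chars.Length - _position;

    public override void Reset()
    {
        _chars = new char[0];
        _position = 0;
    }
}

static class Latin1FallbackDemo
{
    public static string Decode(byte[] bytes)
    {
        // UTF-8 decoding with the custom decoder fallback plugged in.
        Encoding tolerant = Encoding.GetEncoding(
            "utf-8",
            EncoderFallback.ExceptionFallback,
            new Latin1DecoderFallback());
        return tolerant.GetString(bytes);
    }

    static void Main()
    {
        // "caf", a lone 0xE9 (assumed stray byte), then "!".
        Console.WriteLine(Decode(new byte[] { 0x63, 0x61, 0x66, 0xE9, 0x21 }) == "café!");
    }
}
```

An encoding built this way can be handed to a StreamReader in place of Encoding.UTF8. Note that it silently guesses at the meaning of bad bytes, so it only makes sense when the source of the files is known.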
