XmlDocument.Load() method fails to decode € (euro)

Posted 2019-06-08 12:15

I have an XML document, file.xml, which is encoded in ISO-8859-15 (also known as Latin-9):

<?xml version="1.0" encoding="iso-8859-15"?>
<root xmlns="http://stackoverflow.com/demo">
  <f>€.txt</f>
</root>

My text editor confirms that the file really is encoded in ISO-8859-15 (it is not UTF-8).

My software is written in C# and needs to extract the value of the element f.

XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load("file.xml"); 

In real life, I have an XmlResolver to set credentials, but basically my code is as simple as that. The document loads without raising any exception.

The problem appears when I extract the value:

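// xnsm is the XmlNamespaceManager; it is assumed to map the prefix "d"
// to http://stackoverflow.com/demo, since XPath ignores the default namespace
XmlNode n = xmlDoc.SelectSingleNode("//d:root/d:f", xnsm);
string filename = null;
if (n != null)
    filename = n.InnerText;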

The Visual Studio debugger displays filename = □.txt

This could be nothing more than a display problem in Visual Studio. Unfortunately, File.Exists(filename) returns false even though the file actually exists.

What's wrong?

3 answers
成全新的幸福
#2 · 2019-06-08 12:37

Don't just use the debugger or the console to display the string as a string.

Instead, dump the contents of the string, one character at a time. For example:

foreach (char c in filename)
{
    Console.WriteLine("{0}: {1:x4}", c, (int) c);
}

That will show you the real contents of the string as UTF-16 code units (which equal the Unicode code points for characters such as €), instead of being constrained by what the current font can display.

Use the Unicode code charts to look up the characters specified.
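As a rough guide to reading that output, here is a minimal sketch that annotates the three values most likely to show up in place of the euro sign. The helper name Describe and the sample string are hypothetical. The code points themselves are standard: U+20AC is the euro sign, U+00A4 is the currency sign (it occupies byte 0xA4 in ISO-8859-1, the same position where ISO-8859-15 has the euro sign), and U+FFFD is the replacement character that decoders emit for bytes they cannot interpret.

```csharp
using System;

static class CharDump
{
    // Hypothetical helper: says what a given UTF-16 code unit suggests
    // about how the euro sign was decoded.
    static string Describe(char c)
    {
        switch ((int)c)
        {
            case 0x20AC: return "EURO SIGN: decoded correctly";
            case 0x00A4: return "CURRENCY SIGN: byte 0xA4 was probably read as ISO-8859-1";
            case 0xFFFD: return "REPLACEMENT CHARACTER: byte was invalid in the encoding used";
            default: return "";
        }
    }

    static void Main()
    {
        // Stand-in for the value read from the XML document.
        string filename = "\u20AC.txt";

        // Print only the numeric value, so the console font cannot hide anything.
        foreach (char c in filename)
        {
            Console.WriteLine("{0:x4} {1}", (int)c, Describe(c));
        }
    }
}
```

If the first line shows 20ac, the string is fine and the problem lies elsewhere (for example in the file name on disk); 00a4 or fffd would point at the decoding step.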

老娘就宠你
#3 · 2019-06-08 12:58
  1. Does your XML declare its encoding correctly? Is encoding="iso-8859-15" really the encoding the file is saved in?

  2. Ideally, you should put your content inside a CDATA section, so the XML would look like <f><![CDATA[€.txt]]></f>

  3. Ideally, you should also escape all special characters with their URL-encoded (percent-encoded) equivalents, because XML is typically exchanged over HTTP.

I don't know the exact escape code for € (the %3E below is only a placeholder), but it would be something like this:

<f><![CDATA[%3E.txt]]></f>

That should let the € be transmitted correctly in the XML.

Luminary・发光体
#4 · 2019-06-08 12:59

If I remember correctly, the XmlDocument.Load(string) method always assumes UTF-8, regardless of the encoding declared in the XML.

You would have to create a StreamReader with the correct encoding and use that as the parameter.

// Decode the file explicitly as ISO-8859-15 and close it when done.
using (var reader = new StreamReader("file.xml", Encoding.GetEncoding("iso-8859-15")))
{
    xmlDoc.Load(reader);
}
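To illustrate why the decoder's encoding matters for this particular character, here is a small sketch that decodes the single byte 0xA4 (the euro sign as stored in an ISO-8859-15 file) under three encodings. It assumes .NET Core 3.0 or later, where ISO-8859-15 is only available after registering CodePagesEncodingProvider; on the .NET Framework the registration line can be dropped. The class name EuroByteDemo is made up for the example.

```csharp
using System;
using System.Text;

static class EuroByteDemo
{
    static void Main()
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where legacy code pages
        // such as ISO-8859-15 must be registered before use.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // The euro sign as it is stored in an ISO-8859-15 file.
        byte[] euroByte = { 0xA4 };

        foreach (string name in new[] { "iso-8859-15", "iso-8859-1", "utf-8" })
        {
            // Decode the same byte with each encoding and show the resulting code unit.
            string decoded = Encoding.GetEncoding(name).GetString(euroByte);
            Console.WriteLine("{0,-12} -> U+{1:X4}", name, (int)decoded[0]);
        }
    }
}
```

Only the ISO-8859-15 decoder should yield U+20AC. ISO-8859-1 maps the same byte to U+00A4, and UTF-8 treats a lone 0xA4 as invalid and substitutes U+FFFD, so a reader using the wrong encoding produces a file name that does not exist on disk.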

EDIT:

I just stumbled across KB308061 from Microsoft. There's an interesting passage:

Specify the encoding declaration in the XML declaration section of the XML document. For example, the following declaration indicates that the document is in UTF-16 Unicode encoding format:

<?xml version="1.0" encoding="UTF-16"?>

Note that this declaration only specifies the encoding format of an XML document and does not modify or control the actual encoding format of the data.
