I have an XML document file.xml
which is encoded in Iso-latin-15 (aka Iso-Latin-9)
<?xml version="1.0" encoding="iso-8859-15"?>
<root xmlns="http://stackoverflow.com/demo">
<f>€.txt</f>
</root>
From my favorite text editor, I can tell this file is correctly encoded in Iso-Latin-15 (it is not UTF-8).
My software is written in C# and wants to extract the element f
.
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load("file.xml");
In real life, I have a XMLResolver to set credentials. But basically, my code is as simple as that. The loading goes smoothly, I don't have any exception raised.
Now, my problem when I extract the value:
//xnsm is the XmlNameSpace manager
XmlNode n = xmlDoc.SelectSingleNode("//root/f", xnsm);
if (n != null)
String filename = n.InnerText;
The Visual Studio debugger displays filename = □.txt
It could only be a Visual Studio bug. Unfortunately File.Exists(filename)
returns false, whereas the file actually exist.
What's wrong?
Don't just use the debugger or the console to display the string as a string.
Instead, dump the contents of the string, one character at a time. For example:
foreach (char c in filename)
{
Console.WriteLine("{0}: {1:x4}", c, (int) c);
}
That will show you the real contents of the string, in terms of Unicode code points, instead of being constrained by what the current font can display.
Use the Unicode code charts to look up the characters specified.
If I remember correctly the XmlDocument.Load(string)
method always assumes UTF-8, regardless of the XML encoding.
You would have to create a StreamReader
with the correct encoding and use that as the parameter.
xmlDoc.Load(new StreamReader(
File.Open("file.xml"),
Encoding.GetEncoding("iso-8859-15")));
EDIT:
I just stumbled across KB308061 from Microsoft. There's an interesting passage:
Specify the encoding declaration in
the XML declaration section of the XML
document. For example, the following
declaration indicates that the
document is in UTF-16 Unicode encoding
format:
<?xml version="1.0" encoding="UTF-16"?>
Note that this declaration only
specifies the encoding format of an
XML document and does not modify or
control the actual encoding format of
the data.
Does your xml define its encoding correctly ? encoding="iso-8859-15" .. is that Iso-latin-15
Ideally, you should put your content inside a CDATA element .. so the xml would look like <f><![CDATA[€.txt]]></f>
Ideally, you should also escape all special characters with equivalent url-encoded (or http-encoded) values, because xml typically is for communicating through http.
I dont know the exact escape code for € .. but it would be something of this sort
<f><![CDATA[%3E.txt]]></f>
The above should make € be communicated correctly through the xml.