I am trying to read an XML stream and load it into a collection.
This works, but I'm having difficulty reading special characters.
E.g. if my XML looks like this:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<persons>
<person>
<firstname>
<![CDATA[ Sébastien ]]>
</firstname>
<lastname>
<![CDATA[Ørvåk]]>
</lastname>
</person>
</persons>
I try to read the values using LINQ like this:
var persons = from p in doc.Elements("persons").Elements("person") select p;
foreach (var person in persons)
{
    string firstname = person.Element("firstname").Value;
    string lastname = person.Element("lastname").Value;
}
but the Ø and å in Ørvåk and the é in Sébastien come out as strange characters.
Does anyone know what's wrong? I guess it doesn't use the ISO-8859-1 encoding.
Thanks
It's possible the file isn't in ISO-8859-1 but is in UTF-8. Can you provide a hex dump of the contents? Sometimes the writer of an XML file isn't careful about the encoding string.
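One way to produce such a hex dump from C# is sketched below. The file name persons.xml is an assumption for illustration; pass the real path as the first argument. In ISO-8859-1 the é is the single byte E9, while in UTF-8 it is the two bytes C3 A9, so the dump shows which encoding the file really uses.

```csharp
using System;
using System.IO;

class HexDump
{
    static void Main(string[] args)
    {
        // Hypothetical default path; pass the real file as the first argument.
        string path = args.Length > 0 ? args[0] : "persons.xml";

        // Read the file as raw bytes so that no decoding takes place.
        byte[] bytes = File.ReadAllBytes(path);

        // Print the bytes as hyphen-separated hex pairs.
        // Look for E9 (ISO-8859-1 "é") versus C3-A9 (UTF-8 "é").
        Console.WriteLine(BitConverter.ToString(bytes));
    }
}
```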
Also, it could be that the XML document comes via HTTP, and the HTTP headers declare the encoding improperly. Section 4.3.3 in the XML specification states that MIME rules override what the document itself states.
If you point your own code at the link instead of your local copy, it could mean your local web server isn't configured properly...
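To see what the server declares, a small sketch like the following prints the Content-Type header of the response. The URL is a placeholder, not the asker's real address; substitute whatever your program actually downloads.

```csharp
using System;
using System.Net;

class ShowContentType
{
    static void Main()
    {
        // Hypothetical URL, used here only for illustration.
        string url = "http://localhost/persons.xml";

        using (var client = new WebClient())
        {
            // Download as bytes; we only care about the headers here.
            client.DownloadData(url);

            // ResponseHeaders is populated once the request has completed.
            string contentType = client.ResponseHeaders[HttpResponseHeader.ContentType];
            Console.WriteLine(contentType ?? "(no Content-Type header)");
        }
    }
}
```

A value such as "text/xml; charset=utf-8" for a file that is really ISO-8859-1 would point to a server configuration problem rather than a problem in the file.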
The XML file you mentioned in your follow-up is perfectly correct. So, your bug is specific to your Javascript code.
To expand on an answer someone else gave:
There are two possibilities:
1. The file is UTF-8, but is being interpreted by your XML parser as ISO-8859-1.
2. The file is ISO-8859-1, but is being interpreted by your XML parser as UTF-8.

To determine which is which, look at what happens with the é in Sébastien. There are two possibilities I can imagine:

1. "é" becomes two different characters - probably "Ã©".
2. "é" becomes a single nonsense character or "?", and possibly the "b" is also missing from the name Sébastien.

In the first case, your file is not what you think it is. (It is getting to your program as UTF-8 data, but your program is trying to interpret it as ISO-8859-1.) Look at the XML file with a hex editor or something else that can show you what the bytes on the disk are.

In the second case, I'd check how the HTTP server on localhost is serving this file. (Your program is getting bytes in ISO-8859-1 format, but is interpreting them as UTF-8.) The easiest way to do that on Windows is to open up a cmd prompt and run the command: telnet localhost 80

When that pops up a window, type the following line (or cut-and-paste from stackoverflow) and press Enter twice. Warning: You won't be able to see what you're typing, and capitalization is important.

In the response, look for a line beginning with Content-Type. That will tell you how the local web server is serving up the file.

Update: Having looked at your file, it really is ISO-8859-1, so what I would suggest is setting the Encoding property of your WebClient instance like so before you tell it to download the file:

Alternatively, you could use the DownloadData method instead of the DownloadString method, and then parse the bytes into an XML document. The problem currently is that by the time the XML parser gets the file contents, the bytes have already been decoded into a string, so it's too late to change the encoding there.
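The two fixes described above might look roughly like this. It is a sketch, not the asker's actual code: the URL and class name are placeholders, and the element names are taken from the sample XML in the question.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Xml.Linq;

class DownloadPersons
{
    static void Main()
    {
        // Hypothetical URL, used here only for illustration.
        string url = "http://localhost/persons.xml";

        using (var client = new WebClient())
        {
            // Option 1: tell WebClient how to decode the bytes
            // before calling DownloadString.
            client.Encoding = Encoding.GetEncoding("ISO-8859-1");
            XDocument fromString = XDocument.Parse(client.DownloadString(url));

            // Option 2: download raw bytes and let the XML parser
            // pick the encoding from the XML declaration itself.
            byte[] raw = client.DownloadData(url);
            XDocument fromBytes;
            using (var stream = new MemoryStream(raw))
            {
                fromBytes = XDocument.Load(stream);
            }

            // Both documents should now contain the same, correctly decoded names.
            foreach (XElement person in fromBytes.Elements("persons").Elements("person"))
            {
                string firstname = person.Element("firstname").Value.Trim();
                string lastname = person.Element("lastname").Value.Trim();
                Console.WriteLine(firstname + " " + lastname);
            }
        }
    }
}
```

Option 2 is usually the more robust of the two, since it keeps working if the file's declared encoding ever changes.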