I'm reading an HTML document that contains UTF-8 characters, but when I access the innerHTML of the document, all the "bad" characters show up as 0xFFFD. I've tried all the major browsers and they behave the same way. When I alert() the innerHTML, those characters appear as a diamond with a question mark.
Surprisingly, the following works perfectly, correctly displaying the UTF-8 character in the alert box, so it's not alert() that is malfunctioning:
alert("Doppelg\u00e4nger!");
Why can't I access the UTF-8 characters through innerHTML? Or is there another way to access them in JavaScript?
First, check whether the document header contains:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
You can also read out the meta tags with JavaScript:
var metaTags = document.getElementsByTagName("META");
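From there you could inspect each tag's content attribute for the declared charset. A small sketch; the helper name charsetFromContent is my own, not a standard API:

```javascript
// Hypothetical helper: pull the charset out of a Content-Type-style
// value, e.g. "text/html; charset=UTF-8" -> "utf-8".
function charsetFromContent(content) {
    var m = /charset=([^;\s]+)/i.exec(content);
    return m ? m[1].toLowerCase() : null;
}

// In the browser you might use it like this:
// var metaTags = document.getElementsByTagName("META");
// for (var i = 0; i < metaTags.length; i++) {
//     if (/content-type/i.test(metaTags[i].httpEquiv)) {
//         alert(charsetFromContent(metaTags[i].content));
//     }
// }
```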
If it does, this is the explanation of the behavior: the file is probably not actually saved as UTF-8, so its invalid byte sequences are decoded to the replacement character. You can try changing UTF-8 to ISO-8859-1:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
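The 0xFFFD characters themselves are the clue: U+FFFD is the Unicode replacement character, which decoders substitute for byte sequences that are not valid UTF-8. If the file was saved as ISO-8859-1 but declared as UTF-8, every extended character decodes to it. A quick Node.js sketch of that mismatch (the browser's parser behaves the same way on the page bytes):

```javascript
// "ä" is the single byte 0xE4 in ISO-8859-1, but 0xE4 on its own
// is an invalid UTF-8 sequence, so decoding it as UTF-8 yields
// the replacement character U+FFFD -- the "diamond with a ? mark".
var latin1Bytes = Buffer.from([0xE4]);
var decodedAsUtf8 = latin1Bytes.toString('utf8');
console.log(decodedAsUtf8 === '\ufffd'); // true
```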
A better approach is to HTML-encode all extended characters in your HTML, like this:
function encodeHTML(str) {
    var aStr = str.split(''),
        i = aStr.length,
        aRet = [];
    while (i--) { // note: i--, not --i, or the first character is skipped
        var iC = aStr[i].charCodeAt(0);
        if (iC < 65 || iC > 127 || (iC > 90 && iC < 97)) {
            // not an ASCII letter: emit a numeric character reference
            aRet.push('&#' + iC + ';');
        } else {
            aRet.push(aStr[i]);
        }
    }
    return aRet.reverse().join('');
}
Mind you, this function will encode everything that is not [a-zA-Z]. It will encode Doppelgänger as Doppelg&#228;nger, for example.
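The same idea can be written more compactly with a regular-expression replace; this variant is my own sketch, not part of the answer above, and encodes exactly the characters outside [a-zA-Z]:

```javascript
function encodeHTML(str) {
    // Replace every character that is not an ASCII letter
    // with its numeric character reference.
    return str.replace(/[^a-zA-Z]/g, function (c) {
        return '&#' + c.charCodeAt(0) + ';';
    });
}

console.log(encodeHTML("Doppelgänger!")); // "Doppelg&#228;nger&#33;"
```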
Is the page sent with a UTF-8 charset? .innerHTML has never given me any trouble with UTF-8.