This has been confusing me for some time. With the advent of UTF-8 as the de-facto standard in web development I'm not sure in which situations I'm supposed to use the html entities and for which ones should I just use the UTF-8 character.
Examples: em dash, ampersand, etc.
Please do shed light on this issue. It will be appreciated.
Entities may buy you some compatibility with brain-dead clients that don't understand encodings correctly. I don't believe that includes any current browsers, but you never know what other kinds of programs might be hitting you up.
More useful, though, is that HTML entities protect you from your own errors: if you misconfigure something on the server and you end up serving a page with an HTTP header that says it's
ISO-8859-1
and aMETA
tag that says it'sUTF-8
, at least your —es will always work.If your pages are correctly encoded in utf-8 you should have no need for html entities, just use the characters you want directly.
I would not use UTF-8 for characters that are easily confused visually. For example, it is difficult to distinguish an emdash from a minus, or especially a non-breaking space from a space. For these characters, definitely use entities.
For characters that are easily understood visually (such as the chinese examples above), go ahead and use UTF-8 if you like.
Based on the comments I have received, I looked into this a little further. It seems that currently the best practice is to forgo using HTML entities and use the actual UTF-8 character instead. The reasons listed are as follows:
As long as your page's encoding is properly set to UTF-8, you should use the actual character instead of an HTML entity. I read several documents about this topic, but the most helpful were:
From the UTF-8: The Secret of Character Encoding article:
That article also gives a nice example involving Chinese encoding. Here is the abbreviated example for the sake of laziness:
UTF-8:
這兩個字是甚麼意思
HTML Entities:
這兩個字是甚麼意思
The UTF-8 and HTML entity encodings are both meaningless to me, but at least the UTF-8 encoding is recognizable as a foreign language, and it will render properly in an edit box. The article goes on to say the following about the HTML entity-encoded version:
As others have noted, you still have to use HTML entities for reserved XML characters (ampersand, less-than, greater-than).
Personally I do everything in utf-8 since a long time, however, in an html page, you always need to convert ampersands (&), greater than (>) and lesser then (<) characters to their equivalent entities, &, > and <
Also, if you intend on doing some programming using utf-8 text, there are a few thing to watch for.
HTML entities are useful when you want to generate content that is going to be included (dynamically) into pages with (several) different encodings. For example, we have white label content that is included both into ISO-8859-1 and UTF-8 encoded web pages...
If character set conversion from/to UTF-8 wasn't such a big unreliable mess (you always stumble over some characters and some tools that don't convert properly), standardizing on UTF-8 would be the way to go.