Short story: I can't get an entity like '𠂉' to store in a MySQL database, either by using a text field in a Ruby on Rails app (with default UTF-8 encoding) or by inputting it directly with a MySQL GUI app.
As far as I can tell, all Chinese characters and radicals can be entered into the database without problem, but not these rarely typed 'character components.' The character mentioned above is unicode U+20089 and html entity 𠂉
I can get it to display on the page by entering <html>𠂉</html>
and removing html escaping, but I would like to store it simply as the unicode character and keep the html escaping in place. There are many other Chinese 'components' (parts of full characters, generally consisting of 2 or 3 strokes) that cause the same problem.
According to this page, the character mentioned is in the UTF-8 charset: http://www.fileformat.info/info/unicode/char/20089/charset_support.htm
But on the neighboring '...20089/index.htm' page, there's an alert saying it's not a valid unicode character.
For reference, that entity can be found in Mac OS X by searching through the character palette (international menu, "Show Character Palette"), searching by radical, and looking under the '丿' radical.
Apologies if this is too open-ended... can a character like this be stored in a UTF-8-based database? How is this character both supported and unsupported, both present in the character set and not valid?
what if you double encode it and store ?
get it encoded once again and stored. and later upon retrieval decode it once and render in html.
I can't answer the question of it being listed as both supported and unsupported, that's probably a question for the people running the fileformat.info site.
UTF-8 can be used to represent any Unicode character (code point). This is true of all of the UTFs. The number of bytes required to do so varies (in UTF-8, you need four for the code point you identified, for instance, whereas you only need one for the Roman letter 'A'), but all Unicode characters can be represented by all UTFs. That's what they're for. (More here.)
It sounds as though you're running into an encoding problem at one (or more) of the layers in your app. The first place to look would be the page served by your app: Does it say what charset it's using? It may be worth checking the headers being returned for your pages to see if they have:
...in them. If they don't, look for the equivalent
meta
tag in the HTML itself, though I seem to recall reading thatmeta
isn't a good way to do this. Absent the headers being specific, the default applied will probably be ISO-8859-1 (though some browsers may use Windows-1252 instead), which won't work if your source text is encoded with UTF-8.The next place to look is your database. I don't think MySQL stores text in UTF-8 by default, you'll need to ensure that it's doing that in your MySQL configuration.
From your question, I don't think you need it, but I'll finish with the obligatory plug for the article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky (if only to save someone from plugging it in a comment). :-)
Which version of MySQL are you using? If it's before 5.5, you can't store that character because it would take four bytes and MySQL only supports up to three bytes UTF-8 (i.e., characters in the BMP). MySQL 5.5 added support for four-byte UTF-8, but you have to specify
utf8mb4
as the Character Set.ref: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode.html
U+20089 is a defined character in the Unicode set (CJK Unified Ideographs Extension B) and becomes the byte sequence
F0 A0 82 89
when encoded as UTF-8. The problem is probably not with the character, but character handling by the software somewhere in your stack.In the unlikely event that there is an inherent technical reason for this being a problem character, it is likely to be covered in the Unicode standard or in the FAQs.