
Short story: I can't get an entity like '𠂉' to store in a MySQL database, either by using a text field in a Ruby on Rails app (with default UTF-8 encoding) or by inputting it directly with a MySQL GUI app.

As far as I can tell, all Chinese characters and radicals can be entered into the database without problem, but not these rarely typed 'character components.' The character mentioned above is Unicode U+20089, HTML entity &#x20089;. I can get it to display on the page by entering <html>&#x20089;</html> and removing HTML escaping, but I would like to store it simply as the Unicode character and keep the HTML escaping in place. There are many other Chinese 'components' (parts of full characters, generally consisting of 2 or 3 strokes) that cause the same problem.

According to this page, the character mentioned is in the UTF-8 charset: http://www.fileformat.info/info/unicode/char/20089/charset_support.htm

But on the neighboring '...20089/index.htm' page, there's an alert saying it's not a valid unicode character.

For reference, that entity can be found in Mac OS X by searching through the character palette (international menu, "Show Character Palette"), searching by radical, and looking under the '丿' radical.

Apologies if this is too open-ended... can a character like this be stored in a UTF-8-based database? How is this character both supported and unsupported, both present in the character set and not valid?

Tags: mysql ruby-on-rails unicode cjk utf8mb4

4 answers

叼着烟拽天下

#2 · 2019-02-11 00:45

What if you double-encode it before storing it?

Encode it a second time and store that. Later, on retrieval, decode it once and render it in the HTML.
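The answer does not say which encoding it means. One possible reading is to turn the character into a numeric character reference before storing it and to reverse that on the way out. A minimal Python sketch under that assumption follows; the helper name to_char_refs is hypothetical.

```python
import html


def to_char_refs(text):
    """Replace every non-ASCII character with a decimal numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# U+20089 becomes a pure-ASCII string that any MySQL character set can hold.
stored = to_char_refs("\U00020089")

# On retrieval, decode once to get the original character back.
restored = html.unescape(stored)

print(stored)
print(restored == "\U00020089")
```

With this approach the column holds markup rather than the character itself, so it does not meet the asker's goal of storing the plain Unicode character.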


Luminary・发光体

#3 · 2019-02-11 00:46

I can't answer the question of it being listed as both supported and unsupported; that's probably a question for the people running the fileformat.info site.

UTF-8 can be used to represent any Unicode character (code point). This is true of all of the UTFs. The number of bytes required to do so varies (in UTF-8, you need four for the code point you identified, for instance, whereas you only need one for the Roman letter 'A'), but all Unicode characters can be represented by all UTFs. That's what they're for. (More here.)
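As a rough illustration of how byte counts vary, the following Python sketch encodes 'A' and U+20089 in three UTFs and counts the bytes. The helper name encoded_sizes is made up for this example, and the little-endian variants are used only so that no byte order mark is included in the count.

```python
# Encodings to compare; the "-le" variants avoid a byte order mark.
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le"]


def encoded_sizes(ch):
    """Return {encoding name: number of bytes} for a single character."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


for label, ch in [("A", "A"), ("U+20089", "\U00020089")]:
    print(label, encoded_sizes(ch))
```

Every encoding succeeds for both characters; only the sizes differ. U+20089 needs four bytes in UTF-8 and a surrogate pair (also four bytes) in UTF-16.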

It sounds as though you're running into an encoding problem at one (or more) of the layers in your app. The first place to look would be the page served by your app: Does it say what charset it's using? It may be worth checking the headers being returned for your pages to see if they have:

Content-Type: text/html; charset="UTF-8"

...in them. If they don't, look for the equivalent meta tag in the HTML itself, though I seem to recall reading that meta isn't a good way to do this. Absent the headers being specific, the default applied will probably be ISO-8859-1 (though some browsers may use Windows-1252 instead), which won't work if your source text is encoded with UTF-8.
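To see why a wrong charset matters, this small Python sketch simulates a reader that decodes the UTF-8 bytes of U+20089 as ISO-8859-1. It only mimics the mismatch and is not what any particular browser does internally.

```python
original = "\U00020089"

# The bytes a server would send for a UTF-8 encoded page.
utf8_bytes = original.encode("utf-8")

# A client that assumes ISO-8859-1 turns each byte into its own character.
misread = utf8_bytes.decode("iso-8859-1")

# The damage is reversible only while the raw bytes are preserved.
recovered = misread.encode("iso-8859-1").decode("utf-8")

print(len(original), len(misread))
print(recovered == original)
```

One character becomes four unrelated ones, which is the typical symptom of a charset mismatch between layers.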

The next place to look is your database. I don't think MySQL stores text in UTF-8 by default, so you'll need to ensure that it's doing that in your MySQL configuration.

From your question, I don't think you need it, but I'll finish with the obligatory plug for the article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky (if only to save someone from plugging it in a comment). :-)


兄弟一词,经得起流年.

#4 · 2019-02-11 00:57

Which version of MySQL are you using? If it's before 5.5, you can't store that character because it would take four bytes and MySQL only supports UTF-8 sequences of up to three bytes (i.e., characters in the BMP). MySQL 5.5 added support for four-byte UTF-8, but you have to specify utf8mb4 as the character set.
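To check in advance whether a string would hit the three-byte limit, one option is to look for code points above U+FFFF, which are the ones outside the BMP. This is a Python sketch with hypothetical helper names, not part of MySQL or Rails.

```python
def chars_needing_utf8mb4(text):
    """Return the characters in text that lie outside the BMP (above U+FFFF)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def fits_three_byte_utf8(text):
    """True if every character can be encoded in at most three UTF-8 bytes."""
    return not chars_needing_utf8mb4(text)


print(fits_three_byte_utf8("\u4e3f"))      # the radical U+4E3F is in the BMP
print(fits_three_byte_utf8("\U00020089"))  # U+20089 is in Extension B
```

This would explain why ordinary Chinese characters and radicals are stored without trouble while the Extension B components are not.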

ref: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode.html

(This account has been banned)

#5 · 2019-02-11 01:07

U+20089 is a defined character in the Unicode set (CJK Unified Ideographs Extension B) and becomes the byte sequence F0 A0 82 89 when encoded as UTF-8. The problem is probably not with the character, but with character handling by the software somewhere in your stack.
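The byte sequence can be derived by hand from the four-byte UTF-8 bit layout (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx). The Python sketch below does that and compares the result with the built-in encoder; the function name utf8_encode_supplementary is made up for illustration.

```python
def utf8_encode_supplementary(code_point):
    """Encode a code point in U+10000..U+10FFFF as four UTF-8 bytes."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("expects a code point outside the BMP")
    return bytes([
        0xF0 | (code_point >> 18),           # lead byte: top 3 bits
        0x80 | ((code_point >> 12) & 0x3F),  # continuation: next 6 bits
        0x80 | ((code_point >> 6) & 0x3F),   # continuation: next 6 bits
        0x80 | (code_point & 0x3F),          # continuation: low 6 bits
    ])


manual = utf8_encode_supplementary(0x20089)
builtin = "\U00020089".encode("utf-8")

print(manual.hex(), builtin.hex())
```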

In the unlikely event that there is an inherent technical reason for this being a problem character, it is likely to be covered in the Unicode standard or in the FAQs.
