Are uppercase UTF-8 characters always the same number of bytes as their lowercase forms?

Posted 2020-07-08 06:40

Obviously it is true for the Latin alphabet. But I'm asking this in a conceptual sense, across languages and the Unicode spec.

Practically, this came up when comparing two strings. If you already know that they are not the same number of bytes, can you take that as a guarantee, across all languages, that they are not differently "cased" versions of the same string?

Tags: unicode utf-8 case-insensitive

2 Answers

淡お忘

Reply #2 · 2020-07-08 07:30

No.

Consider U+0069 "i", which is the single octet 0x69 in UTF-8. In Turkish and Azerbaijani, its uppercase form is U+0130 "İ", which encodes as the two-octet UTF-8 sequence C4 B0.

Obligatory note: case is locale-sensitive.
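To make this concrete, here is a small sketch that compares UTF-8 byte lengths before and after case conversion. It assumes Python 3, whose `str.upper()` and `str.lower()` apply Unicode's locale-independent mappings, so the Turkish "i" to "İ" mapping above does not show up; the helper name `utf8_len` and the `examples` list are made up for illustration.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


# (description, original, case-converted), using Python's
# locale-independent case mappings.
examples = [
    ("long s, uppercased", "\u017f", "\u017f".upper()),              # ſ -> S
    ("dotless i, uppercased", "\u0131", "\u0131".upper()),           # ı -> I
    ("Kelvin sign, lowercased", "\u212a", "\u212a".lower()),         # K -> k
    ("capital I with dot, lowercased", "\u0130", "\u0130".lower()),  # İ -> i + U+0307
]

for description, original, converted in examples:
    print(f"{description}: {utf8_len(original)} -> {utf8_len(converted)} bytes")
```

Each of these pairs changes byte length under case conversion even without any locale-specific rules, so a byte-length mismatch alone does not rule out a case-only difference.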


chillily

Reply #3 · 2020-07-08 07:32

There is no principle or invariant in the Unicode standard that guarantees this. I would be particularly concerned about accented capitals, where there may be a mismatch between precomposition and non-precomposition across cases. However, I can't cite an example of a problem for you.
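One candidate for such an example, offered as a sketch rather than a definitive answer: U+01F0 "ǰ" has no precomposed capital, so its full uppercase mapping is "J" followed by the combining caron U+030C. The snippet below assumes Python 3 and its `unicodedata` module; `utf8_len` and `caseless_equal` are hypothetical helpers, and the comparison follows the idea of a canonical caseless match (normalize, case-fold, normalize again), which may or may not be the notion of "same string" the question has in mind.

```python
import unicodedata


def utf8_len(text: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings ignoring case and canonical (de)composition."""
    def fold(s: str) -> str:
        # Decompose, case-fold, then decompose again, since folding
        # can produce text that is no longer normalized.
        return unicodedata.normalize(
            "NFD", unicodedata.normalize("NFD", s).casefold()
        )
    return fold(a) == fold(b)


# Precomposed lowercase letter whose uppercase form is not precomposed.
small_j_caron = "\u01f0"                # ǰ
upper_j_caron = small_j_caron.upper()   # J + U+030C (combining caron)

# Precomposed capital vs. decomposed lowercase of the same letter.
precomposed = "\u00c9"    # É
decomposed = "e\u0301"    # e + U+0301 (combining acute)

for left, right in [(small_j_caron, upper_j_caron), (precomposed, decomposed)]:
    print(utf8_len(left), utf8_len(right), caseless_equal(left, right))
```

In both pairs the byte lengths differ while the caseless comparison reports a match, so a length check is not a safe shortcut before a case-insensitive comparison.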
