In Unicode, why are there two representations for

Posted 2019-01-08 21:08

I was reading the specification of Unicode @ Wikipedia (Arabic Unicode) and I see that each of the Arabic digits has 2 Unicode code points. For example 1 is defined as U+0661 and as U+06F1.

Which one should I use?

4 answers
在下西门庆
#2 · 2019-01-08 21:09

Which code do you prefer for representing the number 4, U+0664 or U+06F4?

(٤ or ۴ )?

To be consistent, let this choice guide which codes you use for 1, 2, and the other duplicate codes.

神经病院院长
#3 · 2019-01-08 21:12

Well, they look like this: ١ and ۱, so I assume that it doesn't matter much. My guess would be that the same numeral has different Unicode code points depending on its location. Arabic does something similar with letters: they look different when they are the last letter of a word or when they stand alone.

Edit: I just noticed that the 4 looks different in the two sets: ٤ and ۴. I'm quite sure that in the Middle East (Jordan and Egypt), they use the first form (U+0664).

我只想做你的唯一
#4 · 2019-01-08 21:18

In general you should not hard-code such info in your application.

  • On Windows, you can use GetLocaleInfo with LOCALE_SNATIVEDIGITS.
  • On Mac, you can use CFNumberFormatterCopyProperty with kCFNumberFormatterZeroSymbol.
  • Or use a library such as ICU.

There are Arabic countries that don't use the Arabic-Indic digits by default. So there is no direct mapping saying Arabic -> Arabic-Indic digits.

And the user might have changed the defaults in the Control Panel anyway.
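
To illustrate why a single zero symbol is enough to identify a digit set: the ten decimal digits of each set occupy consecutive code points, so the other nine follow from the zero by offset. Below is a minimal Python sketch, assuming the zero digit has already been obtained from one of the APIs above. `localize_digits` is a hypothetical helper name, not part of any library.

```python
def localize_digits(text: str, zero: str) -> str:
    """Replace ASCII digits in `text` with the digit set that starts at `zero`."""
    # Decimal digits of one set are contiguous, so digit n is (zero + n).
    table = {ord("0") + n: chr(ord(zero) + n) for n in range(10)}
    return text.translate(table)


# The zero digit would normally come from the platform, not a literal.
arabic_indic = localize_digits("2019", "\u0660")  # uses U+0660..U+0669
extended = localize_digits("2019", "\u06F0")      # uses U+06F0..U+06F9
```

Non-digit characters pass through unchanged, so the same helper can be applied to an already formatted number string.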

我只想做你的唯一
#5 · 2019-01-08 21:20

According to the code charts, U+0660 .. U+0669 are ARABIC-INDIC DIGIT values 0 through 9, while U+06F0 .. U+06F9 are EXTENDED ARABIC-INDIC DIGIT values 0 through 9.
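
The character names and digit values can be checked without the code charts. A small sketch using Python's standard `unicodedata` module (`describe` is just an illustrative helper name):

```python
import unicodedata


def describe(code_point: int):
    """Return (Unicode character name, decimal digit value) for a code point."""
    ch = chr(code_point)
    return unicodedata.name(ch), unicodedata.decimal(ch)


# The two code points for "1" mentioned in the question.
for cp in (0x0661, 0x06F1):
    name, value = describe(cp)
    print(f"U+{cp:04X} {name} = {value}")
```

Both code points report the digit value 1; only the names (and typically the glyphs for some digits) differ.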

In the Unicode 3.0 book (5.2 is the current version, but these things don't change much once set), the U+066n series of glyphs are marked 'Arabic-Indic digits' and the U+06Fn series of glyphs are marked 'Eastern Arabic-Indic digits (Persian and Urdu)'. It also notes:

  • U+06F4 - 'different glyphs in Persian and Urdu'
  • U+06F5 - 'Persian and Urdu share glyph different from Arabic'
  • U+06F6 - 'Persian glyph different from Arabic'
  • U+06F7 - 'Urdu glyph different from Arabic'

For comparison:

  • U+066n: ٠١٢٣٤٥٦٧٨٩
  • U+06Fn: ۰۱۲۳۴۵۶۷۸۹

Or:

     U+066n    U+06Fn
0      ٠         ۰
1      ١         ۱
2      ٢         ۲
3      ٣         ۳
4      ٤         ۴
5      ٥         ۵
6      ٦         ۶
7      ٧         ۷
8      ٨         ۸
9      ٩         ۹

(Whether you can see any of those, and how clearly they are differentiated, may depend on your browser and the fonts installed on your machine as much as anything else. I can see the difference on 4 and 6 clearly; 5 looks much the same in both.)

Based on this information, if you are working with Arabic from the Middle East, use the U+066n series of digits; if you are working with Persian or Urdu, use the U+06Fn series of digits. As a Unicode application, you should accept either set of codes as valid digits (but you might look askance at a sequence that mixed the two sets of digits - or you might just leave well alone).
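
One possible way to accept either set while still noticing a mixed sequence is sketched below in Python. It is an illustration under stated assumptions rather than a definitive implementation: `parse_digits` is a hypothetical helper, and it identifies a digit set by the code point of that set's zero (the character's code point minus its digit value).

```python
import unicodedata
from typing import Tuple


def parse_digits(text: str) -> Tuple[int, bool]:
    """Return (value, mixed) for a non-empty string of decimal digits.

    `mixed` is True when the digits come from more than one digit set,
    for example U+066n together with U+06Fn.
    """
    if not text:
        raise ValueError("empty digit string")
    value = 0
    zeros = set()
    for ch in text:
        digit = unicodedata.decimal(ch)  # raises ValueError for non-digits
        value = value * 10 + digit
        zeros.add(ord(ch) - digit)       # code point of this set's zero
    return value, len(zeros) > 1


value, mixed = parse_digits("\u0661\u06F2")  # U+0661 followed by U+06F2
```

Here `value` is 12 and `mixed` is True, so the caller can decide whether to reject the input, warn, or leave it alone.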
