Why is `'↊'.isnumeric()` false?

Posted 2020-08-26 03:57

Question:

According to the Official Unicode Consortium code chart, all of these are numeric:

⅐   ⅑   ⅒   ⅓   ⅔   ⅕   ⅖   ⅗   ⅘   ⅙   ⅚   ⅛   ⅜   ⅝   ⅞   ⅟
Ⅰ   Ⅱ   Ⅲ   Ⅳ   Ⅴ   Ⅵ   Ⅶ   Ⅷ   Ⅸ   Ⅹ   Ⅺ   Ⅻ   Ⅼ   Ⅽ   Ⅾ   Ⅿ
ⅰ   ⅱ   ⅲ   ⅳ   ⅴ   ⅵ   ⅶ   ⅷ   ⅸ   ⅹ   ⅺ   ⅻ   ⅼ   ⅽ   ⅾ   ⅿ
ↀ   ↁ   ↂ   Ↄ   ↄ   ↅ   ↆ   ↇ   ↈ   ↉   ↊   ↋

However, when I ask Python to tell me which ones are numeric, they all are (even ) except for four:

In [252]: print([k for k in "⅐⅑⅒⅓⅔⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞⅟ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿↀↁↂↃↄↅↆↇↈ↉↊↋" if not k.isnumeric()])
['Ↄ', 'ↄ', '↊', '↋']

Those are:

  • Ↄ Roman Numeral Reversed One Hundred
  • ↄ Latin Small Letter Reversed C
  • ↊ Turned Digit Two
  • ↋ Turned Digit Three

Why does Python consider those to be not numeric?

Answer 1:

str.isnumeric is documented to be true for "all characters that have the Unicode numeric value property".

The canonical reference for that property is the Unicode Character Database. The information we need can be dug out of http://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt , which is the latest version at time of writing (late 2016). Be warned that it is a 1.5MB text file, and it's a little tricky to read; its format is documented in UAX#44. I'll first show its entry for a character that is numeric, U+3023 HANGZHOU NUMERAL THREE (〣):

3023;HANGZHOU NUMERAL THREE;Nl;0;L;;;;3;N;;;;;

The ninth semicolon-separated field (field 8 in UAX#44's zero-based numbering) is the "numeric value" property; in this case, its value is 3, consistent with the name of the character. For a single character, Python's str.isnumeric is true if and only if this property is nonempty. It can be interrogated directly using unicodedata.numeric.
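To make the field positions concrete, here is a minimal sketch that splits the record above on semicolons. The helper name parse_record and the dictionary keys are made up for illustration; only the field order comes from UAX#44. Python lists are zero-based, so the numeric value sits at index 8.

```python
# Sketch: split one UnicodeData.txt record into a few named fields.
# parse_record and its key names are hypothetical; the field order is from UAX#44.
line = "3023;HANGZHOU NUMERAL THREE;Nl;0;L;;;;3;N;;;;;"

def parse_record(record):
    fields = record.split(";")
    return {
        "code_point": int(fields[0], 16),         # hexadecimal code point
        "name": fields[1].strip(),
        "general_category": fields[2].strip(),
        "numeric_value": fields[8].strip(),       # empty string if no numeric value
    }

entry = parse_record(line)
print(entry)
print("has numeric value:", entry["numeric_value"] != "")
```

The .strip() calls are only there so the same helper also tolerates records that have been padded with spaces for alignment.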

The third semicolon-separated field is a two-character code giving the "general category"; in this case, "Nl". Most, but not all, of the characters with a numeric value are in one of the "number" categories (the first letter of the category code is N). The exceptions are all hanzi that, depending on context, may or may not signify a number; see UAX#38.
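The difference between "has a numeric value" and "is in a number category" can be seen from Python itself. The following is a small sketch using the standard unicodedata module; the sample characters are my own choice, with 三 (U+4E09) standing in for the hanzi case.

```python
import unicodedata

# Sample characters chosen for illustration: a Roman numeral, a vulgar
# fraction, a hanzi that can denote a number, and an ASCII digit.
samples = ["Ⅷ", "⅓", "三", "7"]

rows = []
for ch in samples:
    category = unicodedata.category(ch)       # two-letter general category
    value = unicodedata.numeric(ch, None)     # None if no numeric value
    rows.append((ch, category, value, ch.isnumeric()))
    print(f"U+{ord(ch):04X} category={category} numeric={value} isnumeric={ch.isnumeric()}")
```

Here 三 is reported with a letter category rather than a number category, yet it still has a numeric value, so isnumeric() is true for it.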

Now, the characters you are asking about:

2183;ROMAN NUMERAL REVERSED ONE HUNDRED;Lu;0;L ;;;;;N;;;    ;2184;
2184;LATIN SMALL LETTER REVERSED C     ;Ll;0;L ;;;;;N;;;2183;    ;2183
218A;TURNED DIGIT TWO                  ;So;0;ON;;;;;N;;;    ;    ;
218B;TURNED DIGIT THREE                ;So;0;ON;;;;;N;;;    ;    ;

These characters do not have a numeric value assigned, so Python's behavior is correct-as-documented.
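The same conclusion can be checked without reading UnicodeData.txt by hand. This is a sketch only; the output depends on the Unicode version bundled with the Python interpreter that runs it, so no expected output is shown.

```python
import unicodedata

# The four characters from the question.
chars = "Ↄↄ↊↋"

report = []
for ch in chars:
    name = unicodedata.name(ch, "<no name in this Unicode version>")
    value = unicodedata.numeric(ch, None)     # None means no numeric value
    report.append((ch, value, ch.isnumeric()))
    print(f"U+{ord(ch):04X} {name}: numeric={value!r}, isnumeric={ch.isnumeric()}")
```

Passing None as the second argument to unicodedata.numeric avoids the ValueError that is otherwise raised for a character without a numeric value.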

Note: per https://docs.python.org/3.6/whatsnew/3.6.html, Python will only be updated to Unicode 9.0.0 in the 3.6 release; however, AFAICT these characters have not changed in quite some time.
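To see which Unicode version a given interpreter was built against, the unicodedata module exposes it directly. A minimal sketch:

```python
import sys
import unicodedata

# Report the interpreter version and the Unicode database version it bundles.
python_version = sys.version.split()[0]
unicode_version = unicodedata.unidata_version
print("Python", python_version)
print("Unicode database", unicode_version)
```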

("Why don't these characters have a numeric value?" is a question that only the Unicode Consortium can answer definitively; if you are interested I suggest bringing it up on one of their mailing lists.)