Identifier normalization: Why is the micro sign converted into the Greek letter mu?

Question:

I just stumbled upon the following odd situation:

>>> class Test:
        µ = 'foo'

>>> Test.µ
'foo'
>>> getattr(Test, 'µ')
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    getattr(Test, 'µ')
AttributeError: type object 'Test' has no attribute 'µ'
>>> 'µ'.encode(), dir(Test)[-1].encode()
(b'\xc2\xb5', b'\xce\xbc')

The character I entered is always the µ sign on the keyboard, but for some reason it gets converted. Why does this happen?

Answer 1:

There are two different characters involved here. One is the MICRO SIGN, which is the one on the keyboard, and the other is GREEK SMALL LETTER MU.
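We can confirm that these really are two distinct code points with the unicodedata module (escapes are used here so the exact code points are unambiguous):

>>> import unicodedata
>>> unicodedata.name('\u00b5')
'MICRO SIGN'
>>> unicodedata.name('\u03bc')
'GREEK SMALL LETTER MU'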

To understand what’s going on, we should take a look at how Python defines identifiers in the language reference:

identifier   ::=  xid_start xid_continue*
id_start     ::=  <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property>
id_continue  ::=  <all characters in id_start, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property>
xid_start    ::=  <all characters in id_start whose NFKC normalization is in "id_start xid_continue*">
xid_continue ::=  <all characters in id_continue whose NFKC normalization is in "id_continue*">

Both our characters, MICRO SIGN and GREEK SMALL LETTER MU, are part of the Ll Unicode category (lowercase letters), so both of them can be used at any position in an identifier. Now note that the definition of identifier actually refers to xid_start and xid_continue, and those are defined as all characters in the respective non-x definition whose NFKC normalization results in a valid character sequence for an identifier.
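We can check those categories directly; both characters indeed report Ll:

>>> import unicodedata
>>> unicodedata.category('\u00b5'), unicodedata.category('\u03bc')
('Ll', 'Ll')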

Python apparently only cares about the normalized form of identifiers. This is confirmed a bit below:

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.

NFKC is a Unicode normalization form that, among other things, replaces characters with their compatibility equivalents. The MICRO SIGN's compatibility decomposition is GREEK SMALL LETTER MU, and that's exactly what's going on here.
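We can watch this happen with unicodedata.normalize:

>>> import unicodedata
>>> unicodedata.name(unicodedata.normalize('NFKC', '\u00b5'))
'GREEK SMALL LETTER MU'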

There are a lot of other characters that are also affected by this normalization. One other example is the OHM SIGN, which decomposes into GREEK CAPITAL LETTER OMEGA. Using that as an identifier gives a similar result, here shown using locals:

>>> Ω = 'bar'
>>> locals()['Ω']
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    locals()['Ω']
KeyError: 'Ω'
>>> [k for k, v in locals().items() if v == 'bar'][0].encode()
b'\xce\xa9'
>>> 'Ω'.encode()
b'\xe2\x84\xa6'

So in the end, this is just something that Python does. Unfortunately, there isn't really a good way to detect this behavior, which can lead to errors like the one shown above. Usually, when the identifier is only ever referred to as an identifier, i.e. it's used like a real variable or attribute, everything will be fine: the normalization runs every time, and the identifier is found.
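A quick way to see this, using exec and eval so that the exact code points in the source are explicit:

>>> exec('\u00b5 = "foo"')  # assignment spelled with MICRO SIGN
>>> eval('\u03bc')          # lookup spelled with GREEK SMALL LETTER MU
'foo'

Both spellings compile down to the same, normalized identifier.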

The only problem is with string-based access. Strings are just strings; no normalization happens on them (that would be a bad idea). And the two ways shown here, getattr and locals, both operate on dictionaries: getattr() accesses an object's attribute via the object's __dict__, and locals() returns a dictionary. And in dictionaries, keys can be any string, so it's perfectly fine to have a MICRO SIGN or an OHM SIGN in there.
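A dictionary happily keeps them apart, since the two strings are simply different keys:

>>> d = {'\u00b5': 1}  # key is MICRO SIGN
>>> '\u00b5' in d
True
>>> '\u03bc' in d      # GREEK SMALL LETTER MU is a different key
False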

In those cases, you need to remember to perform the normalization yourself. We can use unicodedata.normalize for this, which then also allows us to correctly get our value from inside locals() (or using getattr):

>>> import unicodedata
>>> normalized_ohm = unicodedata.normalize('NFKC', '\u2126')  # the OHM SIGN from above
>>> locals()[normalized_ohm]
'bar'
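The same normalization also fixes the getattr call from the question:

>>> getattr(Test, unicodedata.normalize('NFKC', '\u00b5'))
'foo'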


Answer 2:

What Python does here is based on Unicode Standard Annex #31:

Implementations that take normalization and case into account have two choices: to treat variants as equivalent, or to disallow variants.

The rest of the section gives further details, but basically, this means that if a language allows you to have an identifier named µ at all, it should treat the two µ characters MICRO SIGN and GREEK SMALL LETTER MU the same, and it should do so by treating them both as GREEK SMALL LETTER MU.
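In Python you can observe this choice right at the parser level; the AST already contains the normalized identifier (escapes used here to make the code points explicit):

>>> import ast
>>> ast.parse('\u00b5 = 1').body[0].targets[0].id == '\u03bc'
True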


Most other languages that allow non-ASCII identifiers follow the same standard;[1] only a few languages invented their own.[2] So, this rule has the advantage of being the same across a wide variety of languages (and potentially being supported by IDEs and other tools).

A case could be made that it doesn't work as well in a language as reflection-heavy as Python, where strings can be used as identifiers as easily as writing getattr(Test, 'µ'). But if you read the python-3000 mailing list discussions around PEP 3131, you'll find that the only options seriously considered were sticking with ASCII, UAX-31, or Java's minor variation on UAX-31; nobody wanted to invent a new standard just for Python.

The other way to solve this problem would be to add a collections.identifierdict type that's documented to apply the exact same rules for lookup that the compiler applies to identifiers in source, and to use that type in mappings intended to be used as namespaces (e.g., object, module, locals, class definitions). I vaguely remember someone suggesting that, but without any good motivating examples. If anyone thinks this is a good enough example to revive the idea, they could post it on bugs.python.org or the python-ideas list.
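A minimal sketch of what such a type could look like, assuming NFKC normalization of string keys is all that's needed (the name IdentifierDict is made up for illustration, and a real version would also need to cover get, pop, update, and friends):

import unicodedata

class IdentifierDict(dict):
    # Sketch only: normalize string keys with NFKC, mirroring what the
    # compiler does for identifiers in source code.
    @staticmethod
    def _norm(key):
        return unicodedata.normalize('NFKC', key) if isinstance(key, str) else key

    def __setitem__(self, key, value):
        super().__setitem__(self._norm(key), value)

    def __getitem__(self, key):
        return super().__getitem__(self._norm(key))

    def __contains__(self, key):
        return super().__contains__(self._norm(key))

With that, both spellings find the same entry:

>>> d = IdentifierDict()
>>> d['\u00b5'] = 'foo'  # MICRO SIGN key is stored under the NFKC form
>>> d['\u03bc']          # GREEK SMALL LETTER MU finds it
'foo'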


[1] Some languages, like ECMAScript and C#, use the "Java standard" instead, which is based on an early form of UAX-31 and adds some minor extensions, like ignoring RTL control codes, but that's close enough.

[2] For example, Julia allows Unicode currency and math symbols, and also has rules for mapping between LaTeX and Unicode identifiers, but they explicitly added rules to normalize ɛ and µ to the Greek letters…