Using extended ASCII codes with Python

Posted 2019-05-20 06:35

Question:

I've created a dictionary in Python, but I'm having problems with extended ASCII codes.

The loop that creates the dictionary is below (ASCII numbers 128 to 164: é, à, etc.):

# extended ASCII codes
i = 128
while i <= 165 :
    dictionnary[chr(i)] = 'extended ascii'
    i = i + 1

But when I try to use the dictionary:

>>> dictionnary['è']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '\xc3\xa8'

I have # -*- coding: utf-8 -*- in the header of the Python script. I've tried encode, decode, etc., but the result is always wrong.

To understand what is happening, I tried:

>>> ord('é')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found

and

>>> ord(u'é')
233

I'm confused by ord(u'é'), because 'é' is number 130 in the extended ASCII table, not 233.

I understand that an extended ASCII character contains "two characters", but I don't understand how to solve the problem with the dictionary.

Thanks in advance! :-)

Answer 1:

In Python 2, use unichr instead of chr. The function chr produces a string containing a single byte, whereas unichr produces a string containing a single Unicode character. Then do lookups using Unicode strings too: d[u'é'], because d['é'] will look up the UTF-8 encoding of é.
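As a rough sketch of that suggestion, the dictionary could be built with Unicode keys as shown here. Two things in it are assumptions rather than part of the question: the try/except shim exists only so the snippet also runs on Python 3 (where chr already returns a Unicode character), and the range is widened to 128..255, because unichr works with Unicode code points, where é is 233 and è is 232, so neither falls inside 128..165. The name dictionary is just illustrative.

```python
# -*- coding: utf-8 -*-
# Sketch only: the keys are Unicode characters, not byte strings.
try:
    unichr            # exists on Python 2
except NameError:
    unichr = chr      # Python 3: chr already returns a Unicode character

dictionary = {}
# Assumption: range widened to 128..255 (the Latin-1 block of Unicode),
# since unichr(130) is a control character, not an accented letter.
for i in range(128, 256):
    dictionary[unichr(i)] = 'extended ascii'

e_acute = u'\xe9'   # same character as u'é'
e_grave = u'\xe8'   # same character as u'è'

code_point = ord(e_acute)        # Unicode code point of é
value = dictionary[e_grave]      # lookup with a Unicode key succeeds
print(code_point)
print(value)
```

The lookup works because both the key stored by unichr and the key used in the lookup are Unicode strings of length 1.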

You have 3 things in your code: a latin-1 encoded str, a utf-8 encoded str, and a unicode string. Getting it clear in your head which you've got at any point in time requires a lot of knowledge about how Python works and a decent understanding of Unicode and encodings.
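A short experiment can make the three representations easier to tell apart. This is only a sketch, written to run on both Python 2 and Python 3. It also assumes that the "extended ASCII table" in the question is code page 437, where é is byte 130; the question does not say which table was used.

```python
# -*- coding: utf-8 -*-
text = u'\xe9'                          # Unicode string: one character, é

latin1_bytes = text.encode('latin-1')   # one byte, 0xE9
utf8_bytes = text.encode('utf-8')       # two bytes, 0xC3 0xA9
cp437_bytes = text.encode('cp437')      # one byte, 0x82 (assumed code page)

# A Unicode string counts characters; encoded strings count bytes.
assert len(text) == 1
assert len(latin1_bytes) == 1
assert len(utf8_bytes) == 2

# The numeric value depends on which representation you look at.
assert ord(text) == 233                   # Unicode code point
assert bytearray(latin1_bytes)[0] == 233  # Latin-1 byte matches the code point
assert bytearray(cp437_bytes)[0] == 130   # the 130 from the question

# Decoding with the matching codec recovers the same Unicode string.
assert utf8_bytes.decode('utf-8') == text
assert cp437_bytes.decode('cp437') == text
```

Under that assumption, the 130 in the question is the code page 437 byte, the 233 from ord(u'é') is the Unicode code point, and the two-byte '\xc3\xa9' is the UTF-8 encoding that a plain 'é' literal produces in a UTF-8 source file on Python 2.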

No answer about encodings and Unicode is complete without a link to Joel Spolsky's article on the matter: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)