understanding decode() and encode() unicode [dupli

Posted 2019-02-26 02:19

Question:

I just can't understand how the decode() and encode() methods work in Python 2.7.

I tried the following statements:

>>> s = u'abcd'
>>> s.encode('utf8')
'abcd'
>>> s.encode('utf16')
'\xff\xfea\x00b\x00c\x00d\x00'
>>> s.encode('utf32')
'\xff\xfe\x00\x00a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00d\x00\x00\x00'

Up to here, I think it's clear: encode() translates a unicode string into the corresponding UTF-8/16/32 byte string.

But when I write:

>>> s.decode('utf8')
u'abcd'
>>> s.decode('utf16')
u'\u6261\u6463'
>>> s.decode('utf32')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_32.py", line 11, in decode
    return codecs.utf_32_decode(input, errors, True)
UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3: codepoint not in range(0x110000)

What is the meaning of decode() on a unicode object? Why does the first call (with utf8) work while the other two do not? Is it because Python internally stores unicode strings using UTF-8?

One last thing:

>>> s2 = '≈'
>>> s2
'\xe2\x89\x88'

What happens under the hood? '≈' is not an ASCII character, so does Python convert it implicitly using the encoding that sys.getfilesystemencoding() returns?

Answer 1:

You are calling decode() on a unicode string. Python helpfully first encodes the string using the default ASCII codec so that you have actual bytes to decode. You cannot decode Unicode data itself; it is already decoded.

The bytestring 'abcd' is decodable as UTF-8 because ASCII is a subset of UTF-8; encoding to ASCII and then decoding as UTF-8 produces the same information. Decoding as UTF-16 happened to work by chance: you provided 4 bytes with hex values 0x61, 0x62, 0x63 and 0x64 (the ASCII values for the characters abcd), and those bytes can be decoded as UTF-16 little endian, giving \u6261 and \u6463. The UTF-32 decoding fails because those 4 bytes are not valid UTF-32 data: read as a single little-endian 32-bit value they give 0x64636261, which lies beyond the highest Unicode code point, 0x10FFFF.
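The two-step behaviour described above can be sketched in Python 3, where the implicit ASCII encode has to be written out by hand. This is only an illustration under stated assumptions: py2_style_decode is a hypothetical helper name, not a real API, and the explicit little-endian codec names are chosen so the result does not depend on the machine's byte order.

```python
def py2_style_decode(text, encoding):
    """Hypothetical helper: mimic what Python 2 did for unicode.decode()."""
    raw = text.encode("ascii")    # the implicit step Python 2 performed silently
    return raw.decode(encoding)   # the decode the caller actually asked for


as_utf8 = py2_style_decode("abcd", "utf-8")       # ASCII bytes are valid UTF-8
as_utf16 = py2_style_decode("abcd", "utf-16-le")  # 2 bytes per code unit: 0x6261, 0x6463

try:
    py2_style_decode("abcd", "utf-32-le")         # 4 bytes -> 0x64636261, too large
    utf32_error = None
except UnicodeDecodeError as exc:
    utf32_error = exc

print(repr(as_utf8))
print(ascii(as_utf16))
print(utf32_error)
```

The UTF-16 result is two unrelated CJK characters, which shows that a decode that "works" can still be meaningless.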

If s contains data that cannot be encoded to ASCII first, you'll get a UnicodeEncodeError exception; note the Encode in that name:

>>> u'åßç'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

because the implicit encoding to a bytestring failed.
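To make it visible which of the two steps fails, here is a small Python 3 sketch that writes the implicit encode explicitly and records which exception is raised. The variable names are illustrative only.

```python
text = "\u00e5\u00df\u00e7"  # the characters åßç, written as escapes

try:
    # Step 1 (encode to ASCII) is what Python 2 did implicitly; step 2 is the requested decode.
    text.encode("ascii").decode("utf-8")
    failed_step = None
    failed_codec = None
except UnicodeEncodeError as exc:
    failed_step = "encode"
    failed_codec = exc.encoding
except UnicodeDecodeError as exc:
    failed_step = "decode"
    failed_codec = exc.encoding

print(failed_step, failed_codec)
```

The error comes from the ASCII encode step, so the UTF-8 decode is never reached.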

In Python 3, unicode objects have been renamed to str, and the str.decode() method has been removed from the type to prevent this kind of confusion. Only str.encode() remains. The Python 2 str type has been replaced by the bytes type, which only has a bytes.decode() method.
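A quick way to check this split in a Python 3 interpreter is to probe the two types with hasattr; a minimal sketch:

```python
text = "abcd"                  # Python 3 str: decoded Unicode text
data = text.encode("utf-8")    # bytes: encoded data

str_has_decode = hasattr(text, "decode")    # text cannot be decoded again
bytes_has_encode = hasattr(data, "encode")  # bytes cannot be encoded again
round_trip = data.decode("utf-8")           # the only direction left for bytes

print(str_has_decode, bytes_has_encode, round_trip == text)
```

Calling text.decode('utf-8') in Python 3 therefore raises AttributeError instead of silently encoding to ASCII first.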

Your second example shows that you are using the Python interpreter interactively in a terminal or console. Python received your input from the terminal as UTF-8 bytes and stored those bytes in a bytestring. Had you used a unicode literal, Python would have automatically decoded those bytes using the encoding declared for your terminal; you can introspect sys.stdin.encoding to see what Python detected:

>>> import sys
>>> sys.stdin.encoding
'UTF-8'
>>> s = '≈'
>>> s
'\xe2\x89\x88'
>>> s = u'≈'
>>> s
u'\u2248'
>>> print s
≈

Conversely, when printing, the sys.stdout.encoding codec is used to auto-encode Unicode strings to the bytes your terminal expects; the terminal then interprets those bytes again to display the right glyphs on your screen.
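The round trip between terminal bytes and Unicode text can be sketched by hand in Python 3. The encoding is hard-coded to UTF-8 here as an assumption, to match the session shown above; a real interpreter would consult sys.stdin.encoding and sys.stdout.encoding instead.

```python
import sys

terminal_encoding = "utf-8"        # assumption: matches the 'UTF-8' reported above
typed_bytes = b"\xe2\x89\x88"      # what the terminal sends for the character ≈

text = typed_bytes.decode(terminal_encoding)     # what a unicode literal would hold
printed_bytes = text.encode(terminal_encoding)   # what print sends back to the terminal

print(ascii(text))
print(printed_bytes == typed_bytes)

# For comparison, the encodings this interpreter actually detected (may be None):
print(getattr(sys.stdin, "encoding", None), getattr(sys.stdout, "encoding", None))
```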

If you are not working in the Python interactive interpreter but are instead working with a Python source file, the codec to use is determined by the PEP 263 source code encoding declaration, as Python 2 otherwise defaults to decoding bytes as ASCII.
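The effect of a PEP 263 declaration can be tried without creating files by compiling source code supplied as bytes. This is a Python 3 sketch; note that Python 3 defaults to UTF-8 rather than ASCII when no declaration is present. The two source snippets and the "<demo>" filename are made up for illustration.

```python
# The same character (é) stored as one Latin-1 byte and as two UTF-8 bytes.
latin1_source = b"# -*- coding: latin-1 -*-\nvalue = u'\xe9'\n"
utf8_source = b"# -*- coding: utf-8 -*-\nvalue = u'\xc3\xa9'\n"

results = []
for source in (latin1_source, utf8_source):
    namespace = {}
    # compile() honours the coding declaration when the source is given as bytes
    exec(compile(source, "<demo>", "exec"), namespace)
    results.append(namespace["value"])

print([ascii(value) for value in results])
```

Both snippets end up with the same one-character string, because each declaration tells Python how to decode the literal's bytes.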

sys.getfilesystemencoding() has nothing to do with all this; it tells you what Python thinks your file system metadata is encoded with, e.g. the filenames in directories. The value is used when you pass unicode paths to filesystem-related calls like os.listdir().
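Where the filesystem encoding does come into play can be sketched in Python 3 by listing a directory once with a text path and once with a bytes path. The temporary directory and the file name example.txt are made up for this illustration; os.fsencode() and os.fsdecode() convert using the filesystem encoding.

```python
import os
import sys
import tempfile

fs_encoding = sys.getfilesystemencoding()

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file so there is something to list.
    open(os.path.join(tmp, "example.txt"), "w").close()

    text_names = os.listdir(tmp)                # text path in, text names out
    byte_names = os.listdir(os.fsencode(tmp))   # bytes path in, bytes names out

decoded_names = [os.fsdecode(name) for name in byte_names]

print(fs_encoding)
print(text_names, byte_names, decoded_names)
```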