I'm really confused. I tried to encode but the error said can't decode...
>>> "你好".encode("utf8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
I know how to avoid the error with the "u" prefix on the string. I'm just wondering why the error says "can't decode" when encode was called. What is Python doing under the hood?
You can try this
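Presumably the well-known (and widely discouraged) Python 2 hack of changing the interpreter's default encoding, roughly:

import sys
reload(sys)                      # sys.setdefaultencoding is removed at startup; reload(sys) brings it back
sys.setdefaultencoding("utf-8")  # implicit str<->unicode conversions now use UTF-8 instead of ASCII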
Or you can also try the following: add this line at the top of your .py file.
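This is presumably the PEP 263 source-encoding declaration, which tells Python how to decode the bytes of the source file itself (shown here assuming you want UTF-8 source):

# -*- coding: utf-8 -*-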
If you're using Python < 3, you'll need to tell the interpreter that your string literal is Unicode by prefixing it with a u:
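For example, in a Python 2 session (the result is the UTF-8 byte sequence for 你好):

>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'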
Further reading: Unicode HOWTO.
In case you're dealing with Unicode, sometimes instead of encode('utf-8') you can also try to ignore the special characters, e.g. with

something.decode('unicode_escape').encode('ascii', 'ignore')

as suggested here. This is not particularly useful in this example, but it can work better in other scenarios where it's not possible to convert some special characters.

Alternatively, you can consider replacing a particular character using replace().
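A small sketch of both options, using a hypothetical unicode value text (Python 2 syntax):

# -*- coding: utf-8 -*-
text = u"你好, world"                          # hypothetical example value
ascii_only = text.encode('ascii', 'ignore')    # non-ASCII characters are dropped: ', world'
replaced = text.replace(u"你好", u"hello")     # or swap specific characters/substrings instead
print(ascii_only)                              # ', world'
print(replaced)                                # hello, world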
Always encode from unicode to bytes.
In this direction, you get to choose the encoding.
The other way is to decode from bytes to unicode.
In this direction, you have to know what the encoding is.
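For example, in Python 2 (the euro sign is used purely as an illustration; its UTF-8 encoding is the three bytes shown):

>>> u"€".encode("utf-8")              # unicode -> bytes: you choose the encoding
'\xe2\x82\xac'
>>> '\xe2\x82\xac'.decode("utf-8")    # bytes -> unicode: you must know the bytes are UTF-8
u'\u20ac'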
This point can't be stressed enough. If you want to avoid playing unicode "whack-a-mole", it's important to understand what's happening at the data level. Here it is explained another way:

- A unicode object is already decoded; you never want to call decode on it.
- A bytestring object is already encoded; you never want to call encode on it.

Now, on seeing .encode on a byte string, Python 2 first tries to implicitly convert it to text (a unicode object). Similarly, on seeing .decode on a unicode string, Python 2 implicitly tries to convert it to bytes (a str object).
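A quick check in a Python 2 interpreter shows the two kinds of object:

>>> type(u"你好")      # text, already decoded
<type 'unicode'>
>>> type("你好")       # bytes, already encoded
<type 'str'>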
These implicit conversions are why you can get UnicodeDecodeError when you've called encode. It's because encoding usually accepts a parameter of type unicode; when receiving a str parameter, there's an implicit decoding into an object of type unicode before re-encoding it with another encoding. This conversion chooses a default 'ascii' decoder†, giving you the decoding error inside an encoder.
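Concretely, the failing call from the question behaves roughly like an explicit ASCII decode followed by the requested encode (a sketch of a Python 2 session, assuming the literal's bytes are UTF-8):

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> "你好".encode("utf8")                   # implicitly runs "你好".decode('ascii') first, which fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> "你好".decode("utf8").encode("utf8")    # decode explicitly with the right codec, then encode
'\xe4\xbd\xa0\xe5\xa5\xbd'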
In fact, in Python 3 the methods str.decode and bytes.encode don't even exist. Their removal was a [controversial] attempt to avoid this common confusion.

† ...or whatever coding sys.getdefaultencoding() mentions; usually this is 'ascii'.

If you are starting the Python interpreter from a shell on Linux or similar systems (BSD, not sure about Mac), you should also check the default encoding for the shell.
Call locale charmap from the shell (not the Python interpreter) and you should see UTF-8. If this is not the case, and you see something else, e.g. ANSI_X3.4-1968 (plain ASCII), then
Python will (at least in some cases such as in mine) inherit the shell's encoding and will not be able to print (some? all?) unicode characters. Python's own default encoding, which you see and control via sys.getdefaultencoding() and sys.setdefaultencoding(), is in this case ignored.

If you find that you have this problem, you can fix that by setting a UTF-8 locale in the shell:
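Presumably something along these lines (LC_CTYPE is only one common choice; LANG or LC_ALL also work, and the en_EN name matches the note that follows):

export LC_CTYPE="en_EN.UTF-8"   # or whichever UTF-8 locale you prefer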
(Or alternatively choose whichever locale you want instead of en_EN.) You can also edit /etc/locale.conf (or whichever file governs the locale definition in your system) to correct this.
u"你好".encode('utf8')
to encode an unicode string. But if you want to represent"你好"
, you should decode it. Just like:You will get what you want. Maybe you should learn more about encode & decode.
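A sketch of that decoding step in a Python 2 session (the escapes in the result are the code points of 你 and 好):

>>> "你好".decode("utf8")
u'\u4f60\u597d'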