Evaluate UTF-8 literal escape sequences in a string

Posted 2020-02-10 08:14

Question:

I have a string of the form:

s = '\\xe2\\x99\\xac'

I would like to convert this to the character ♬ by evaluating the escape sequence. However, everything I've tried either results in an error or prints out garbage. How can I force Python to convert the escape sequence into a literal unicode character?

What I've read elsewhere suggests that the following line of code should do what I want, but it results in a UnicodeEncodeError.

print(bytes(s, 'utf-8').decode('unicode-escape'))

I also tried the following, which has the same result:

import codecs
print(codecs.getdecoder('unicode_escape')(s)[0])

Both of these approaches produce the string 'â\x99¬', which print is subsequently unable to handle.

In case it makes any difference the string is being read in from a UTF-8 encoded file and will ultimately be output to a different UTF-8 encoded file after processing.

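Answer 1:

.decode('unicode-escape') on its own gives you the str '\xe2\x99\xac': three characters with code points U+00E2, U+0099 and U+00AC, not the three UTF-8 bytes that encode ♬.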

>>> s = '\\xe2\\x99\\xac'
>>> s.encode().decode('unicode-escape')
'â\x99¬'
>>> _ == '\xe2\x99\xac'
True
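
To make the intermediate result easier to see, here is a small sketch that lists the code points of the decoded string. The variable names `decoded` and `code_points` are only illustrative and are not part of the original answer.

```python
s = '\\xe2\\x99\\xac'

# unicode-escape turns each \xNN escape into the character with code point NN
decoded = s.encode().decode('unicode-escape')

# Inspect the code points rather than printing the characters themselves
code_points = [hex(ord(ch)) for ch in decoded]
print(code_points)   # ['0xe2', '0x99', '0xac']
print(len(decoded))  # 3
```

The values match the UTF-8 bytes of ♬, but they are stored as three separate characters in a str, which is why a further step is needed.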

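You still need to decode those values as UTF-8. Because they are now characters in a str rather than bytes, first encode the string with latin1 (or iso-8859-1), which maps each code point from U+0000 to U+00FF to the single byte with the same value, and then decode the resulting bytes as UTF-8.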

>>> s = '\\xe2\\x99\\xac'
>>> s.encode().decode('unicode-escape').encode('latin1').decode('utf-8')
'♬'
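
If you need this in more than one place, the chain can be wrapped in a small helper. This is only a sketch: the function name `evaluate_utf8_escapes` is made up for illustration, and it assumes the escaped bytes form valid UTF-8 (otherwise the final decode raises UnicodeDecodeError).

```python
def evaluate_utf8_escapes(text: str) -> str:
    """Turn literal \\xNN escapes that spell UTF-8 bytes into real characters.

    Hypothetical helper; assumes the escapes describe valid UTF-8.
    """
    # Step 1: str -> bytes, so the unicode_escape codec can process it
    raw = text.encode('utf-8')
    # Step 2: evaluate the escapes; each \xNN becomes the character U+00NN
    unescaped = raw.decode('unicode_escape')
    # Step 3: latin1 maps U+00NN back to the single byte 0xNN
    utf8_bytes = unescaped.encode('latin1')
    # Step 4: interpret those bytes as UTF-8
    return utf8_bytes.decode('utf-8')


s = '\\xe2\\x99\\xac'
result = evaluate_utf8_escapes(s)
print(hex(ord(result)))  # 0x266c, the code point of ♬
```

When writing the result to the output file, opening it with an explicit encoding='utf-8' avoids depending on the platform's default encoding.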