In Python 3, how do I interpolate a byte string into a regular string and get the same behavior as Python 2 (i.e., get just the escape codes, without the b prefix or double backslashes)?
e.g.:
Python 2.7:
>>> x = u'\u041c\u0438\u0440'.encode('utf-8')
>>> str(x)
'\xd0\x9c\xd0\xb8\xd1\x80'
>>> 'x = %s' % x
'x = \xd0\x9c\xd0\xb8\xd1\x80'
Python 3.3:
>>> x = u'\u041c\u0438\u0440'.encode('utf-8')
>>> str(x)
"b'\\xd0\\x9c\\xd0\\xb8\\xd1\\x80'"
>>> 'x = %s' % x
"x = b'\\xd0\\x9c\\xd0\\xb8\\xd1\\x80'"
Note how with Python 3, I get the b prefix in my output and double backslashes. The result that I would like to get is the result that I get in Python 2.
In Python 2 you have the types str and unicode. str represents a simple byte string, while unicode is a Unicode string.
For Python 3, this changed: str is now what unicode was in Python 2, and bytes is what str was in Python 2.
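As a quick check of that mapping, here is a minimal Python 3 sketch. The variable names text and data are only illustrative and do not come from the question.

```python
# Python 3: plain literals are str (Unicode); b'' literals are bytes.
text = '\u041c\u0438\u0440'      # str holding three code points
data = text.encode('utf-8')      # bytes holding six UTF-8 bytes

print(type(text).__name__, len(text))   # str 3
print(type(data).__name__, len(data))   # bytes 6

# Iterating over bytes yields integers, not 1-character strings as in Python 2.
print(list(data[:2]))                   # [208, 156]
```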
So when you do ("x = %s" % '\u041c\u0438\u0440').encode("utf-8") you can actually omit the u prefix, as it is implicit. In Python 3, every string literal that is not explicitly marked as bytes is Unicode.
In Python 3, this yields the bytes b'x = \xd0\x9c\xd0\xb8\xd1\x80', the same byte sequence as the last line of your Python 2 example:
("x = %s" % '\u041c\u0438\u0440').encode("utf-8")
Note how I encode after building the final result, which is what you should always do: take an incoming object, decode it to Unicode (however you do that), and then, when producing output, encode it in the encoding of your choice. Don't try to handle raw byte strings. That is just ugly and deprecated behaviour.
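The decode-early, encode-late pattern described above could be sketched like this. The helper name format_x and its encoding parameter are hypothetical, chosen only for illustration.

```python
def format_x(raw_value, encoding='utf-8'):
    """Hypothetical helper: bytes in, bytes out, str in between."""
    value = raw_value.decode(encoding)     # 1. decode at the input boundary
    message = 'x = {}'.format(value)       # 2. interpolate using str only
    return message.encode(encoding)        # 3. encode at the output boundary

incoming = b'\xd0\x9c\xd0\xb8\xd1\x80'     # UTF-8 bytes from the question
result = format_x(incoming)
print(result)   # b'x = \xd0\x9c\xd0\xb8\xd1\x80'
```

Keeping the interpolation in str means the encoding is chosen in exactly one place, at the boundary.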
In your Python 3 example, you are interpolating into a Unicode string, not a byte string like you are doing in Python 2.
In Python 3.3, bytes do not support interpolation (string formatting or what-have-you); %-formatting for bytes was only added later, in Python 3.5.
Either concatenate, or use Unicode throughout and encode only after you have interpolated:
b'x = ' + x
or
'x = {}'.format(x.decode('utf8')).encode('utf8')
or
x = '\u041c\u0438\u0440' # the u prefix is ignored in Python 3.3
'x = {}'.format(x).encode('utf8')
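As a sanity check, the three options can be compared side by side. This is only a sketch: the by_* names are illustrative, and the last step assumes Python 3.5 or later, where bytes also accept %-formatting (PEP 461).

```python
import sys

text = '\u041c\u0438\u0440'
x = text.encode('utf8')

by_concat = b'x = ' + x                                        # concatenate bytes
by_decode = 'x = {}'.format(x.decode('utf8')).encode('utf8')   # decode, format, encode
by_text = 'x = {}'.format(text).encode('utf8')                 # stay in str, encode last

# All three produce the same byte sequence.
assert by_concat == by_decode == by_text

# Assumption: only available on Python 3.5+, so it is guarded by a version check.
if sys.version_info >= (3, 5):
    assert b'x = %s' % x == by_concat
```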
In Python 2, byte strings and regular strings are the same, so there's no conversion done by str(). In Python 3 a string is always a Unicode string, so str() of a byte string does a conversion: it returns the repr() of the bytes object, which is where the b prefix and the escaped backslashes come from.
Instead, you can do your own conversion that does what you want:
x2 = ''.join(chr(c) for c in x)
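To make the effect of that conversion concrete, here is a small sketch. It relies on the fact that mapping each byte to the code point with the same number is what the latin-1 codec does; the name result is illustrative.

```python
x = '\u041c\u0438\u0440'.encode('utf-8')

# Map every byte value (0-255) to the code point with the same number.
x2 = ''.join(chr(c) for c in x)

# This is the same mapping that the latin-1 codec performs.
assert x2 == x.decode('latin-1')

result = 'x = %s' % x2
# result is a str of code points U+0000-U+00FF, not real bytes.
assert result == 'x = \xd0\x9c\xd0\xb8\xd1\x80'

# Encoding with latin-1 recovers the original bytes unchanged.
assert result.encode('latin-1') == b'x = ' + x
```

Note that the result only mimics the Python 2 string: it does not hold the text Мир, so it would show up as mojibake if printed, and it has to be encoded with latin-1 to get the original bytes back.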