I cannot get this to work! I have a text file from a save-game file parser with a bunch of UTF-8 Chinese names in it, in byte form. They look like this in source.txt:
\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89
But, no matter how I import it into Python (3 or 2), I get this string, at best:
\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89
I have tried, like other threads have suggested, to re-encode the string as UTF-8 and then decode it with unicode escape, like so:
stringName.encode("utf-8").decode("unicode_escape")
But then it messes up the original encoding, and gives this as the string:
'æ\x89\x8eå\x8a\xa0æ\x8b\x89' (printing this string results in: æå æ )
Now, if I manually copy and paste the original string with a b prefix into the code and decode it, I get the correct characters. For example:
b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'.decode("utf-8")
Results in: '扎加拉'
But, I can't do this programmatically. I can't even get rid of the double slashes.
To be clear, source.txt contains single backslashes. I have tried importing it in many ways, but this is the most common:
with open('source.txt', 'r', encoding='utf-8') as f_open:
    source = f_open.read()
Okay, so I accepted the answer below (I think), but here is what works:
from ast import literal_eval
decodedString = literal_eval("b'{}'".format(stringVariable)).decode('utf-8')
I can't use it on the whole file because of other encoding issues, but extracting each name as a string (stringVariable) and then doing that works! Thank you!
To be more clear, the original file is not just these messed-up UTF encodings. It only uses them for certain fields. For example, here is the beginning of the file:
{'m_cacheHandles': ['s2ma\x00\x00CN\x1f\x1b"\x8d\xdb\x1fr \\\xbf\xd4D\x05R\x87\x10\x0b\x0f9\x95\x9b\xe8\x16T\x81b\xe4\x08\x1e\xa8U\x11',
's2ma\x00\x00CN\x1a\xd9L\x12n\xb9\x8aL\x1d\xe7\xb8\xe6\xf8\xaa\xa1S\xdb\xa5+\t\xd3\x82^\x0c\x89\xdb\xc5\x82\x8d\xb7\x0fv',
's2ma\x00\x00CN\x92\xd8\x17D\xc1D\x1b\xf6(\xedj\xb7\xe9\xd1\x94\x85\xc8`\x91M\x8btZ\x91\xf65\x1f\xf9\xdc\xd4\xe6\xbb',
's2ma\x00\x00CN\xa1\xe9\xab\xcd?\xd2PS\xc9\x03\xab\x13R\xa6\x85u7(K2\x9d\x08\xb8k+\xe2\xdeI\xc3\xab\x7fC',
's2ma\x00\x00CNN\xa5\xe7\xaf\xa0\x84\xe5\xbc\xe9HX\xb93S*sj\xe3\xf8\xe7\x84`\xf1Ye\x15~\xb93\x1f\xc90',
's2ma\x00\x00CN8\xc6\x13F\x19\x1f\x97AH\xfa\x81m\xac\xc9\xa6\xa8\x90s\xfdd\x06\rL]z\xbb\x15\xdcI\x93\xd3V'],
'm_campaignIndex': 0,
'm_defaultDifficulty': 7,
'm_description': '',
'm_difficulty': '',
'm_gameSpeed': 4,
'm_imageFilePath': '',
'm_isBlizzardMap': True,
'm_mapFileName': '',
'm_miniSave': False,
'm_modPaths': None,
'm_playerList': [{'m_color': {'m_a': 255, 'm_b': 255, 'm_g': 92, 'm_r': 36},
'm_control': 2,
'm_handicap': 0,
'm_hero': '\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89',
All of the information before the 'm_hero' field is not UTF-8. So ShadowRanger's solution works if the file is made up only of these fake UTF-8 escapes, but it doesn't work when I have already parsed m_hero out as a string and try to convert that. Karin's solution does work for that.
Solution in Python 3 with only string manipulations and encoding conversions, without evil eval :)

If you like a one-liner, then we can put it simply as:
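A minimal sketch, assuming the escaped text (with literal single backslashes, as read from the file) is already in a str variable named stringVariable:

decoded = stringVariable.encode('latin-1').decode('unicode_escape').encode('latin-1').decode('utf-8')
# encode('latin-1'): str -> bytes, one byte per character
# decode('unicode_escape'): turn the \xNN escapes into the characters U+00E6, U+0089, ...
# encode('latin-1'): map those characters back onto the raw UTF-8 bytes
# decode('utf-8'): decode the UTF-8 bytes into the real Chinese characters
print(decoded)  # 扎加拉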
I'm assuming you're using Python 3. In Python 2, strings are bytes by default, so it would just work for you. But in Python 3, strings are unicode and interpreted as unicode, which is what makes this problem harder if you have a byte string being read as unicode.
This solution was inspired by mgilson's answer. We can literally evaluate your unicode string as a byte string by using literal_eval:
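A minimal sketch, essentially the same code quoted in the question, assuming the escaped text has been extracted into stringVariable:

from ast import literal_eval

# Wrap the escaped text in a bytes literal, evaluate it safely,
# then decode the resulting raw UTF-8 bytes.
decodedString = literal_eval("b'{}'".format(stringVariable)).decode('utf-8')
print(decodedString)  # 扎加拉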
The problem is that the unicode_escape codec is implicitly decoding the result of the escape fixes by assuming the bytes are latin-1, not utf-8. You can fix this by:
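Something along these lines, assuming stringName holds the escaped text read from the file (the variable name follows the question):

fixed = stringName.encode('utf-8').decode('unicode_escape').encode('latin-1').decode('utf-8')
# encode('utf-8'): str -> bytes so the escape codec can run
# decode('unicode_escape'): process the \xNN escapes (implicitly treated as latin-1)
# encode('latin-1'): undo that implicit latin-1 to recover the raw UTF-8 bytes
# decode('utf-8'): decode those bytes properly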
Which (assuming the file contains the literal backslashes and codes, not the bytes they represent) leaves you with
'\u624e\u52a0\u62c9'
(which should be correct; I'm just on a system without font support for those characters, so that's just the safe repr based on Unicode escapes). You could skip a step in Py2 by using the string-escape codec for the first-stage decode (which I believe would allow you to omit the .encode('latin-1') step), but this solution should be portable, and the cost shouldn't be terrible.

At the end of the day, what you get back is a string, right? I would use the string replace method to convert the double slash to a single slash and add a b prefix to make it work.
You can do some silly things like evaluating the string with eval:
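For instance, a sketch with the escaped text in a variable s:

s = r'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'   # literal backslashes, as read from the file
print(eval("b'%s'" % s).decode('utf-8'))      # 扎加拉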
Use ast.literal_eval if you don't want attackers to gain access to your system :-P

Using this in your case would probably look something like:
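A sketch of that, assuming the relevant text has been read from source.txt (the question notes the whole file also contains non-escape data, so in practice you would apply this to each extracted field):

import ast

with open('source.txt', 'r', encoding='utf-8') as f_open:
    escaped = f_open.read().strip()   # strip the trailing newline so it doesn't break the literal

decoded = ast.literal_eval("b'{}'".format(escaped)).decode('utf-8')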
I think that the real issue here is likely that you have a file that contains strings representing bytes (rather than having a file that just stores the bytes themselves). So, fixing whatever code generated that file in the first place is probably a better bet. However, barring that, this is the next best thing that I could come up with ...
So there are several different ways to interpret having the data "in byte form." Let's assume you really do:
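Say, concretely (the variable name is just for illustration):

stringVariable = b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'   # real bytes, not text containing backslash escapes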
The b prefix indicates those are bytes. Without getting into the whole mess that is bytes vs. codepoints/characters and the long differences between Python 2 and 3, the b-prefixed string indicates those are intended to be bytes (e.g. raw UTF-8 bytes).

Then just decode it, which converts the UTF-8 encoding (which you already have in the bytes) into true Unicode characters. In Python 2.7, e.g.:
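A sketch (wrapping print in parentheses so it also runs under Python 3):

print(stringVariable.decode('utf-8'))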
yields:
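扎加拉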
One of your examples did an encode followed by a decode, which can only lead to sorrow and pain. If your variable holds true UTF-8 bytes, you only need the decode.
Update: Based on discussion, it appears the data isn't really in UTF-8 bytes, but a string-serialized version of the same. There are a lot of ways to get from string serial to bytes. Here's mine:
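As a stand-in, here is a sketch of one way such a byteize() could look (not necessarily the original), assuming the serialized string consists purely of four-character \xNN escapes:

def byteize(s):
    # Each character comes through as a four-character escape like '\xe6';
    # take the hex digits from every escape and turn them back into a byte value.
    return bytes(bytearray(int(s[i + 2:i + 4], 16) for i in range(0, len(s), 4)))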
Then:
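Continuing the sketch, with stringVariable now holding the serialized text rather than real bytes:

stringVariable = r'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'   # literal backslashes
print(byteize(stringVariable).decode('utf-8'))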
as before yields:
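扎加拉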
This byteize() isn't as general as the literal_eval()-based accepted answer, but %timeit benchmarking shows it to be about 33% faster on short strings. It could be further accelerated by swapping out range for xrange under Python 2. The literal_eval approach wins handily on long strings, however, given its lower-level nature.