Converting double slash utf-8 encoding

Posted 2019-01-28 08:22

I cannot get this to work! I have a text file from a save-game file parser with a bunch of UTF-8 Chinese names in byte form, like this in source.txt:

\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89

But, no matter how I import it into Python (3 or 2), I get this string, at best:

\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89

I have tried, as other threads have suggested, re-encoding the string as UTF-8 and then decoding it with unicode_escape, like so:

stringName.encode("utf-8").decode("unicode_escape")

But then it messes up the original encoding, and gives this as the string:

'æ\x89\x8eå\x8a\xa0æ\x8b\x89' (printing this string results in: æå æ )

Now, if I manually prefix the original string with b to make it a bytes literal and decode that, I get the correct result. For example:

b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'.decode("utf-8")

Results in: '扎加拉'

But, I can't do this programmatically. I can't even get rid of the double slashes.

To be clear, source.txt contains single backslashes. I have tried importing it in many ways, but this is the most common:

with open('source.txt','r',encoding='utf-8') as f_open:
    source = f_open.read()
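For what it's worth, the doubled backslashes are usually just Python's repr at work: the file really does contain single backslashes, and print shows them that way. A minimal sketch (it writes a sample source.txt first so it is self-contained):

```python
# Recreate a minimal source.txt containing the literal escape text
with open('source.txt', 'w', encoding='utf-8') as f:
    f.write(r'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89')

with open('source.txt', 'r', encoding='utf-8') as f_open:
    source = f_open.read()

print(source)        # \xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89  (single backslashes)
print(repr(source))  # '\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89'
```

So the string holds 36 characters (backslash, x, and two hex digits per byte), not 9 bytes; the doubling only appears in the repr.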

Okay, so I accepted the answer below (I think), but here is what works:

from ast import literal_eval
decodedString = literal_eval("b'{}'".format(stringVariable)).decode('utf-8')

I can't use it on the whole file because of other encoding issues, but extracting each name as a string (stringVariable) and then doing that works! Thank you!

To be more clear, the original file is not just these messed up utf encodings. It only uses them for certain fields. For example, here is the beginning of the file:

{'m_cacheHandles': ['s2ma\x00\x00CN\x1f\x1b"\x8d\xdb\x1fr \\\xbf\xd4D\x05R\x87\x10\x0b\x0f9\x95\x9b\xe8\x16T\x81b\xe4\x08\x1e\xa8U\x11',
                's2ma\x00\x00CN\x1a\xd9L\x12n\xb9\x8aL\x1d\xe7\xb8\xe6\xf8\xaa\xa1S\xdb\xa5+\t\xd3\x82^\x0c\x89\xdb\xc5\x82\x8d\xb7\x0fv',
                's2ma\x00\x00CN\x92\xd8\x17D\xc1D\x1b\xf6(\xedj\xb7\xe9\xd1\x94\x85\xc8`\x91M\x8btZ\x91\xf65\x1f\xf9\xdc\xd4\xe6\xbb',
                's2ma\x00\x00CN\xa1\xe9\xab\xcd?\xd2PS\xc9\x03\xab\x13R\xa6\x85u7(K2\x9d\x08\xb8k+\xe2\xdeI\xc3\xab\x7fC',
                's2ma\x00\x00CNN\xa5\xe7\xaf\xa0\x84\xe5\xbc\xe9HX\xb93S*sj\xe3\xf8\xe7\x84`\xf1Ye\x15~\xb93\x1f\xc90',
                's2ma\x00\x00CN8\xc6\x13F\x19\x1f\x97AH\xfa\x81m\xac\xc9\xa6\xa8\x90s\xfdd\x06\rL]z\xbb\x15\xdcI\x93\xd3V'],
'm_campaignIndex': 0,
'm_defaultDifficulty': 7,
'm_description': '',
'm_difficulty': '',
'm_gameSpeed': 4,
'm_imageFilePath': '',
'm_isBlizzardMap': True,
'm_mapFileName': '',
'm_miniSave': False,
'm_modPaths': None,
'm_playerList': [{'m_color': {'m_a': 255, 'm_b': 255, 'm_g': 92,   'm_r': 36},
               'm_control': 2,
               'm_handicap': 0,
               'm_hero': '\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89',

All of the information before the 'm_hero': field is not UTF-8. So ShadowRanger's solution works if the file consists only of these escaped UTF-8 sequences, but it doesn't work when I have already parsed m_hero out as a string and try to convert that. Karin's solution does work for that.

6 Answers
姐就是有狂的资本
#2 · 2019-01-28 08:29

A Python 3 solution using only string manipulation and encoding conversions, without evil eval :)

import binascii

s = '\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89'
s = s.replace('\\x', '')  # s == 'e6898ee58aa0e68b89'

# any encoding that passes ASCII through unchanged works here;
# for example, s.encode('ascii') would do equally well
s = s.encode('utf8')  # s == b'e6898ee58aa0e68b89'

s = binascii.a2b_hex(s)  # s == b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'
s = s.decode('utf8')  # s == '扎加拉'

If you prefer a one-liner (starting again from the original escaped string s):

binascii.a2b_hex(s.replace('\\x', '').encode()).decode('utf8')
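On Python 3, the same hex-to-bytes step can hypothetically be done without binascii at all, using the built-in bytes.fromhex:

```python
s = r'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'

# Strip the '\x' markers to leave bare hex digits, then parse them as bytes
decoded = bytes.fromhex(s.replace('\\x', '')).decode('utf-8')
print(decoded)  # 扎加拉
```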
走好不送
#3 · 2019-01-28 08:35

I'm assuming you're using Python 3. In Python 2, strings are bytes by default, so it would just work for you. But in Python 3, strings are Unicode and interpreted as Unicode, which is what makes this problem harder when a byte string is being read as Unicode.

This solution was inspired by mgilson's answer. We can literally evaluate your unicode string as a byte string by using literal_eval:

from ast import literal_eval

with open('source.txt', 'r', encoding='utf-8') as f_open:
    source = f_open.read()
    string = literal_eval("b'{}'".format(source)).decode('utf-8')
    print(string)  # 扎加拉
smile是对你的礼貌
#4 · 2019-01-28 08:39

The problem is that the unicode_escape codec is implicitly decoding the result of the escape fixes by assuming the bytes are latin-1, not utf-8. You can fix this by:

# Read the file as bytes:
with open(myfile, 'rb') as f:
    data = f.read()

# Decode with unicode-escape to get Py2 unicode/Py3 str, but interpreted
# incorrectly as latin-1
badlatin = data.decode('unicode-escape')

# Encode back as latin-1 to get back the raw bytes (it's a 1-1 encoding),
# then decode them properly as utf-8
goodutf8 = badlatin.encode('latin-1').decode('utf-8')

This (assuming the file contains the literal backslashes and codes, not the bytes they represent) leaves you with '\u624e\u52a0\u62c9' (which should be correct; I'm on a system without font support for those characters, so that's just the safe repr based on Unicode escapes). You could skip a step in Py2 by using the string-escape codec for the first-stage decode (which I believe would let you omit the .encode('latin-1') step), but this solution should be portable, and the cost shouldn't be terrible.
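The three steps above condense into a short round trip on a single string (a sketch assuming Python 3 and the same escaped input as in the question):

```python
raw = rb'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'  # literal backslashes, as read in 'rb' mode
bad = raw.decode('unicode-escape')               # escapes resolved, but bytes mapped as latin-1
good = bad.encode('latin-1').decode('utf-8')     # re-encode 1:1, then decode properly
print(good)  # 扎加拉
```

The trick works because latin-1 maps every byte value 0-255 to the code point of the same number, so the encode step recovers the raw bytes exactly.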

Animai°情兽
#5 · 2019-01-28 08:41

At the end of the day, what you get back is a string, right? I would use the str.replace method to convert the double backslashes to single backslashes and add a b prefix to make it work.

Evening l夕情丶
#6 · 2019-01-28 08:48

You can do some silly things like evaluating the string:

import ast
s = r'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'
print ast.literal_eval('"%s"' % s).decode('utf-8')
  • Note: use ast.literal_eval rather than eval if you don't want attackers to gain access to your system :-P

Using this in your case would probably look something like:

with open('file') as file_handle:
    data = ast.literal_eval('"%s"' % file_handle.read()).decode('utf-8')

I think that the real issue here is likely that you have a file that contains strings representing bytes (rather than having a file that just stores the bytes themselves). So, fixing whatever code generated that file in the first place is probably a better bet. However, barring that, this is the next best thing that I could come up with ...

Root(大扎)
#7 · 2019-01-28 08:54

So there are several different ways to interpret having the data "in byte form." Let's assume you really do:

s = b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'

The b prefix indicates those are bytes. Without getting into the whole mess that is bytes vs codepoints/characters and the long differences between Python 2 and 3, the b-prefixed string indicates those are intended to be bytes (e.g. raw UTF-8 bytes).

Then just decode it, which converts the UTF-8 encoding (which you already have in the bytes) into true Unicode characters. In Python 2.7, e.g.:

print s.decode('utf-8')

yields:

扎加拉

One of your examples did an encode followed by a decode, which can only lead to sorrow and pain. If your variable holds true UTF-8 bytes, you only need the decode.

Update Based on discussion, it appears the data isn't really in UTF-8 bytes, but a string-serialized version of same. There are a lot of ways to get from string serial to bytes. Here's mine:

from struct import pack

def byteize(s):
    """
    Given a backslash-escaped string serialization of bytes,
    decode it into a genuine byte string.
    """
    bvals = [int(s[i:i+2], 16) for i in range(2, len(s), 4)]
    return pack(str(len(bvals)) + 'B', *bvals)

Then:

print byteize(s).decode('utf-8')

as before yields:

扎加拉

This byteize() isn't as general as the literal_eval()-based accepted answer, but %timeit benchmarking shows it to be about 33% faster on short strings. It could be further accelerated by swapping out range for xrange under Python 2. The literal_eval approach wins handily on long strings, however, given its lower-level nature.

100000 loops, best of 3: 6.19 µs per loop
100000 loops, best of 3: 8.3 µs per loop
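Under Python 3 the same idea needs no struct at all, since the bytes constructor accepts an iterable of ints. A sketch (byteize3 is a hypothetical name, not from the answer):

```python
def byteize3(s):
    """Python 3 take on byteize: parse each 4-char '\\xHH' unit into one byte."""
    # Units start at offsets 0, 4, 8, ...; the two hex digits sit at offsets 2-3
    return bytes(int(s[i:i + 2], 16) for i in range(2, len(s), 4))

print(byteize3(r'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89').decode('utf-8'))  # 扎加拉
```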