Unescape unicode-escapes, but not carriage returns

Posted 2019-09-16 15:04

Question:

I have an ASCII-encoded JSON file with unicode-escapes (e.g., \u201cquotes\u201d) and newlines escaped within strings (e.g., "foo\r\nbar"). Is there a simple way in Python to generate a UTF-8 encoded file by un-escaping the unicode-escapes, but leaving the newline escapes intact?

Calling decode('unicode-escape') on the string will decode the unicode escapes (which is what I want) but it will also decode the carriage returns and newlines (which I don't want).

Answer 1:

Sure there is: use the right tool for the job. Ask the json module to decode the data to a Python unicode object, then encode the result to UTF-8:

import json

# json_text holds a single JSON string value, quotes included (Python 2)
json.loads(json_text).encode('utf8')

Use unicode-escape only for actual Python string literals. JSON strings are not the same as Python strings, even though they may, at first glance, look very similar.
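
One place where the two notations differ is characters outside the Basic Multilingual Plane, which JSON writes as a surrogate pair. The sketch below is a Python 3 illustration (the rest of this answer uses Python 2); the sample value and variable names are made up for the example:

```python
import json

# Hypothetical sample: U+1F600 written as a JSON surrogate pair.
escaped = r'\ud83d\ude00'

# The JSON decoder joins the pair into a single code point.
from_json = json.loads('"' + escaped + '"')

# The unicode_escape codec handles each \uXXXX escape on its own,
# which leaves two lone surrogates (Python 3 behaviour).
from_codec = escaped.encode('ascii').decode('unicode_escape')

# Print only the lengths so the output stays ASCII-safe.
print(len(from_json), len(from_codec))
```

The lone surrogates produced by the codec cannot be encoded to UTF-8 without extra handling, which is one more reason to prefer the json module for JSON input.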

Short demo (keep in mind that the Python interactive interpreter echoes strings as literals):

>>> json.loads(r'"\u201cquotes\u201d"').encode('utf8')
'\xe2\x80\x9cquotes\xe2\x80\x9d'
>>> json.loads(r'"foo\r\nbar"').encode('utf8')
'foo\r\nbar'

Note that the JSON decoder decodes \r and \n just like a Python literal would.
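
If the goal is a UTF-8 file that is still valid JSON, one option is to decode the whole document and serialize it again with ensure_ascii=False: non-ASCII characters are then written literally, while control characters such as \r and \n are escaped again by the encoder. A minimal Python 3 sketch, assuming the input is a complete JSON document (the sample text and variable names are made up for the example):

```python
import json

# Hypothetical ASCII-only JSON text, as it might appear in the input file.
ascii_json = r'{"title": "\u201cquotes\u201d", "body": "foo\r\nbar"}'

# Decode the document into Python objects.
data = json.loads(ascii_json)

# ensure_ascii=False keeps non-ASCII characters as-is;
# \r and \n inside strings are still written as escapes.
utf8_json = json.dumps(data, ensure_ascii=False)

# Bytes ready to be written to a file opened in binary mode.
utf8_bytes = utf8_json.encode('utf-8')
```

This rewrites the document, so whitespace and other formatting details of the original file are not preserved; only the data is.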

If you absolutely have to process only the \uabcd unicode escapes in the JSON input and leave the rest intact, then you need to resort to a regular expression:

import re

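# Matches a literal \uXXXX escape (backslash, 'u', four hex digits)
codepoint = re.compile(r'(\\u[0-9a-fA-F]{4})')

def replace(match):
    # Strip the leading \u and convert the hex digits (Python 2: unichr)
    return unichr(int(match.group(1)[2:], 16))

# text is the escaped input string
codepoint.sub(replace, text).encode('utf8')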

which gives:

>>> codepoint.sub(replace, r'\u201cquotes\u201d').encode('utf8')
'\xe2\x80\x9cquotes\xe2\x80\x9d'
>>> codepoint.sub(replace, r'"foo\r\nbar"').encode('utf8')
'"foo\\r\\nbar"'
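
For Python 3, a similar sketch would use chr instead of unichr. The variant below also skips over escaped backslashes, so that a literal backslash followed by u201c in the data is not mistaken for an escape; the names token and unescape_codepoints are made up for this example:

```python
import re

# Match either an escaped backslash (kept as-is) or a \uXXXX escape.
token = re.compile(r'\\\\|\\u([0-9a-fA-F]{4})')

def replace(match):
    if match.group(1) is None:
        # Escaped backslash: return it unchanged.
        return match.group(0)
    # Convert the four hex digits to the corresponding character.
    return chr(int(match.group(1), 16))

def unescape_codepoints(text):
    """Replace \\uXXXX escapes in text, leaving all other escapes alone."""
    return token.sub(replace, text)

# Only the lengths are printed, to keep the output ASCII-safe.
print(len(unescape_codepoints(r'\u201cquotes\u201d')))
```

This remains a text-level substitution with known gaps: surrogate pairs are left as two lone surrogates (which cannot be encoded to UTF-8 as-is), and escapes such as \u0022 or \u000a would turn into raw characters that make the result invalid JSON.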


Tags: python utf-8