I have an ASCII-encoded JSON file with unicode escapes (e.g., \u201cquotes\u201d) and newlines escaped within strings (e.g., "foo\r\nbar"). Is there a simple way in Python to generate a UTF-8 encoded file by un-escaping the unicode escapes, but leaving the newline escapes intact?
Calling decode('unicode-escape') on the string will decode the unicode escapes (which is what I want), but it will also decode the carriage returns and newlines (which I don't want).
Sure there is: use the right tool for the job and ask the json module to decode the data to Python unicode, then encode the result to UTF-8:
import json

# `input` holds the JSON text; its top-level value is assumed to be a string
json.loads(input).encode('utf8')
Use unicode-escape only for actual Python string literals. JSON strings are not the same as Python strings, even though they may, at first glance, look very similar.
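To make the difference concrete, here is a small sketch, assuming Python 3 (the rest of this answer uses Python 2). The variable names are only illustrative. JSON writes characters outside the Basic Multilingual Plane as a pair of \u escapes, which the json module combines into one character, while unicode_escape applies Python literal rules and leaves two lone surrogates:

```python
import json

raw = r'\ud83d\ude00'  # JSON escape for U+1F600, written as a surrogate pair

# The JSON decoder combines the surrogate pair into a single code point.
from_json = json.loads('"' + raw + '"')

# unicode_escape follows Python literal rules and yields two lone surrogates,
# which cannot be encoded to UTF-8 afterwards.
from_literal = raw.encode('ascii').decode('unicode_escape')

print(len(from_json), len(from_literal))
```

The second result has two code points instead of one, and calling .encode('utf-8') on it raises UnicodeEncodeError.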
Short demo (keep in mind that the Python interactive interpreter echoes strings as literals):
>>> json.loads(r'"\u201cquotes\u201d"').encode('utf8')
'\xe2\x80\x9cquotes\xe2\x80\x9d'
>>> json.loads(r'"foo\r\nbar"').encode('utf8')
'foo\r\nbar'
Note that the JSON decoder decodes \r and \n just like a Python literal would.
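For the file-to-file case in the question, one option is to let the json module serialize the decoded data again. This is a minimal sketch, assuming Python 3; reencode_json_text and reencode_json_file are hypothetical names, not part of any library. With ensure_ascii=False the encoder writes non-ASCII characters as themselves, but it still has to escape control characters such as carriage returns and newlines, so those come back out as \r and \n:

```python
import json

def reencode_json_text(text):
    # Parse the JSON (all escapes are decoded), then serialize it again.
    # ensure_ascii=False emits non-ASCII characters as themselves, while
    # control characters such as \r and \n are still written as escapes.
    return json.dumps(json.loads(text), ensure_ascii=False)

def reencode_json_file(src_path, dst_path):
    # Hypothetical helper: read the ASCII file, write the UTF-8 result.
    with open(src_path, encoding='ascii') as src:
        text = src.read()
    with open(dst_path, 'w', encoding='utf-8') as dst:
        dst.write(reencode_json_text(text))
```

The output is equivalent JSON, but the original whitespace layout is not preserved, so this only fits if byte-for-byte fidelity outside the escapes does not matter.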
If you absolutely have to process only the \uabcd unicode escapes in the JSON input and leave the rest intact, then you need to resort to a regular expression:
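import re

# Matches a literal \uXXXX escape: backslash, 'u', four hex digits
codepoint = re.compile(r'(\\u[0-9a-fA-F]{4})')

def replace(match):
    # Drop the leading \u and turn the hex digits into a character (Python 2)
    return unichr(int(match.group(1)[2:], 16))
codepoint.sub(replace, text).encode('utf8')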
which gives:
>>> codepoint.sub(replace, r'\u201cquotes\u201d').encode('utf8')
'\xe2\x80\x9cquotes\xe2\x80\x9d'
>>> codepoint.sub(replace, r'"foo\r\nbar"').encode('utf8')
'"foo\\r\\nbar"'
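A slightly more defensive variant of the same idea, as a sketch assuming Python 3 (chr instead of unichr); unescape_unicode is a hypothetical name. It skips escaped backslashes, so a \\u201c sequence whose backslash is itself escaped stays untouched. It also leaves alone the escapes that would make the text invalid JSON or impossible to encode if written raw: control characters, the double quote, the backslash, and surrogate halves.

```python
import re

# Match either an escaped backslash (kept as-is) or a \uXXXX escape.
_ESCAPE = re.compile(r'\\\\|\\u([0-9a-fA-F]{4})')

def unescape_unicode(text):
    """Replace JSON unicode escapes with their characters; keep other escapes."""
    def replace(match):
        hex_digits = match.group(1)
        if hex_digits is None:
            return match.group(0)  # an escaped backslash: leave untouched
        code = int(hex_digits, 16)
        # Keep escapes that must stay escaped: surrogate halves, control
        # characters, the double quote (0x22) and the backslash (0x5C).
        if 0xD800 <= code <= 0xDFFF or code < 0x20 or code in (0x22, 0x5C):
            return match.group(0)
        return chr(code)
    return _ESCAPE.sub(replace, text)
```

Usage mirrors the version above: unescape_unicode(text).encode('utf-8'). Characters written as surrogate pairs remain escaped in this sketch, which keeps the output valid but means they are not converted.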