I have a text file with text that should have been interpreted as utf-8 but wasn't (it was given to me this way). Here is an example of a typical line of the file:
\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f
which should have been:
ロンドン在住
Now, I can do it manually in Python by typing the following at the interactive prompt:
>>> h1 = u'\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'
>>> print h1
ロンドン在住
which gives me what I want. Is there a way to do this automatically? I've tried doing stuff like this:
>>> import codecs
>>> f = codecs.open('testfile.txt', encoding='utf-8')
>>> h = f.next()
>>> print h
\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f
I've also tried the 'encode' and 'decode' methods. Any ideas?
Thanks!
\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f
is not UTF-8; it's using the Python unicode-escape format. Use the unicode_escape codec instead.
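For example, reusing your own codecs.open() snippet with just the codec name changed (testfile.txt is the example file from your question), something like this should do it:
>>> import codecs
>>> f = codecs.open('testfile.txt', encoding='unicode_escape')
>>> h = f.next()
>>> print h
ロンドン在住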
Here is the UTF-8 encoding of the above phrase, for comparison:
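At a Python 2 prompt that would look roughly like this; the repr shows raw UTF-8 bytes rather than \uXXXX escape text:
>>> u'\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'.encode('utf8')
'\xe3\x83\xad\xe3\x83\xb3\xe3\x83\x89\xe3\x83\xb3\xe5\x9c\xa8\xe4\xbd\x8f'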
Note that data decoded with unicode_escape is treated as Latin-1 for anything that's not a recognised Python escape sequence.
Be careful, however; it may be that you are really looking at JSON-encoded data, which uses the same notation for specifying character escapes. Use json.loads() to decode actual JSON data; JSON strings with such escapes are delimited with " quotes and are usually part of larger structures (such as JSON lists or objects).
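If your lines did turn out to be JSON strings (note the surrounding " quotes in the sketch below), decoding would look something like this:
>>> import json
>>> json.loads('"\\u30ed\\u30f3\\u30c9\\u30f3\\u5728\\u4f4f"')
u'\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'
>>> print json.loads('"\\u30ed\\u30f3\\u30c9\\u30f3\\u5728\\u4f4f"')
ロンドン在住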