How to convert \uXXXX unicode to UTF-8 using console tools

Posted 2019-01-13 01:53

I use curl to get a URL response; it's a JSON response, and it contains unicode-escaped national characters like \u0144 (ń) and \u00f3 (ó).

How can I convert them to UTF-8, or any other encoding, so I can save them to a file?
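For illustration (this sample is mine, not from the original post), the raw response might contain:

{"city": "Pozna\u0144", "street": "G\u00f3rna"}

and the goal is a file containing:

{"city": "Poznań", "street": "Górna"}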

9 Answers
别忘想泡老子
Answer 2 · 2019-01-13 01:57

Works on Windows, and should work on *nix too. Uses Python 2.

#!/usr/bin/env python
from __future__ import unicode_literals
import sys
import json
import codecs

def unescape_json(fname_in, fname_out):
    # json.load decodes all \uXXXX escapes while parsing
    with open(fname_in, 'rb') as fin:
        js = json.load(fin)
    # ensure_ascii=False emits the characters themselves instead of escapes
    with codecs.open(fname_out, 'wb', 'utf-8') as fout:
        json.dump(js, fout, ensure_ascii=False)

def usage():
    print "Converts all \\uXXXX codes in json into utf-8"
    print "Usage: .py infile outfile"
    sys.exit(1)

def main():
    try:
        fname_in, fname_out = sys.argv[1:]
    except Exception:
        usage()

    unescape_json(fname_in, fname_out)
    print "Done."

if __name__ == '__main__':
    main()
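On Python 3 the same idea works without the deprecated pieces; here is a minimal sketch (my translation of the script above, not part of the original answer):

#!/usr/bin/env python3
import json
import sys

def unescape_json(fname_in, fname_out):
    # json.load decodes all \uXXXX escapes (including surrogate pairs)
    with open(fname_in, 'r', encoding='utf-8') as fin:
        js = json.load(fin)
    # ensure_ascii=False writes the characters themselves, not escapes
    with open(fname_out, 'w', encoding='utf-8') as fout:
        json.dump(js, fout, ensure_ascii=False)

def main():
    if len(sys.argv) != 3:
        print("Usage: unescape_json.py infile outfile")
        sys.exit(1)
    unescape_json(sys.argv[1], sys.argv[2])
    print("Done.")

if __name__ == '__main__':
    main()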
祖国的老花朵
Answer 3 · 2019-01-13 02:00

Don't rely on regexes: JSON has some strange corner cases with \u escapes and non-BMP code points; specifically, JSON encodes one such code point as a pair of \u escapes (a UTF-16 surrogate pair). If you assume one escape sequence translates to one code point, you're doomed on such text.

Using a full JSON parser from the language of your choice is considerably more robust:

$ echo '["foo bar \u0144\n"]' | python -c 'import json, sys; sys.stdout.write(json.load(sys.stdin)[0].encode("utf-8"))'

That's really just feeding the data to this short python script:

import json
import sys

data = json.load(sys.stdin)
data = data[0] # change this to find your string in the JSON
sys.stdout.write(data.encode('utf-8'))

You can save that script as foo.py and call it as curl ... | python foo.py
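Under Python 3, the same script needs one tweak, because encoded bytes must be written to sys.stdout.buffer rather than to the text stream (a sketch, not from the original answer):

import json
import sys

data = json.load(sys.stdin)  # json decodes \uXXXX, incl. surrogate pairs
data = data[0]               # change this to find your string in the JSON
sys.stdout.buffer.write(data.encode('utf-8'))  # raw UTF-8 bytes out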

An example that will break most of the other attempts in this question is "\ud83d\udca3":

% printf '"\\ud83d\\udca3"' | python2 -c 'import json, sys; sys.stdout.write(json.load(sys.stdin)[0].encode("utf-8"))'; echo
💣
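The pairing behaviour is easy to verify in Python directly (an illustration of the corner case, not part of the original answer):

import json

# Two \u escapes (a UTF-16 surrogate pair) decode to ONE code point:
s = json.loads('"\\ud83d\\udca3"')
assert s == '\U0001f4a3'   # U+1F4A3, the bomb emoji
assert len(s) == 1         # one code point, not two
print(s.encode('utf-8'))   # b'\xf0\x9f\x92\xa3' -- four UTF-8 bytes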
不美不萌又怎样
Answer 4 · 2019-01-13 02:01

I found native2ascii from the JDK to be the best way to do it:

native2ascii -encoding UTF-8 -reverse src.txt dest.txt

A detailed description is here: http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/native2ascii.html

Update: no longer available since JDK 9: https://bugs.openjdk.java.net/browse/JDK-8074431
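Since the tool is gone from newer JDKs, here is a rough Python stand-in for the -reverse direction (my sketch, with a made-up function name; note the surrogate-pair caveat from the earlier answer):

def reverse_native2ascii(src, dest):
    # native2ascii -reverse input is pure ASCII with \uXXXX escapes
    with open(src, 'r', encoding='ascii') as fin:
        text = fin.read()
    # unicode_escape decodes \uXXXX, but it does NOT recombine surrogate
    # pairs, and unlike native2ascii it also interprets \n, \t, \\ etc.;
    # for JSON data, prefer a real JSON parser
    decoded = text.encode('ascii').decode('unicode_escape')
    with open(dest, 'w', encoding='utf-8') as fout:
        fout.write(decoded)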

(account banned)
Answer 5 · 2019-01-13 02:03
Note that iconv converts between whole-file encodings and does not interpret \uXXXX escapes, so this only applies when the file is actually UTF-16 (which some Windows tools label "Unicode"):

iconv -f Unicode fullOrders.csv > fullOrders-utf8.csv
(account banned)
Answer 6 · 2019-01-13 02:05

I don't know which distribution you are using, but uni2ascii should be available in its repositories.

$ sudo apt-get install uni2ascii

It only depends on libc6, so it's a lightweight solution (uni2ascii i386 4.18-2 is 55.0 kB on Ubuntu)!

Then use the bundled ascii2uni tool to decode the escapes:

$ echo 'Character 1: \u0144, Character 2: \u00f3' | ascii2uni -a U -q
Character 1: ń, Character 2: ó
来,给爷笑一个
Answer 7 · 2019-01-13 02:10

Might be a bit ugly, but echo -e should do it:

echo -en "$(curl "$URL")"

-e interprets escapes; -n suppresses the trailing newline that echo would normally add.

Note: The \u escape works in the bash builtin echo, but not /usr/bin/echo.

As pointed out in the comments, this requires bash 4.2+, and 4.2.x has a bug in its handling of values in the 0x80-0xff range.

查看更多