How to convert \uXXXX unicode to UTF-8 using console tools

Posted 2019-01-13 01:53

I use curl to get a URL response; it's a JSON response, and it contains unicode-escaped national characters like \u0144 (ń) and \u00f3 (ó).

How can I convert them to UTF-8, or any other encoding, so I can save them to a file?
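For illustration (this sample is mine, not from the original post), the raw response might contain:

{"city": "Pozna\u0144", "street": "G\u00f3rna"}

and the goal is a file containing:

{"city": "Poznań", "street": "Górna"}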

9 Answers
别忘想泡老子
Answer 2 · 2019-01-13 01:57

Works on Windows, and should work on *nix too. Uses Python 2.

#!/usr/bin/env python
from __future__ import unicode_literals
import sys
import json
import codecs

def unescape_json(fname_in, fname_out):
    # json.load decodes all \uXXXX escapes while parsing
    with open(fname_in, 'rb') as fin:
        js = json.load(fin)
    # ensure_ascii=False emits the characters themselves instead of escapes
    with codecs.open(fname_out, 'wb', 'utf-8') as fout:
        json.dump(js, fout, ensure_ascii=False)

def usage():
    print "Converts all \\uXXXX codes in json into utf-8"
    print "Usage: .py infile outfile"
    sys.exit(1)

def main():
    try:
        fname_in, fname_out = sys.argv[1:]
    except Exception:
        usage()

    unescape_json(fname_in, fname_out)
    print "Done."

if __name__ == '__main__':
    main()
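On Python 3 the same idea works without the deprecated pieces; here is a minimal sketch (my translation of the script above, not part of the original answer):

#!/usr/bin/env python3
import json
import sys

def unescape_json(fname_in, fname_out):
    # json.load decodes all \uXXXX escapes (including surrogate pairs)
    with open(fname_in, 'r', encoding='utf-8') as fin:
        js = json.load(fin)
    # ensure_ascii=False writes the characters themselves, not escapes
    with open(fname_out, 'w', encoding='utf-8') as fout:
        json.dump(js, fout, ensure_ascii=False)

def main():
    if len(sys.argv) != 3:
        print("Usage: unescape_json.py infile outfile")
        sys.exit(1)
    unescape_json(sys.argv[1], sys.argv[2])
    print("Done.")

if __name__ == '__main__':
    main()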
祖国的老花朵
Answer 3 · 2019-01-13 02:00

Don't rely on regexes: JSON has some strange corner cases with \u escapes and non-BMP code points; specifically, JSON encodes one such code point as a pair of \u escapes (a UTF-16 surrogate pair). If you assume one escape sequence translates to one code point, you're doomed on such text.

Using a full JSON parser from the language of your choice is considerably more robust:

$ echo '["foo bar \u0144\n"]' | python -c 'import json, sys; sys.stdout.write(json.load(sys.stdin)[0].encode("utf-8"))'

That's really just feeding the data to this short python script:

import json
import sys

data = json.load(sys.stdin)
data = data[0] # change this to find your string in the JSON
sys.stdout.write(data.encode('utf-8'))

You can save that script as foo.py and call it as curl ... | python foo.py
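Under Python 3, the same script needs one tweak, because encoded bytes must be written to sys.stdout.buffer rather than to the text stream (a sketch, not from the original answer):

import json
import sys

data = json.load(sys.stdin)  # json decodes \uXXXX, incl. surrogate pairs
data = data[0]               # change this to find your string in the JSON
sys.stdout.buffer.write(data.encode('utf-8'))  # raw UTF-8 bytes out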

An example that will break most of the other attempts in this question is "\ud83d\udca3":

% printf '"\\ud83d\\udca3"' | python2 -c 'import json, sys; sys.stdout.write(json.load(sys.stdin)[0].encode("utf-8"))'; echo
💣
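The pairing behaviour is easy to verify in Python directly (an illustration of the corner case, not part of the original answer):

import json

# Two \u escapes (a UTF-16 surrogate pair) decode to ONE code point:
s = json.loads('"\\ud83d\\udca3"')
assert s == '\U0001f4a3'   # U+1F4A3, the bomb emoji
assert len(s) == 1         # one code point, not two
print(s.encode('utf-8'))   # b'\xf0\x9f\x92\xa3' -- four UTF-8 bytes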
不美不萌又怎样
Answer 4 · 2019-01-13 02:01

I found native2ascii from the JDK to be the best way to do it:

native2ascii -encoding UTF-8 -reverse src.txt dest.txt

A detailed description is here: http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/native2ascii.html

Update: no longer available since JDK 9: https://bugs.openjdk.java.net/browse/JDK-8074431
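Since the tool is gone from newer JDKs, here is a rough Python stand-in for the -reverse direction (my sketch, with a made-up function name; note the surrogate-pair caveat from the earlier answer):

def reverse_native2ascii(src, dest):
    # native2ascii -reverse input is pure ASCII with \uXXXX escapes
    with open(src, 'r', encoding='ascii') as fin:
        text = fin.read()
    # unicode_escape decodes \uXXXX, but it does NOT recombine surrogate
    # pairs, and unlike native2ascii it also interprets \n, \t, \\ etc.;
    # for JSON data, prefer a real JSON parser
    decoded = text.encode('ascii').decode('unicode_escape')
    with open(dest, 'w', encoding='utf-8') as fout:
        fout.write(decoded)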

(account banned)
Answer 5 · 2019-01-13 02:03
Note that iconv converts between whole-file encodings and does not interpret \uXXXX escapes, so this only applies when the file is actually UTF-16 (which some Windows tools label "Unicode"):

iconv -f Unicode fullOrders.csv > fullOrders-utf8.csv
(account banned)
Answer 6 · 2019-01-13 02:05

I don't know which distribution you are using, but uni2ascii should be available in its repositories.

$ sudo apt-get install uni2ascii

It only depends on libc6, so it's a lightweight solution (uni2ascii i386 4.18-2 is 55.0 kB on Ubuntu)!

Then use the bundled ascii2uni tool to decode the escapes:

$ echo 'Character 1: \u0144, Character 2: \u00f3' | ascii2uni -a U -q
Character 1: ń, Character 2: ó
来,给爷笑一个
Answer 7 · 2019-01-13 02:10

Might be a bit ugly, but echo -e should do it:

echo -en "$(curl "$URL")"

-e interprets escapes; -n suppresses the trailing newline that echo would normally add.

Note: The \u escape works in the bash builtin echo, but not /usr/bin/echo.

As pointed out in the comments, this requires bash 4.2+, and 4.2.x has a bug in its handling of values in the 0x80-0xff range.

查看更多