Get a unicode from python's str byte sequence

I have an old Django app that saved UTF-8 strings to the database in such a way that some of them look like invalid UTF-8 when I fetch them in Ruby.

Before saving, the strings were of type str in Python, but when fetched from the database Django gave me a proper unicode string. When I fetch the same record in Rails, I get a byte sequence that is identical to Python's str value, and Ruby complains that it is an invalid byte sequence.

Example: the string I tested was a single emoji:

Tags: python ruby-on-rails ruby utf-8 character-encoding

1 answer

别忘想泡老子

#2 · 2019-06-08 12:22

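I can't solve your problem, but I can explain that byte sequence. What you have is UTF-8 encoded UTF-16: each of the emoji's two UTF-16 code units was encoded as its own UTF-8 sequence, a scheme commonly known as CESU-8.

Both 237, 160, 189 and 237, 180, 165 are 3-byte UTF-8 sequences of the form: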

1110xxxx 10xxxxxx 10xxxxxx (the x's are the relevant bits)

... which translate to code points 55357 and 56613 respectively (0xD83D and 0xDD25 in hex):

[237, 160, 189, 237, 180, 165].map { |b| b.to_s(2) }
#=> ["11101101", "10100000", "10111101", "11101101", "10110100", "10100101"]
#         ^^^^      ^^^^^^      ^^^^^^        ^^^^      ^^^^^^      ^^^^^^

[0b1101_100000_111101, 0b1101_110100_100101]
#=> [55357, 56613]
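The same bit extraction can be written with masks and shifts instead of hand-copied binary literals. This is only a sketch: `decode_three_byte` is a hypothetical helper name, and it assumes its input really is a 3-byte sequence of the form shown above (it does no validation).

```ruby
# Hypothetical helper: decode one 3-byte UTF-8 sequence
# (1110xxxx 10xxxxxx 10xxxxxx) into its code point.
# Assumes the three bytes already have that shape; nothing is validated.
def decode_three_byte(b1, b2, b3)
  ((b1 & 0x0F) << 12) | # low 4 bits of the lead byte
    ((b2 & 0x3F) << 6) | # low 6 bits of the first continuation byte
    (b3 & 0x3F)          # low 6 bits of the second continuation byte
end

high = decode_three_byte(237, 160, 189)
low  = decode_three_byte(237, 180, 165)

puts format('%d (0x%X)', high, high) # prints "55357 (0xD83D)"
puts format('%d (0x%X)', low, low)   # prints "56613 (0xDD25)"
```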

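Unfortunately, these code points are invalid in UTF-8. That's because they are not characters at all but UTF-16 surrogates: 0xD83D is a high surrogate and 0xDD25 is a low surrogate, and together they form a single UTF-16 surrogate pair: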

[55357, 56613].pack('S>2').encode('utf-8', 'utf-16be')
#=> "🔥"
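Building on that explanation, one possible way to repair such strings in Ruby is to scan the raw bytes, find each high/low surrogate pair that was stored as two 3-byte sequences, and replace it with the proper 4-byte UTF-8 encoding of the combined code point. The following is a minimal sketch, not a tested fix for the original app: `repair_surrogates` is a hypothetical name, and it assumes everything else in the string is already valid UTF-8. Unpaired surrogates are left untouched.

```ruby
# Hypothetical helper: rewrite surrogate pairs that were stored as two
# 3-byte UTF-8 sequences into the 4-byte UTF-8 form of the real code point.
# Assumes the rest of the input is valid UTF-8; lone surrogates are copied as-is.
def repair_surrogates(raw)
  bytes = raw.bytes
  out = []
  i = 0
  while i < bytes.length
    b = bytes[i, 6]
    if b.length == 6 &&
       b[0] == 0xED && (0xA0..0xAF).cover?(b[1]) && (0x80..0xBF).cover?(b[2]) &&
       b[3] == 0xED && (0xB0..0xBF).cover?(b[4]) && (0x80..0xBF).cover?(b[5])
      # Rebuild the two UTF-16 code units from the payload bits.
      high = 0xD000 | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)
      low  = 0xD000 | ((b[4] & 0x3F) << 6) | (b[5] & 0x3F)
      # Standard UTF-16 surrogate pair arithmetic.
      code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
      out.concat([code_point].pack('U').bytes)
      i += 6
    else
      out << bytes[i]
      i += 1
    end
  end
  out.pack('C*').force_encoding(Encoding::UTF_8)
end

fixed = repair_surrogates([237, 160, 189, 237, 180, 165].pack('C*'))
puts fixed.valid_encoding?           # prints "true"
puts fixed.codepoints.first.to_s(16) # prints "1f525"
```

Working on a byte array rather than on the string itself avoids Ruby raising on the invalid sequences before they can be fixed. Whether it is better to repair the data once in the database or on every read depends on the application.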


     