Get a unicode string from Python's str byte sequence

Posted 2019-06-08 11:49

I have an old Django app that saved UTF-8 strings to the database in a way that makes some of them look like invalid UTF-8 when I fetch them in Ruby.

Before saving, the strings were of type str in Python, but when fetched from the database, Django gave me a proper unicode string. When I fetch the same record in Rails, I get a byte sequence identical to Python's str string, and Ruby complains that it's an invalid byte sequence.

Example: the tested string was a single emoji:

1 answer
别忘想泡老子
#2 · 2019-06-08 12:22

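I can't solve your problem, but I can explain that byte sequence. What you have is UTF-8 encoded UTF-16: each 16-bit UTF-16 code unit was encoded as its own UTF-8 sequence (a scheme sometimes called CESU-8).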

Both 237, 160, 189 and 237, 180, 165 are 3-byte UTF-8 sequences:

  • 1110xxxx 10xxxxxx 10xxxxxx (the x's are the relevant bits)

... which translate to the code points 55357 and 56613, respectively (0xD83D and 0xDD25 in hex):

[237, 160, 189, 237, 180, 165].map { |b| b.to_s(2) }
#=> ["11101101", "10100000", "10111101", "11101101", "10110100", "10100101"]
#         ^^^^      ^^^^^^      ^^^^^^        ^^^^      ^^^^^^      ^^^^^^

[0b1101_100000_111101, 0b1101_110100_100101]
#=> [55357, 56613]
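
The same bit extraction can be written with masks and shifts instead of hand-copied binary literals. This is only a minimal sketch; `decode_three_bytes` is a name made up for illustration, not a Ruby or Rails API, and it assumes the input really is a 3-byte sequence of the form shown above:

```ruby
# Hypothetical helper (not part of any library): pull the 16 payload bits
# out of a 3-byte sequence of the form 1110xxxx 10xxxxxx 10xxxxxx.
def decode_three_bytes(b1, b2, b3)
  ((b1 & 0x0F) << 12) |  # low 4 bits of the lead byte
    ((b2 & 0x3F) << 6) | # low 6 bits of the first continuation byte
    (b3 & 0x3F)          # low 6 bits of the second continuation byte
end

bytes = [237, 160, 189, 237, 180, 165]
units = bytes.each_slice(3).map { |b1, b2, b3| decode_three_bytes(b1, b2, b3) }
units
#=> [55357, 56613]
units.map { |u| u.to_s(16) }
#=> ["d83d", "dd25"]
```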

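Unfortunately, these code points are invalid in UTF-8: they fall in the surrogate range (0xD800 to 0xDFFF), which is reserved for UTF-16. That's because they are actually the two 16-bit code units of a UTF-16 surrogate pair: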

[55357, 56613].pack('S>2').encode('utf-8', 'utf-16be')
#=> "🔥"
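
If the goal is to get a valid string on the Ruby side, one possible approach is to rejoin each surrogate pair into a single code point and re-encode it. The following is a minimal sketch, assuming the raw bytes arrive from the database unchanged; `repair_surrogate_utf8` is a hypothetical name, not a Rails or Ruby API, and it only handles well-formed high/low pairs (lone surrogates are left as they are):

```ruby
# Hypothetical helper: rewrite every surrogate pair that was stored as two
# 3-byte sequences into the single 4-byte UTF-8 sequence it stands for.
def repair_surrogate_utf8(raw)
  # High surrogate: ED A0..AF xx, low surrogate: ED B0..BF xx (binary regexp).
  pattern = /\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF]/n

  fixed = raw.b.gsub(pattern) do |pair|
    b = pair.bytes
    high = ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)
    low  = ((b[3] & 0x0F) << 12) | ((b[4] & 0x3F) << 6) | (b[5] & 0x3F)
    # Standard UTF-16 surrogate pair arithmetic.
    code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    [code_point].pack('U').b # keep the replacement binary to match the receiver
  end

  fixed.force_encoding(Encoding::UTF_8)
end

raw = [237, 160, 189, 237, 180, 165].pack('C*')
raw.dup.force_encoding(Encoding::UTF_8).valid_encoding?
#=> false
fixed = repair_surrogate_utf8(raw)
fixed.valid_encoding?
#=> true
fixed.codepoints
#=> [128293]  (0x1F525)
```

Working on a binary copy keeps `gsub` from raising on the invalid input, and any bytes outside a surrogate pair pass through untouched.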