I have an old Django app that saved UTF-8 strings to the database in a way that makes some of them look like invalid UTF-8 when I fetch them in Ruby.
Before saving, the strings were of type `str` in Python, but when fetched from the database, Django gave me a proper `unicode` string. When I fetch the same record in Rails, I get a byte sequence identical to Python's `str` string, and Ruby complains that it's an invalid byte sequence.
Example: the tested string was a single emoji:
I can't solve your problem, but I can explain that byte sequence. What you have is UTF-8-encoded UTF-16 (a scheme sometimes called CESU-8).
Both `237, 160, 189` and `237, 180, 165` are 3-byte UTF-8 sequences of the form `1110xxxx 10xxxxxx 10xxxxxx` (the `x`'s are the relevant bits), which translate to the code points `55357` and `56613` respectively (`0xD83D` and `0xDD25` in hex).
Unfortunately, these code points are invalid in UTF-8. That's because they are actually UTF-16 surrogate code units, which only have meaning as a pair:
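
As a rough check of the arithmetic above, here is a minimal Python 3 sketch (Python 3 is an assumption; the original app's `str`/`unicode` split suggests Python 2). It extracts the payload bits from each 3-byte sequence, combines the two surrogates into a single code point, and then performs the same repair with the standard `surrogatepass` error handler. The names `raw`, `decode_3byte`, `high`, `low` and `repaired` are illustrative only and do not come from the question.

```python
# The six bytes from the question, as a bytes object.
raw = bytes([237, 160, 189, 237, 180, 165])


def decode_3byte(b0, b1, b2):
    """Extract the 16 payload bits from 1110xxxx 10xxxxxx 10xxxxxx."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


high = decode_3byte(*raw[:3])  # 0xD83D (55357), a high surrogate
low = decode_3byte(*raw[3:])   # 0xDD25 (56613), a low surrogate
print(hex(high), hex(low))

# Combine the UTF-16 surrogate pair into one supplementary code point.
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(code_point))  # 0x1f525

# The same repair using only codecs: let the surrogates through the UTF-8
# decoder, write them out as UTF-16 code units, then decode those as a pair.
repaired = (
    raw.decode("utf-8", "surrogatepass")
    .encode("utf-16-le", "surrogatepass")
    .decode("utf-16-le")
)
print(repaired == chr(code_point))  # True
```

The round trip through UTF-16 works because a UTF-16 decoder is exactly the component that knows how to pair a high and a low surrogate; a strict UTF-8 decoder rejects them, which matches the error Ruby reports.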