I am having a very difficult time with this:
# contained within:
"MA\u008EEIKIAI"
# should be
"MAŽEIKIAI"
# nature of string
$ p string3
"MA\u008EEIKIAI"
$ puts string3
MAEIKIAI
$ string3.inspect
"\"MA\\u008EEIKIAI\""
$ string3.bytes
#<Enumerator: "MA\u008EEIKIAI":bytes>
Any ideas on where to start?
Note: this is not a duplicate of my previous question.
What about using Regexp and Array#pack to convert the Unicode escape?
\u008E means that the Unicode character with the codepoint 8e (in hex) appears at that point in the string. This character is the control character "SINGLE SHIFT TWO" (see the code chart (PDF)). The character Ž is at the codepoint U+017D. However, it is at position 8e in the Windows CP-1252 encoding. Somehow you've got your encodings mixed up.
The easiest way to "fix" this is probably just to open the file containing the string (or the database record or whatever) and edit it to be correct. The real solution will depend on where the string in question came from and how many bad strings you have.
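To check the mix-up described above, here is a minimal sketch that prints the codepoint actually stored in the string, the codepoint of the intended character, and the byte that CP-1252 uses for it. The variable names are only illustrative, and the snippet assumes the string is UTF-8 encoded.

```ruby
# Illustrative names; assumes the string is UTF-8 encoded.
mangled  = "MA\u008EEIKIAI"[2]  # the character that should have been Ž
intended = "\u017D"              # Ž, LATIN CAPITAL LETTER Z WITH CARON

# Codepoint stored in the string versus the codepoint that was intended.
puts format('U+%04X', mangled.ord)
puts format('U+%04X', intended.ord)

# The single byte CP-1252 uses for Ž.
cp1252_byte = intended.encode('Windows-1252').bytes.to_a.first
puts format('%02x', cp1252_byte)
```

If the first and last values match (8e here) while the middle one differs, the text was most likely CP-1252 that got read as if each byte were a Unicode codepoint.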
Assuming the string is in UTF-8 encoding, \u008E will consist of the two bytes c2 and 8e. Note that the second byte, 8e, is the same as the encoding of Ž in CP-1252. One way to convert the string would be to reinterpret that byte as CP-1252 and transcode the result to UTF-8.
Note that this isn't a general solution to fix all issues like this. Not all CP-1252 characters, when mangled and expressed in UTF-8 this way, will be amenable to conversion like this. Some will be two bytes c2 xx, where xx is the correct byte (as in this case); others will be c3 yy, where yy is a different byte.
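The conversion described above could be sketched as follows. This is not the code from the original answer: repair_c1_controls is a hypothetical helper, and it assumes the input is a UTF-8 string in which only C1 control characters (U+0080 to U+009F, the c2 xx case) are mangled. Characters from U+00A0 upward are left alone, since they may be legitimate.

```ruby
# Hypothetical helper (not from the original answer): replace each C1 control
# character (U+0080..U+009F) with the character CP-1252 assigns to that byte.
# Assumes str is UTF-8 encoded.
def repair_c1_controls(str)
  str.gsub(/[\u0080-\u009F]/) do |ch|
    begin
      # For these characters the codepoint equals the original CP-1252 byte.
      [ch.ord].pack('C').force_encoding('Windows-1252').encode('UTF-8')
    rescue Encoding::UndefinedConversionError
      ch # no CP-1252 mapping for this byte; leave the character untouched
    end
  end
end

string3 = "MA\u008EEIKIAI"
puts repair_c1_controls(string3) # prints MAŽEIKIAI
```

Restricting the substitution to the C1 range is a deliberately conservative choice: those codepoints almost never appear in real text, so rewriting them is unlikely to damage strings that were already correct.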