Convert escaped unicode (\u008E) to accented chara

I am having a very difficult time with this:

# contained within:
"MA\u008EEIKIAI"

# should be
"MAŽEIKIAI"

# nature of string
$ p string3
"MA\u008EEIKIAI" 

$ puts string3
MAEIKIAI

$ string3.inspect
"\"MA\\u008EEIKIAI\""

$ string3.bytes
#<Enumerator: "MA\u008EEIKIAI":bytes>

Any ideas on where to start?

Note: this is not a duplicate of my previous question.

标签： ruby encoding

2条回答

smile是对你的礼貌

2楼-- · 2019-06-21 19:32

What about using Regexp & String#pack to convert the Unicode escape?

str = "MA\\u008EEIKIAI"
puts str    #=> MA\u008EEIKIAI

str.gsub!(/\\u(.{4})/) do |match|
  [$1.to_i(16)].pack('U')
end
puts str    #=> MA EIKIAI

0人赞添加讨论(0) 举报

We Are One

3楼-- · 2019-06-21 19:39

\u008E means that the unicode character with the codepoint 8e (in hex) appears at that point in the string. This character is the control character “SINGLE SHIFT TWO” (see the code chart (pdf)). The character Ž is at the codepoint u017d. However it is at position 8e in the Windows CP-1252 encoding. Somehow you’ve got your encodings mixed up.

The easiest way to “fix” this is probably just to open the file containing the string (or the database record or whatever) and edit it to be correct. The real solution will depend on where the string in question came from and how many bad strings you have.

Assuming the string is in UTF-8 encoding, \u008E will consist of the two bytes c2 and 8e. Note that the second byte, 8e, is the same as the encoding of Ž in CP-1252. On way to convert the string would be something like this:

string3.force_encoding('BINARY') # treat the string just as bytes for now
string3.gsub!(/\xC2/n, '')       # remove the C2 byte
string3.force_encoding('CP1252') # give the string the correct encoding
string3.encode('UTF-8')          # convert to the desired encoding

Note that this isn’t a general solution to fix all issues like this. Not all CP-1252 characters, when mangled and expressed in UTF-8 this way will amenable to conversion like this. Some will be two bytes c2 xx where xx the correct byte (like in this case), others will be c3 yy where yy is a different byte.

0人赞添加讨论(0) 举报

Convert escaped unicode (\u008E) to accented chara

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间