Adding a backslash to fix character encoding in Ruby

Posted 2019-05-27 05:02

I'm sure this is very easy but I'm getting tied in a knot with all these backslashes.

I have some data that I'm scraping (politely) from a website. Occasionally a sentence comes to me looking something like this:

u00a362 000? you must be joking

Which should of course be '£62 000? you must be joking'. A short test in irb deciphered it.

ruby-1.9.2-p180 :001 > string = "u00a3"
  => "u00a3" 
ruby-1.9.2-p180 :002 > string = "\u00a3"
  => "£" 

Of course: add a backslash and it will be decoded. I created the following with the help of this question:

puts str.gsub('u00', '\\u00') 

which resulted in \u00a3 being output. This is all well and good, but I want it to be £ in the string itself. Just printing it with puts isn't enough.

It's no good doing gsub('u00a3', '£') as there will doubtless be other characters I'm missing.

Thanks for any help.

2 Answers
地球回转人心会变
Reply #2 · 2019-05-27 05:30

Warning: the following is not pretty.

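str = "u00a362 000? you must be joking"
# Wrap each u00XX sequence (XX = two hex digits) in markers, then split on the markers.
split_unicode = str.gsub(/(u00[0-9a-fA-F]{2})/, "split_here\\1split_here").split(/split_here/)
final = split_unicode.map do |elem|
  if elem =~ /\Au00[0-9a-fA-F]{2}\z/
    # Drop the "u00" prefix, parse the hex digits, and pack the code point as UTF-8.
    [elem[3..-1].hex].pack("U*")
  else
    elem
  end
end
puts final.join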

So the idea here is to find the u00xx sequences and parse their hex digits into integer code points. From there, we can use the pack method with the "U*" directive to turn each code point into the right Unicode character.

It can also be crunched into a horrible one-liner!

puts str.gsub(/(u00[0-9a-fA-F]{2})/, "split_here\\1split_here").split(/split_here/).map { |elem| elem =~ /\Au00[0-9a-fA-F]{2}\z/ ? [elem[3..-1].hex].pack("U*") : elem }.join

There might be a better solution (I hope!) but this one works.
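
One possibly simpler variant, offered only as a sketch: the \u escape is interpreted when Ruby parses a string literal, so inserting a backslash at runtime just produces the six literal characters \u00a3. Passing a block to gsub lets each match be converted directly, without the split markers. The helper name decode_stripped_escapes is hypothetical, and the pattern assumes every damaged escape has the form u00XX with two hex digits, as in the question's sample; ordinary text that happens to contain such a sequence would be converted too.

```ruby
# Hypothetical helper: decodes "u00XX" sequences whose leading backslash was lost.
# Assumption: only code points U+0000..U+00FF occur, as in the sample data.
def decode_stripped_escapes(text)
  # gsub yields each match to the block; $1 holds the two hex digits after "u00".
  text.gsub(/u00([0-9a-fA-F]{2})/) { [$1.hex].pack("U") }
end

str = "u00a362 000? you must be joking"
decoded = decode_stripped_escapes(str)
puts decoded
```

Here decoded holds the converted text itself, so it can be stored or processed further rather than only printed.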

我命由我不由天
Reply #3 · 2019-05-27 05:50

Try the Iconv library for converting the incoming string. You might also take a look at the stringex gem. Its methods "go the other way", but it may provide the mappings you're looking for. That said, if the encoding is already broken, it can be impossible to recover the original text.
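
For readers on newer Rubies: Iconv was deprecated in Ruby 1.9.3 and removed from the standard library in 2.0, and the built-in String#encode covers the same kind of conversion. The sketch below assumes the scraped bytes are ISO-8859-1, which the question does not establish. It handles bytes in the wrong encoding, which is a different problem from a stripped backslash.

```ruby
# Assumption: the scraped bytes are ISO-8859-1, where byte 0xA3 is the pound sign.
raw = [0xA3].pack("C") + "62 000"

# Transcode to UTF-8, telling Ruby which encoding the source bytes are in.
utf8 = raw.encode("UTF-8", "ISO-8859-1")
puts utf8

# For input of uncertain quality, substitute "?" for anything that cannot be converted.
safe = raw.encode("UTF-8", "ISO-8859-1", invalid: :replace, undef: :replace, replace: "?")
```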
