I'm sure this is very easy but I'm getting tied in a knot with all these backslashes.
I have some data that I'm scraping (politely) from a website. Occasionally a sentence comes to me looking something like this:
u00a362 000? you must be joking
Which should of course be '£62 000? you must be joking'. A short test in irb deciphered it.
ruby-1.9.2-p180 :001 > string = "u00a3"
=> "u00a3"
ruby-1.9.2-p180 :002 > string = "\u00a3"
=> "£"
Of course: add a backslash and it will be decoded. I created the following with the help of this question:
puts str.gsub('u00', '\\u00')
which resulted in \u00a3 being output. This is all well and good, but I want it to be £ in the string itself; just puts-ing it isn't enough.
It's no good doing gsub('u00a3', '£'), as there will doubtless be other characters I'm missing.
Thanks for any help.
Warning, the following is not really pretty.
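str = "u00a362 000? you must be joking"
# Wrap each u00XX sequence (XX = two hex digits) in markers, then split on the markers.
split_unicode = str.gsub(/(u00[0-9a-fA-F]{2})/, "split_here\\1split_here").split(/split_here/)
final = split_unicode.map do |elem|
  if elem =~ /\Au00[0-9a-fA-F]{2}\z/
    # Parse the hex digits as a code point and pack it into a UTF-8 character.
    [elem.sub(/\Au00/, '').hex].pack("U*")
  else
    elem
  end
end
puts final.join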
So the idea here is to find u00xx values and parse their hex digits into an integer code point. From there, we can use the pack method to output the right Unicode characters.
It can also be crunched into a horrible one-liner!
puts str.gsub(/(u00[0-9a-fA-F]{2})/, "split_here\\1split_here").split(/split_here/).map { |elem| elem =~ /\Au00[0-9a-fA-F]{2}\z/ ? [elem.sub(/\Au00/, '').hex].pack("U*") : elem }.join
There might be a better solution (I hope!) but this one works.
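One possibly tidier variant of the same idea is to give gsub a block, so each match is replaced in place and no split markers are needed. This is only a sketch: the helper name decode_bare_escapes is made up for illustration, and it assumes the broken escapes always look like u00 followed by exactly two hex digits. Ordinary text that happens to contain such a sequence would be converted as well.

```ruby
# Hypothetical helper (the name is not from the original posts).
# Assumes every broken escape is "u00" plus exactly two hex digits.
def decode_bare_escapes(text)
  text.gsub(/u00([0-9a-fA-F]{2})/) do
    # $1 holds the two hex digits; String#hex turns them into an Integer,
    # and pack("U") encodes that code point as a UTF-8 character.
    [$1.hex].pack("U")
  end
end

str = "u00a362 000? you must be joking"
decoded = decode_bare_escapes(str)
puts decoded
```

Because gsub returns a new string, decoded holds the converted text itself rather than just printing it, which is what the question asks for.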
Try the Iconv library for converting the incoming string. You might also take a look at the stringex gem. It has methods to "go the other way", but it may provide the mappings you're looking for. That said, if you've got bad encoding, it can be impossible to get it right.