ruby 1.9: invalid byte sequence in UTF-8

I'm writing a crawler in Ruby (1.9) that consumes lots of HTML from a lot of random sites.
When trying to extract links, I decided to just use .scan(/href="(.*?)"/i) instead of nokogiri/hpricot (major speedup). The problem is that I now receive a lot of "invalid byte sequence in UTF-8" errors.
From what I understand, the net/http library doesn't have any encoding-specific options, and the data that comes in is basically not properly tagged.
What would be the best way to actually work with that incoming data? I tried .encode with the replace and invalid options set, but no success so far...
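For reference, a minimal sketch of how the error can be reproduced. The page body below is made up, and it assumes the source file is read as UTF-8 (the default from Ruby 2.0 on; on 1.9 a `# encoding: utf-8` magic comment would be needed). The last line shows one possible workaround that is not from the question: matching on raw bytes instead of characters.

```ruby
# Hypothetical page body: a Latin-1 "e acute" (byte 0xE9) in a string tagged as UTF-8.
html = "<a href=\"http://example.com/\">caf" + "\xE9</a>"

html.valid_encoding?  # false: 0xE9 followed by "<" is not a valid UTF-8 sequence

begin
  # Regexp matching on a string with broken bytes is expected to raise here.
  html.scan(/href="(.*?)"/i)
rescue ArgumentError => e
  error_message = e.message  # the "invalid byte sequence" error from the question
end

# Alternative (an assumption, not from the question): treat the page as raw bytes.
# An ASCII-only regexp can be applied to a binary string.
links = html.dup.force_encoding("BINARY").scan(/href="(.*?)"/i).flatten
```

The extracted links come back tagged as binary (ASCII-8BIT), so they may need their own encoding handling before being displayed or stored.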

Tags: ruby encoding utf-8

11 answers

像晚风撩人

#2 · 2019-01-02 19:58

My current solution is to run:

my_string.unpack("C*").pack("U*")

This at least gets rid of the exceptions, which were my main problem.
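A short sketch of what that one-liner does, which may help when deciding whether it fits: each byte is read as an integer and then written back as the Unicode code point with the same number. In effect the input is treated as Latin-1, so text that was already valid UTF-8 gets mangled. The helper name is hypothetical.

```ruby
# unpack("C*") yields one integer per byte; pack("U*") encodes each integer
# as a Unicode code point in UTF-8.
def bytes_to_code_points(str)  # hypothetical helper name
  str.unpack("C*").pack("U*")
end

latin1_bytes = "caf" + "\xE9"            # "e acute" as a single Latin-1 byte
fixed = bytes_to_code_points(latin1_bytes)
fixed.valid_encoding?                     # true
fixed == "caf\u00E9"                      # true

utf8_text = "caf\u00E9"                   # already valid UTF-8 (bytes C3 A9)
mangled = bytes_to_code_points(utf8_text)
mangled == "caf\u00C3\u00A9"              # true: one character became two
```

So the exceptions go away, but pages that really were UTF-8 end up with doubled-up characters in anything outside ASCII.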

大哥的爱人

#3 · 2019-01-02 19:58

If you don't "care" about the data you can just do something like:

search_params = params[:search].valid_encoding? ? params[:search].gsub(/\W+/, '') : "nothing"

I just used valid_encoding? to get past it. Mine is a search field, and I kept seeing the same weird input over and over, so I used something like the line above just to keep the system from breaking. Since I don't control the user experience enough to validate the input before it is sent (for example, automatic feedback telling the user to fix it), I can just take the input in, strip it, and return blank results.
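Outside of a Rails controller, the same idea might look like the sketch below. The helper name is hypothetical; the "nothing" fallback is taken from the line above.

```ruby
# Hypothetical helper: returns a fallback for input that is not valid in its
# tagged encoding, otherwise strips every run of non-word characters.
def clean_search_term(raw)
  return "nothing" unless raw.valid_encoding?
  raw.gsub(/\W+/, "")  # removes anything that is not an ASCII letter, digit, or underscore
end

clean_search_term("ruby 1.9!")     # "ruby19"
clean_search_term("bad" + "\xFF")  # "nothing"
```

Note that this discards the whole input when a single byte is bad, which is fine for a search box but probably too blunt for crawled pages.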

初与友歌

#4 · 2019-01-02 19:59

While Nakilon's solution works, at least as far as getting past the error, in my case I had a weird, messed-up character, originating from a Microsoft Excel file converted to CSV, that Ruby registered as (get this) a Cyrillic K, which showed up as a bold K. To fix this I used 'iso-8859-1', viz. CSV.parse(f, :encoding => "iso-8859-1"), which turned my freaky-deaky Cyrillic K's into a much more manageable /\xCA/, which I could then remove with string.gsub!(/\xCA/, '')
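A rough sketch of why tagging the data as ISO-8859-1 helps: in that encoding every byte is a character, so a string can never be invalid. Byte 0xCA also happens to be the Cyrillic capital Ka in Windows-1251, which may be where the "Cyrillic K" came from. The sample data is made up, and String#delete is used here in place of the regexp, since a /\xCA/ literal may be rejected in a UTF-8 source file.

```ruby
# Hypothetical CSV line containing the stray byte 0xCA, tagged as ISO-8859-1.
raw = ("Name" + "\xCA" + ",Value\n").force_encoding("ISO-8859-1")
raw.valid_encoding?            # true: every byte is a character in ISO-8859-1

# The same byte, tagged the same way, so it can be removed as a character.
stray = "\xCA".dup.force_encoding("ISO-8859-1")
cleaned = raw.delete(stray).encode("UTF-8")

# Without removal, the byte would transcode to U+00CA.
as_utf8 = raw.encode("UTF-8")
```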

荒废的爱情

#5 · 2019-01-02 20:01

I encountered a string that mixed English, Russian, and some other alphabets, which caused an exception. I need only Russian and English, and this currently works for me:

ec1 = Encoding::Converter.new "UTF-8","Windows-1251",:invalid=>:replace,:undef=>:replace,:replace=>""
ec2 = Encoding::Converter.new "Windows-1251","UTF-8",:invalid=>:replace,:undef=>:replace,:replace=>""
t = ec2.convert ec1.convert t
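The round trip works as a filter: anything that has no Windows-1251 equivalent is dropped on the way out, and what survives is converted back. Below is a sketch of the same idea using String#encode rather than Encoding::Converter (a substitution, not the code above); the helper name and sample text are made up.

```ruby
# Hypothetical helper: keeps only characters representable in Windows-1251
# (ASCII plus Cyrillic, among others) and silently drops everything else.
def keep_cp1251_chars(text)
  text.encode("Windows-1251", "UTF-8", invalid: :replace, undef: :replace, replace: "")
      .encode("UTF-8", "Windows-1251", invalid: :replace, undef: :replace, replace: "")
end

# Russian "Privet", English "hello", and three CJK characters.
mixed = "\u041F\u0440\u0438\u0432\u0435\u0442, hello, \u65E5\u672C\u8A9E"
kept = keep_cp1251_chars(mixed)            # CJK characters are dropped
no_bad_bytes = keep_cp1251_chars("ok" + "\xFF")  # the invalid byte is dropped
```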

高级女魔头

#6 · 2019-01-02 20:05

This seems to work:

def sanitize_utf8(string)
  return nil if string.nil?
  return string if string.valid_encoding?
  string.chars.select { |c| c.valid_encoding? }.join
end
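A quick usage sketch with a made-up input. On Ruby 2.1 and later (so not on the 1.9 from the question), String#scrub with an empty replacement should give the same result for a case like this.

```ruby
def sanitize_utf8(string)
  return nil if string.nil?
  return string if string.valid_encoding?
  # Each invalid byte shows up as its own one-byte "character", so it can be filtered out.
  string.chars.select { |c| c.valid_encoding? }.join
end

dirty = "abc" + "\xFF" + "def"   # hypothetical input with one stray byte
clean = sanitize_utf8(dirty)      # "abcdef"
scrubbed = dirty.scrub("")        # "abcdef" (String#scrub, Ruby 2.1+)
```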

人间绝色

#7 · 2019-01-02 20:09

Neither the accepted answer nor the other answer worked for me. I found this post, which suggested:

string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

This fixed the problem for me.
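One thing to be aware of, shown in the sketch below: because the source is declared as binary, every byte above 0x7F counts as undefined and is removed, so valid non-ASCII characters disappear along with the broken bytes. The helper name and sample string are made up.

```ruby
# Hypothetical helper wrapping the call above (non-destructive variant).
def strip_non_ascii(str)
  str.encode("UTF-8", "binary", invalid: :replace, undef: :replace, replace: "")
end

# A valid "e acute" (two bytes in UTF-8) followed by a stray 0xFF byte.
broken = "caf\u00E9 " + "\xFF" + "!"
result = strip_non_ascii(broken)   # "caf !": both the accent and the bad byte are gone
```

That is fine for pulling ASCII URLs out of markup, but it is lossy for the page text itself.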
