I'm writing a crawler in Ruby (1.9) that consumes lots of HTML from a lot of random sites.
When trying to extract links, I decided to just use .scan(/href="(.*?)"/i)
instead of nokogiri/hpricot (major speedup). The problem is that I now receive a lot of "invalid byte sequence in UTF-8
" errors.
From what I understood, the net/http
library doesn't have any encoding specific options and the stuff that comes in is basically not properly tagged.
What would be the best way to actually work with that incoming data? I tried .encode
with the replace and invalid options set, but no success so far...
相关问题
- How to specify memcache server to Rack::Session::M
- Why am I getting a “C compiler cannot create execu
- reference to a method?
- ruby 1.9 wrong file encoding on windows
- gem cleanup shows error: Unable to uninstall bundl
相关文章
- Ruby using wrong version of openssl
- Difference between Thread#run and Thread#wakeup?
- how to call a active record named scope with a str
- “No explicit conversion of Symbol into String” for
- Segmentation fault with ruby 2.0.0p247 leading to
- How to detect if an element exists in Watir
- uninitialized constant Mysql2::Client::SECURE_CONN
- ruby - simplify string multiply concatenation
My current solution is to run:
This will at least get rid of the exceptions which was my main problem
If you don't "care" about the data you can just do something like:
search_params = params[:search].valid_encoding? ? params[:search].gsub(/\W+/, '') : "nothing"
I just used
valid_encoding?
to get passed it. Mine is a search field, and so i was finding the same weirdness over and over so I used something like: just to have the system not break. Since i don't control the user experience to autovalidate prior to sending this info (like auto feedback to say "dummy up!") I can just take it in, strip it out and return blank results.While Nakilon's solution works, at least as far as getting past the error, in my case, I had this weird f-ed up character originating from Microsoft Excel converted to CSV that was registering in ruby as a (get this) cyrillic K which in ruby was a bolded K. To fix this I used 'iso-8859-1' viz.
CSV.parse(f, :encoding => "iso-8859-1")
, which turned my freaky deaky cyrillic K's into a much more manageable/\xCA/
, which I could then remove withstring.gsub!(/\xCA/, '')
I've encountered string, which had mixings of English, Russian and some other alphabets, which caused exception. I need only Russian and English, and this currently works for me:
This seems to work:
The accepted answer nor the other answer work for me. I found this post which suggested
This fixed the problem for me.