I am using open-uri to read a webpage which claims to be encoded in iso-8859-1. When I read the contents of the page, open-uri returns a string encoded in ASCII-8BIT.
open("http://www.nigella.com/recipes/view/DEVILS-FOOD-CAKE-5310") {|f| p f.content_type, f.charset, f.read.encoding }
=> ["text/html", "iso-8859-1", #<Encoding:ASCII-8BIT>]
I am guessing this is because the webpage contains the byte (or character) \x92, which is not a valid iso-8859-1 character (see http://en.wikipedia.org/wiki/ISO/IEC_8859-1).
I need to store webpages as utf-8 encoded files. Any ideas on how to deal with webpages whose declared encoding is incorrect? I could catch the exception and try to guess the correct encoding, but that seems cumbersome and error-prone.
ASCII-8BIT is an alias for BINARY.

open-uri does a funny thing: if the file is less than 10kb (or something like that), it returns a String, and if it's bigger it returns a StringIO. That can be confusing if you're trying to deal with encoding issues.

If the files aren't huge, I would recommend manually loading them into strings:
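A minimal sketch of that manual load, using Net::HTTP from the standard library (the URL is the one from the question; the helper name fetch_body is made up here):

```ruby
require 'uri'
require 'net/http'

# Net::HTTP always hands you the response body as a plain String
# (never a StringIO), so there is no size-dependent surprise like
# with open-uri.
def fetch_body(url)
  Net::HTTP.get_response(URI.parse(url)).body
end

# body = fetch_body("http://www.nigella.com/recipes/view/DEVILS-FOOD-CAKE-5310")
```

Note that get_response does not follow redirects, so for anything fancier you would want a loop or a higher-level client.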
Then you can use the ensure-encoding gem (https://rubygems.org/gems/ensure-encoding).
I have been pretty happy with ensure-encoding... we use it in production at http://data.brighterplanet.com.

Note that you can also say :invalid_characters => :ignore instead of :transcode. Also, if you know the encoding somehow, you can pass :external_encoding => 'ISO-8859-1' instead of :sniff.