I have a Sinatra application (http://analyzethis.espace-technologies.com) that does the following
- Retrieve an HTML page (via net/http)
- Create a Nokogiri document from the response.body
- Extract some info and send it back in the response. The response should be UTF-8 encoded
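The pipeline above can be sketched roughly as follows. This is a minimal, self-contained sketch: the real app uses Nokogiri CSS selectors, but a naive regex stands in here so the example needs no gems, and the helper names are made up for illustration. Note that it mirrors the approach described in the question, including its encoding problem on non-UTF-8 sites.

```ruby
require "net/http"
require "uri"

# Naive stand-in for the Nokogiri CSS-selector step: pull the <title> text
# out of raw HTML (hypothetical helper, illustration only).
def extract_title(html)
  html[/<title>(.*?)<\/title>/im, 1].to_s.strip
end

# Fetch a page and return its title, re-encoded as UTF-8.
# As in the question, response.body arrives tagged ASCII-8BIT, so this
# encode call breaks as soon as the body contains non-ASCII bytes.
def fetch_title(url)
  response = Net::HTTP.get_response(URI(url))
  extract_title(response.body).encode("UTF-8")
end
```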
I ran into this problem while trying to read sites that use the windows-1256 encoding, such as www.filfan.com or www.masrawy.com.
The problem is that the result of the encoding conversion is incorrect, even though no errors are thrown.
Net::HTTP returns response.body with an encoding of ASCII-8BIT, which cannot be converted to UTF-8 directly.
If I build a document with Nokogiri::HTML(response.body) and use CSS selectors to get certain content from the page - say the content of the title tag, for example - I get a string whose string.encoding returns WINDOWS-1256. I call string.encode("utf-8") on it and send the response using that, but again the response is not correct.
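The symptom can be reproduced in plain Ruby without any HTTP involved (the byte string below is an arbitrary run of windows-1256 Arabic bytes, chosen just for illustration):

```ruby
# A windows-1256 byte string tagged as ASCII-8BIT cannot be encoded to
# UTF-8 directly, while the same bytes tagged with their real charset
# convert cleanly.
raw = "\xE3\xCD\xCA\xE6\xEC".b   # String#b => a copy tagged ASCII-8BIT

begin
  raw.encode("UTF-8")            # raises: no conversion for bytes > 0x7F
rescue Encoding::UndefinedConversionError => e
  puts e.class
end

utf8 = raw.force_encoding("windows-1256").encode("UTF-8")
puts utf8.encoding               # now a valid UTF-8 string
```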
Any suggestions or ideas about what's wrong in my approach?
The following approach is working for me now.
Net::HTTP does not handle encoding correctly; see http://bugs.ruby-lang.org/issues/2567.
Instead of scanning the whole response.body for the charset, you can parse response['content-type'], which contains it. Then use force_encoding() to set the right encoding, e.g. response.body.force_encoding("UTF-8") if the site is served in UTF-8.
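Putting that together, a sketch of the fix (charset_of and body_to_utf8 are hypothetical helper names, and the regex over the Content-Type value is an assumption about its usual shape):

```ruby
# Read the charset out of a Content-Type header value, defaulting to UTF-8
# when none is declared (hypothetical helper).
def charset_of(content_type)
  content_type.to_s[/charset=([^;\s"']+)/i, 1] || "UTF-8"
end

# Re-tag the binary (ASCII-8BIT) body with its declared charset, then
# convert it to UTF-8 (hypothetical helper).
def body_to_utf8(body, content_type)
  body.force_encoding(charset_of(content_type)).encode("UTF-8")
end

# With a real Net::HTTP response you would call:
#   body_to_utf8(response.body, response['content-type'])
```

For sites like www.filfan.com this picks up charset=windows-1256 from the header, so the later encode("UTF-8") starts from the correct source encoding instead of ASCII-8BIT.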