Opening iso-8859-1 encoded HTML with Nokogiri messes up the accents

Posted 2019-08-02 07:57

Question:

I'm trying to make some changes to an html page encoded with charset=iso-8859-1

doc = Nokogiri::HTML(open(html_file))

Calling puts doc.to_html messes up all the accents in the page, so if I save it back it looks broken in the browser as well.

I'm still on Rails 3.0.6... Any hints how to fix this problem?

Here's one of the pages suffering from that for example: http://www.elmundo.es/accesible/elmundo/2012/03/07/solidaridad/1331108705.html

I've also asked on GitHub, but I have the feeling this will be faster. I'll update both places if I find a cure for the problem.

UPDATE 1 24 March 2012

Thanks for the comments. I managed to partially solve this issue. I believe it has nothing to do with Nokogiri, however. As I mentioned in a comment, simply opening and saving the file is enough to mess up the accents.

The closest to a fix I got is doing this:

thefile = File.open(html_file, "r")
text = thefile.read
thefile.close # release the handle before the same file is reopened for writing
doc = Nokogiri::HTML(text)
# ... do any stuff with nokogiri
File.open(html_file, 'w') {|f| f.write(doc.to_html) }

The original file came in iso-8859-1 and the saved one goes out in utf-8, and it looks pretty much OK. Accents are in place, except for the accents in the capital letters :-P I get question marks, like in Econom�a, where there should be an í (i with an accent).
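To illustrate where a question mark like the one in Econom�a can come from, here is a minimal sketch using only Ruby's standard String methods, with no Nokogiri involved. It assumes a reasonably recent Ruby (String#scrub needs 2.1 or later, which is newer than what Rails 3.0.6 setups usually run), and the sample string is made up for the demonstration. It shows that the single iso-8859-1 byte for í is not valid utf-8 on its own, and that declaring the real source encoding before transcoding keeps the accent.

```ruby
# Bytes for "Economía" as an iso-8859-1 file stores them: í is the single byte 0xED.
latin1_bytes = "Econom\xEDa".b

# Mislabeling those bytes as UTF-8 gives an invalid string.
mislabeled = latin1_bytes.dup.force_encoding(Encoding::UTF_8)
puts mislabeled.valid_encoding?   # false
puts mislabeled.scrub("\uFFFD")   # Econom�a

# Declaring the real source encoding first, then transcoding, keeps the accent.
converted = latin1_bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
puts converted                    # Economía
```

Whether this is exactly what happens inside the pipeline above is an assumption; it only demonstrates the general mechanism behind a replacement character.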

Getting closer I think. If someone has a hint to cover the capital letters as well it might be almost done.

Cheers.

Answer 1:

The method you used to download the file may have changed the encoding, breaking the accents in the file. Try this to see it working correctly:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = 'http://www.elmundo.es/accesible/elmundo/2012/03/07/solidaridad/1331108705.html'
# open-uri lets open() fetch a URL; on Ruby 3.0 and later, call URI.open(url) instead
doc = Nokogiri::HTML(open(url))
# write the serialized document to a local file
File.open("1331108705.html", "w") {|f| f.write(doc.to_html)}
system('open', '1331108705.html') # on Mac OS X, this will open the html file in your browser

How did you download the file?
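If the file is already on disk, one option is to tell Ruby its encoding when reading it, so nothing has to guess. The sketch below uses only the standard library; the temporary directory, the page.html name, and the sample markup are hypothetical stand-ins for the real file. Nokogiri::HTML also accepts an encoding as its third argument, for example Nokogiri::HTML(text, nil, 'ISO-8859-1'), which may be worth trying on the raw text instead.

```ruby
require 'tmpdir'
require 'fileutils'

dir  = Dir.mktmpdir
path = File.join(dir, 'page.html') # hypothetical file name

# Simulate an iso-8859-1 page on disk (0xED is í in that encoding).
File.binwrite(path, "<p>Econom\xEDa</p>".b)

# "r:ISO-8859-1:UTF-8": the file is ISO-8859-1, transcode to UTF-8 while reading.
text = File.open(path, 'r:ISO-8859-1:UTF-8') { |f| f.read }
puts text.encoding # UTF-8
puts text          # <p>Economía</p>

# ... parse and modify the text here

# Write the result back out as UTF-8.
File.open(path, 'w:UTF-8') { |f| f.write(text) }
saved_bytes = File.binread(path)

FileUtils.remove_entry(dir) # clean up the temporary directory
```

Keep in mind that if the saved page still declares charset=iso-8859-1 in a meta tag, a browser may read the utf-8 bytes as iso-8859-1 and show broken accents again, so the declaration and the actual bytes need to agree.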