I'm scraping an HTML page with Nokogiri and I want to strip out all style attributes.
How can I achieve this? (I'm not using Rails, so I can't use its sanitize method, and I don't want to use the sanitize gem because I want to remove by blacklist, not by whitelist.)
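require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where a bare open(url) no longer fetches URLs
html = URI.open(url)
doc = Nokogiri::HTML(html.read)
doc.css('.post').each do |post|
  puts post.to_s
end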
=> <p><span style="font-size: x-large">bla bla <a href="http://torrentfreak.com/netflix-is-killing-bittorrent-in-the-us-110427/">statistica</a> blabla</span></p>
I want it to be
=> <p><span>bla bla <a href="http://torrentfreak.com/netflix-is-killing-bittorrent-in-the-us-110427/">statistica</a> blabla</span></p>
Edited to show that you can just call NodeSet#remove instead of having to use .each(&:remove).
Note that if you have a DocumentFragment instead of a Document, Nokogiri has a longstanding bug where searching from a fragment does not work as you would expect. The workaround is to use:
I tried the answer from Phrogz but could not get it to work (I was using a document fragment, though I'd have thought it should work the same).
The "//" at the start didn't seem to check all nodes as I would expect. In the end I did something a bit more long-winded, but it worked, so here for the record, in case anyone else has the same trouble, is my solution (dirty though it is):
Cheers
Pete
This works with both a document and a document fragment:
or
To delete all the 'style' attributes, you can do a