If you look at the output in the AFTER section below, Ruby is removing all the HTML entities. How can I parse XML with Nokogiri without losing the HTML entities?
--- BEFORE ---
<blog:entryFull>
&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
--- AFTER ---
<blog:entryFull>
piframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"/iframe/p</blog:entryFull>
</blog:example>
Here is the code:
require 'nokogiri'

# item is the path to the XML file being processed
contents = File.read(item)

puts "--- BEFORE ---"
puts contents
puts "--- AFTER ---"
# Parse the raw text as an XML fragment and print the re-serialized result
doc = Nokogiri::XML::DocumentFragment.parse(contents)
puts doc
Qambar, I am unable to recreate your issue. However, I am able to produce your desired output given these files/input:
test.xml
nokogiri.rb
Console
The workaround I used was to extract the XML tag with a regex, decode its HTML entities with the htmlentities gem, and then parse the result with Nokogiri's HTML parser.
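A minimal sketch of the first two steps of that workaround, not the code actually used: it assumes the tag is `blog:entryFull` as in the question, and it swaps in the standard library's `CGI.unescapeHTML` for the entity decoding so that it runs without extra gems. `CGI.unescapeHTML` only handles the basic named entities (`&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;`) and numeric references, so a fuller decoder may be needed for real data. The helper name `extract_entry_html` and the sample input are hypothetical.

```ruby
require "cgi"

# Hypothetical helper: pull the raw body of <blog:entryFull> out of the XML
# text with a regex, then decode its HTML entities once.
# Returns nil when the tag is not present.
def extract_entry_html(xml)
  match = xml.match(%r{<blog:entryFull>(.*?)</blog:entryFull>}m)
  return nil unless match

  CGI.unescapeHTML(match[1])
end

# Sample input (an assumption, modeled on the question's data)
xml = '<blog:entryFull>&lt;p&gt;Tom &amp;amp; Jerry&lt;/p&gt;</blog:entryFull>'
html = extract_entry_html(xml)
puts html
```

The decoded string can then be handed to Nokogiri's HTML parser, for example with `Nokogiri::HTML.fragment(html)`.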
Your test file might have some invalid HTML entities.
nokogiri.rb:
result:
so,
strict parsing example: