How to parse XML with Nokogiri without losing HTML entities

Posted 2019-07-17 07:59

Question:

If you look at the output in the AFTER section below, Ruby is removing all the HTML entities. How can I parse XML with Nokogiri without losing the HTML entities?

--- BEFORE ---

<blog:entryFull>
&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>

--- AFTER --- 

<blog:entryFull>
piframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"/iframe/p</blog:entryFull>
  </blog:example>

Here is the code:

require 'nokogiri'

# `item` holds the path of the XML file being processed
contents = File.read(item)

puts "--- BEFORE ---"
puts contents
puts "--- AFTER ---"

doc = Nokogiri::XML::DocumentFragment.parse(contents)
puts doc

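Answer 1:

Your test file probably contains invalid XML, such as an unescaped ampersand. The example below shows that a single bare "&" is enough for Nokogiri to drop the entities from the rest of the fragment.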

nokogiri.rb:

require 'nokogiri'

puts "--- INVALID ---"
invalid_xml = <<-XML
<blog:entryFull>invalid M&Ms</blog:entryFull><!-- invalid M and M's -->
<blog:entryFull>
&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
XML

doc = Nokogiri::XML::DocumentFragment.parse(invalid_xml)
puts doc

puts "--- VALID ---"
valid_xml = <<-XML
<blog:entryFull>valid M&amp;Ms</blog:entryFull><!-- valid M and M's -->
<blog:entryFull>
&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
XML

doc = Nokogiri::XML::DocumentFragment.parse(valid_xml)
puts doc

Result:

$ ruby nokogiri.rb
--- INVALID ---
<blog:entryFull>invalid M</blog:entryFull><!-- invalid M and M's -->
<blog:entryFull>
piframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"/iframe/p</blog:entryFull>
--- VALID ---
<blog:entryFull>valid M&amp;Ms</blog:entryFull><!-- valid M and M's -->
<blog:entryFull>
&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>

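So:

  1. Fix the input XML: escape every bare "&" as "&amp;".
  2. Use the STRICT parse option, so that malformed input raises an error instead of being silently repaired.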
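
For the first step, here is a minimal sketch of one way to escape bare ampersands before parsing. The helper name escape_bare_ampersands and the regular expression are illustrative assumptions, not part of Nokogiri, and the approach only suits input whose sole defect is unescaped "&" characters.

```ruby
# Hypothetical helper: escapes every "&" that does not already start a
# named (&amp;), decimal (&#38;) or hexadecimal (&#x26;) entity reference.
BARE_AMPERSAND = /&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/

def escape_bare_ampersands(xml)
  xml.gsub(BARE_AMPERSAND, '&amp;')
end

# The bare "&" in "M&Ms" gets escaped.
puts escape_bare_ampersands('<blog:entryFull>invalid M&Ms</blog:entryFull>')

# Existing entity references are left untouched.
puts escape_bare_ampersands('&lt;p&gt;a=1&amp;amp;b=2&lt;/p&gt;')
```

Note that this leaves any name-shaped reference alone, including ones XML does not define (such as &nbsp;), and it would also rewrite ampersands inside CDATA sections, so fixing the program that produces the XML is the more robust option.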

Strict parsing example:

invalid_xml = <<-XML
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <blog:entryFull>invalid M&Ms</blog:entryFull>
  <blog:entryFull>
  &lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
</root>
XML

begin
  doc = Nokogiri::XML(invalid_xml) do |configure|
    configure.strict # strict parsing
  end
  puts doc
rescue Nokogiri::XML::SyntaxError => e
  # Report why the parser rejected the input
  puts "INVALID XML: #{e.message}"
end


Answer 2:

Qambar, I am unable to recreate your issue. However, I am able to produce your desired output given these files/input:

test.xml

<blog:entryFull> &lt;p&gt;&lt;iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true%22" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>

nokogiri.rb

require 'nokogiri'

f = File.open("./test.xml")

contents = ""
f.each {|line|
  contents << line
}

puts "--- BEFORE ---"
puts contents
puts "--- AFTER ---"

doc = Nokogiri::XML::DocumentFragment.parse(contents) 
puts doc.inner_html
f.close

Console

Development/Code » ruby nokogiri.rb
--- BEFORE ---
<blog:entryFull> &lt;p&gt;&lt;iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true%22" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
--- AFTER ---
<blog:entryFull> &lt;p&gt;&lt;iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true%22" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>


Answer 3:

The workaround I used was to extract the XML tag with a regex, decode its HTML entities (using htmlentities), and then parse the result with Nokogiri's HTML parser.