I want to extract parts of an XML file and leave a note in that file where a part was extracted, like "here something was extracted".
I'm trying to do this with Nokogiri, but I can't find documentation on how to:
- delete all children of a <Nokogiri::XML::Element>
- change the inner_text of that whole element
Any clues?
Nokogiri makes this pretty easy. Using this document as an example, the following code will find all vitamins tags, remove their children (and the children's children, etc.), and change their inner text to say "Children removed.":
require 'nokogiri'
io = File.open('sample.xml', 'r')
doc = Nokogiri::XML(io)
io.close
doc.search('//vitamins').each do |node|
  node.children.remove
  node.content = 'Children removed.'
end
A given food node will go from looking like this:
<food>
<name>Avocado Dip</name>
<mfr>Sunnydale</mfr>
<serving units="g">29</serving>
<calories total="110" fat="100"/>
<total-fat>11</total-fat>
<saturated-fat>3</saturated-fat>
<cholesterol>5</cholesterol>
<sodium>210</sodium>
<carb>2</carb>
<fiber>0</fiber>
<protein>1</protein>
<vitamins>
<a>0</a>
<c>0</c>
</vitamins>
<minerals>
<ca>0</ca>
<fe>0</fe>
</minerals>
</food>
to this:
<food>
<name>Avocado Dip</name>
<mfr>Sunnydale</mfr>
<serving units="g">29</serving>
<calories total="110" fat="100"/>
<total-fat>11</total-fat>
<saturated-fat>3</saturated-fat>
<cholesterol>5</cholesterol>
<sodium>210</sodium>
<carb>2</carb>
<fiber>0</fiber>
<protein>1</protein>
<vitamins>Children removed.</vitamins>
<minerals>
<ca>0</ca>
<fe>0</fe>
</minerals>
</food>
The previous Nokogiri example set me in the right direction, but using doc.search left a malformed //vitamins, so I used CSS:
require "rubygems"
require "nokogiri"
f = File.open("food.xml")
doc = Nokogiri::XML(f)
doc.css("food vitamins").each do |node|
  puts "\r\n[debug] Before: vitamins= \r\n#{node}"
  node.children.remove
  node.content = "Children removed"
  puts "\r\n[debug] After: vitamins=\r\n#{node}"
end
f.close
Which results in:
[debug] Before: vitamins=
<vitamins>
<a>0</a>
<c>0</c>
</vitamins>
[debug] After: vitamins=
<vitamins>Children removed</vitamins>
You can do it like this:
doc = Nokogiri::XML(your_document)
note = doc.search("note") # find all tags with the node_name "note"
note.remove
While that would remove every <note> tag together with all the children inside it, I am not sure how to "change the inner_text" of all note elements. I think inner_text is not applicable for a Nokogiri::XML::Element.
Here's what I'd do:
Parse some XML first:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="nutrition.css"?>
<nutrition>
<daily-values>
<total-fat units="g">65</total-fat>
<saturated-fat units="g">20</saturated-fat>
<cholesterol units="mg">300</cholesterol>
<sodium units="mg">2400</sodium>
<carb units="g">300</carb>
<fiber units="g">25</fiber>
<protein units="g">50</protein>
</daily-values>
<food>
<name>Avocado Dip</name>
<mfr>Sunnydale</mfr>
<serving units="g">29</serving>
<calories total="110" fat="100"/>
<total-fat>11</total-fat>
<saturated-fat>3</saturated-fat>
<cholesterol>5</cholesterol>
<sodium>210</sodium>
<carb>2</carb>
<fiber>0</fiber>
<protein>1</protein>
<vitamins>
<a>0</a>
<c>0</c>
</vitamins>
<minerals>
<ca>0</ca>
<fe>0</fe>
</minerals>
</food>
</nutrition>
EOT
If I want to delete a node's content, I can remove its children or assign nil to its content:
doc.at('total-fat').to_xml # => "<total-fat units=\"g\">65</total-fat>"
doc.at('total-fat').children.remove
doc.at('total-fat').to_xml # => "<total-fat units=\"g\"/>"
or:
doc.at('saturated-fat').to_xml # => "<saturated-fat units=\"g\">20</saturated-fat>"
doc.at('saturated-fat').content = nil
doc.at('saturated-fat').to_xml # => "<saturated-fat units=\"g\"/>"
If I want to extract the text from a node to use in some other way:
food = doc.at('food').text
# => "\n Avocado Dip\n Sunnydale\n 29\n \n 11\n 3\n 5\n 210\n 2\n 0\n 1\n \n 0\n 0\n \n \n 0\n 0\n \n "
or:
food = doc.at('food').children.map(&:text)
# => ["\n ",
# "Avocado Dip",
# "\n ",
# "Sunnydale",
# "\n ",
# "29",
# "\n ",
# "",
# "\n ",
# "11",
# "\n ",
# "3",
# "\n ",
# "5",
# "\n ",
# "210",
# "\n ",
# "2",
# "\n ",
# "0",
# "\n ",
# "1",
# "\n ",
# "\n 0\n 0\n ",
# "\n ",
# "\n 0\n 0\n ",
# "\n "]
or however else you want to mangle the text.
And, if you want to mark that you've removed the text:
doc.at('food').content = 'REMOVED'
doc.at('food').to_xml # => "<food>REMOVED</food>"
You could also use an XML comment instead:
doc.at('food').children = '<!-- REMOVED -->'
doc.at('food').to_xml # => "<food>\n <!-- REMOVED -->\n</food>"