Inserting and deleting XML nodes and elements usin

Posted 2019-04-18 17:01

Question:

I want to extract parts of an XML file and make a note that I extracted some part in that file, like "here something was extracted".

I'm trying to do this with Nokogiri, but I can't find any documentation on how to:

  1. delete all children of a <Nokogiri::XML::Element>
  2. change the inner_text of that complete element

Any clues?

Answer 1:

Nokogiri makes this pretty easy. Using this document as an example, the following code will find all vitamins tags, remove their children (and the children's children, etc.), and change their inner text to say "Children removed.":

require 'nokogiri'

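# The block form of File.open closes the file once parsing is done.
doc = File.open('sample.xml', 'r') { |io| Nokogiri::XML(io) }

doc.search('//vitamins').each do |node|
  node.children.remove                # detach all children, nested ones included
  node.content = 'Children removed.'  # set the replacement text
end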

A given food node will go from looking like this:

<food>
    <name>Avocado Dip</name>
    <mfr>Sunnydale</mfr>
    <serving units="g">29</serving>
    <calories total="110" fat="100"/>
    <total-fat>11</total-fat>
    <saturated-fat>3</saturated-fat>
    <cholesterol>5</cholesterol>
    <sodium>210</sodium>
    <carb>2</carb>
    <fiber>0</fiber>
    <protein>1</protein>
    <vitamins>
        <a>0</a>
        <c>0</c>
    </vitamins>
    <minerals>
        <ca>0</ca>
        <fe>0</fe>
    </minerals>
</food>

to this:

<food>
    <name>Avocado Dip</name>
    <mfr>Sunnydale</mfr>
    <serving units="g">29</serving>
    <calories total="110" fat="100"/>
    <total-fat>11</total-fat>
    <saturated-fat>3</saturated-fat>
    <cholesterol>5</cholesterol>
    <sodium>210</sodium>
    <carb>2</carb>
    <fiber>0</fiber>
    <protein>1</protein>
    <vitamins>Children removed.</vitamins>
    <minerals>
        <ca>0</ca>
        <fe>0</fe>
    </minerals>
</food>


Answer 2:

The previous Nokogiri example set me in the right direction, but using doc.search left a malformed //vitamins, so I used CSS:

require "nokogiri"

# The block form of File.open closes the file once parsing is done.
doc = File.open("food.xml") { |f| Nokogiri::XML(f) }

doc.css("food vitamins").each do |node|
  puts "\r\n[debug] Before: vitamins= \r\n#{node}"
  node.children.remove
  node.content = "Children removed"
  puts "\r\n[debug] After: vitamins=\r\n#{node}"
end

Which results in:

[debug] Before: vitamins= 
<vitamins>
        <a>0</a>
        <c>0</c>
    </vitamins>

[debug] After: vitamins=
<vitamins>Children removed</vitamins>


Answer 3:

You can do it like this:

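require 'nokogiri'

doc = Nokogiri::XML(your_document)
notes = doc.search("note") # find all elements with the node name "note"
notes.remove               # detach those elements, and everything inside them, from the document

Note that this removes the <note> elements themselves, children included, rather than only emptying them. As for "changing the inner_text" of all note elements: as far as I can tell, inner_text is only a reader on a Nokogiri::XML::Element, so setting the text would go through content= instead.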



Answer 4:

Here's what I'd do:

Parse some XML first:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="nutrition.css"?>
<nutrition>

  <daily-values>
    <total-fat units="g">65</total-fat>
    <saturated-fat units="g">20</saturated-fat>
    <cholesterol units="mg">300</cholesterol>
    <sodium units="mg">2400</sodium>
    <carb units="g">300</carb>
    <fiber units="g">25</fiber>
    <protein units="g">50</protein>
  </daily-values>

  <food>
    <name>Avocado Dip</name>
    <mfr>Sunnydale</mfr>
    <serving units="g">29</serving>
    <calories total="110" fat="100"/>
    <total-fat>11</total-fat>
    <saturated-fat>3</saturated-fat>
    <cholesterol>5</cholesterol>
    <sodium>210</sodium>
    <carb>2</carb>
    <fiber>0</fiber>
    <protein>1</protein>
    <vitamins>
      <a>0</a>
      <c>0</c>
    </vitamins>
    <minerals>
      <ca>0</ca>
      <fe>0</fe>
    </minerals>
  </food>

</nutrition>
EOT

If I want to delete a node's content, I can remove its children or assign nil to its content:

doc.at('total-fat').to_xml # => "<total-fat units=\"g\">65</total-fat>"
doc.at('total-fat').children.remove
doc.at('total-fat').to_xml # => "<total-fat units=\"g\"/>"

or:

doc.at('saturated-fat').to_xml # => "<saturated-fat units=\"g\">20</saturated-fat>"
doc.at('saturated-fat').content = nil
doc.at('saturated-fat').to_xml # => "<saturated-fat units=\"g\"/>"

If I want to extract the text from a node for use some other way:

food = doc.at('food').text
# => "\n    Avocado Dip\n    Sunnydale\n    29\n    \n    11\n    3\n    5\n    210\n    2\n    0\n    1\n    \n      0\n      0\n    \n    \n      0\n      0\n    \n  "

or:

food = doc.at('food').children.map(&:text)
# => ["\n    ",
#     "Avocado Dip",
#     "\n    ",
#     "Sunnydale",
#     "\n    ",
#     "29",
#     "\n    ",
#     "",
#     "\n    ",
#     "11",
#     "\n    ",
#     "3",
#     "\n    ",
#     "5",
#     "\n    ",
#     "210",
#     "\n    ",
#     "2",
#     "\n    ",
#     "0",
#     "\n    ",
#     "1",
#     "\n    ",
#     "\n      0\n      0\n    ",
#     "\n    ",
#     "\n      0\n      0\n    ",
#     "\n  "]

or however else you want to mangle the text.

And, if you want to mark that you've removed the text:

doc.at('food').content = 'REMOVED'
doc.at('food').to_xml # => "<food>REMOVED</food>"

You could also use an XML comment instead:

doc.at('food').children = '<!-- REMOVED -->'
doc.at('food').to_xml # => "<food>\n  <!-- REMOVED -->\n</food>"


Tags: ruby nokogiri