How can I remove duplicate XML nodes using Ruby?

Posted 2020-06-24 06:33

Question:

Suppose I have this structure:

<one>
   <two>
     <three>3</three>
   </two>

   <two>
     <three>4</three>
   </two>

   <two>
     <three>3</three>
   </two>
</one>

Is there any way to get to this:

<one>
  <two>
    <three>3</three>
  </two>

  <two>
    <three>4</three>
  </two>

</one>

using Ruby's libraries? I managed to do this with Nokogiri. From my tests it appears to work, but maybe there's a better approach.

Answer 1:

How about one that does the whole thing in two lines?

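seen = Hash.new(0)
# Visit elements in document order (parents before children), so a repeated
# subtree is removed as a whole instead of having its text nodes unlinked first.
node.xpath('.//*').each {|n| n.unlink if (seen[n.to_xml] += 1) > 1}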

If there's a possibility of the same node appearing under two different parents, and you don't want those to be considered duplicates, you can change that second line to:

node.xpath('.//*').each {|n| n.unlink if (seen[(n.parent.path rescue "") + n.to_xml] += 1) > 1}
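
For a self-contained check, here is a minimal sketch of the same seen-hash idea applied to the question's sample. It uses REXML (bundled with Ruby) rather than Nokogiri so it runs without extra gems, and it only compares the <two> children of the root. Both choices are assumptions made for illustration, not part of the answer above.

```ruby
require 'rexml/document'

# Sample from the question, written compactly so that identical
# subtrees also serialize to identical strings.
xml = <<~XML
  <one>
    <two><three>3</three></two>
    <two><three>4</three></two>
    <two><three>3</three></two>
  </one>
XML

doc = REXML::Document.new(xml)

# Count each serialized <two> subtree; a count above 1 marks a duplicate.
seen = Hash.new(0)

# to_a snapshots the matching children first, so removing elements
# inside the loop does not disturb the iteration.
doc.root.elements.to_a('two').each do |el|
  el.remove if (seen[el.to_s] += 1) > 1
end

# Collect the text of each surviving <three> to inspect the result.
remaining = doc.root.elements.to_a('two').map { |el| el.elements['three'].text }
puts remaining.inspect
```

The comparison is purely textual, so two subtrees that differ only in whitespace or attribute order would count as distinct here. Normalizing the key before counting is one way to loosen that.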


Answer 2:

This page explains a little about XML parsing in Ruby: http://developer.yahoo.com/ruby/ruby-xml.html

This page explains some of the reasons why you want to use a proper parser over something like regular expressions: http://htmlparsing.icenine.ca

At a glance, the approach you're using doesn't seem horrible.



Tags: xml ruby