Perl remove duplicate XML tags

Posted 2019-08-25 04:27

I have the following XML file:

<d:entry id="a" d:title="a">
  <d:index d:value="a" d:title="a"/>
  <d:index d:value="b" d:title="b"/>
  <d:index d:value="a" d:title="a"/>
  <d:index d:value="c" d:title="c"/>
  <d:index d:value="b" d:title="b"/>
  <d:index d:value="a" d:title="a"/>
  <d:index d:value="b" d:title="b"/>
  <div>This is the content for entry.</div>
</d:entry>
<d:entry id="b" d:title="b">
  <d:index d:value="a" d:title="a"/>
  <d:index d:value="b" d:title="b"/>
  <div>This is the content for entry.</div>
</d:entry>

(Whitespace added for readability.)

Some of the <d:index> elements are duplicated. I need to remove the duplicates and keep only one copy of each unique <d:index> within an entry. The desired result looks like this:

<d:entry id="a" d:title="a">
   <d:index d:value="a" d:title="a"/>
   <d:index d:value="b" d:title="b"/>
   <d:index d:value="c" d:title="c"/>
   <div>This is the content for entry.</div>
</d:entry>
<d:entry id="b" d:title="b">
  <d:index d:value="a" d:title="a"/>
  <d:index d:value="b" d:title="b"/>
  <div>This is the content for entry.</div>
</d:entry>

I can do this with a regex replacement in some editors, but the replacement has to be run multiple times. I was wondering whether Perl has a way to do this in one run.

3 Answers
霸刀☆藐视天下
#2 · 2019-08-25 04:50

The following is a common way to filter out duplicates:

my %seen;
my @filtered = grep { !$seen{$_}++ } @unfiltered;
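To see why the idiom works: the post-increment returns the count from before the increment, so the negated test is true only the first time a value shows up. A minimal sketch, using made-up sample data that is not from the question:

```perl
use strict;
use warnings;

# Made-up sample data for illustration only.
my @unfiltered = qw(a b a c b a b);

my %seen;
# $seen{$_}++ yields the previous count (undef on first sight), so the
# negation is true only the first time each value is encountered.
my @filtered = grep { !$seen{$_}++ } @unfiltered;

print "@filtered\n";    # prints "a b c"
```

The first occurrence of each value is kept and the original order is preserved; %seen ends up holding how many times each value occurred.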

This can be adapted to your needs, as shown in the following snippet:

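# $xpc is an XML::LibXML::XPathContext with the d: prefix registered;
# $entry_node is a single <d:entry> element.
my %seen;   # declare per entry, so duplicates are only tracked within one entry
for my $index_node ($xpc->findnodes('d:index', $entry_node)) {
   my $value = $xpc->findvalue('@d:value', $index_node);
   my $title = $xpc->findvalue('@d:title', $index_node);
   if ($seen{$value}{$title}++) {
      # This value/title pair was seen before: detach the node from the tree.
      $index_node->unbindNode();
   }
}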

(I used my preferred parser, XML::LibXML, since you didn't mention which parser you were using.)
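For context, here is one way the snippet might sit inside a complete script. This is a sketch under assumptions the question does not confirm: that the real file has a single root element declaring the d: prefix, and that the root name (d:dictionary) and namespace URI used here match it. Both are placeholders to replace with whatever your file actually declares.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# ASSUMPTION: root element name and namespace URI are placeholders;
# the question shows neither. Use the ones from your real file.
my $ns_uri = 'http://www.apple.com/DTDs/DictionaryService-1.0.rng';
my $xml = <<"XML";
<d:dictionary xmlns:d="$ns_uri">
<d:entry id="a" d:title="a">
  <d:index d:value="a" d:title="a"/>
  <d:index d:value="b" d:title="b"/>
  <d:index d:value="a" d:title="a"/>
  <d:index d:value="c" d:title="c"/>
  <div>This is the content for entry.</div>
</d:entry>
</d:dictionary>
XML

my $doc = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(d => $ns_uri);

for my $entry_node ($xpc->findnodes('//d:entry')) {
    my %seen;    # fresh hash per entry: duplicates are detected within one entry only
    for my $index_node ($xpc->findnodes('d:index', $entry_node)) {
        my $value = $xpc->findvalue('@d:value', $index_node);
        my $title = $xpc->findvalue('@d:title', $index_node);
        # Detach every repeat of an already-seen value/title pair.
        $index_node->unbindNode() if $seen{$value}{$title}++;
    }
}

# Collect what is left, then print the modified document.
my @remaining = $xpc->findnodes('//d:index');
print $doc->toString();
```

For a real file, load_xml(location => $path) would replace the inline string. The whitespace text nodes that surrounded the removed elements stay in the output, so blank lines may remain where duplicates used to be.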

霸刀☆藐视天下
#3 · 2019-08-25 05:01

Anyone who knows anything about XML will tell you not to do this with regex processing, but with a proper XML parser and XML tools. It can probably be done using regular expressions (though not by me) if you know that the format of the file will always be exactly as you have shown it, e.g. with the newlines, double quotes, and attribute order exactly as in your example. But if you put this into production, then someone generating the XML is going to ask on Stack Overflow in a year's time how to ensure that they can generate XML in precisely this format, because the receiving application breaks if the attributes are in the wrong order or use single quotes rather than double quotes. So you're creating problems for the future. (Remember Postel's law, which in this case means that you should accept any well-formed XML that is equivalent to this XML.)

In any case, it's so much easier to do this in XSLT than the way you are proposing. Assuming you want both attributes to match for the element to count as a duplicate, the code (XSLT 2.0 or later, since it uses xsl:for-each-group) is:

<xsl:template match="d:entry">
<xsl:copy>
  <xsl:for-each-group select="d:index" 
                      group-by="concat(@d:value, '~', @d:title)">
     <xsl:copy-of select="current-group()[1]"/>
  </xsl:for-each-group>
  <xsl:copy-of select="div"/>
</xsl:copy>
</xsl:template>

By the way, you said "whitespace added for readability". That whitespace, especially if it includes newlines, is going to have a major effect on any regex solution, but no effect at all on properly-written XSLT.

甜甜的少女心
#4 · 2019-08-25 05:07

Using Mojo::DOM:

perl -MMojo::DOM -0777 -E'my $dom = Mojo::DOM->new->xml(1)->parse(<>);
  $dom->find(q{d\\:entry})->each(sub { my %seen;
    $_->find(q{d\\:index})->each(sub {
      $_->remove if $seen{$_->{"d:value"}}{$_->{"d:title"}}++ }) });
  print $dom->to_string' input.xml

Results in:

<d:entry d:title="a" id="a">
  <d:index d:title="a" d:value="a" />
  <d:index d:title="b" d:value="b" />

  <d:index d:title="c" d:value="c" />



  <div>This is the content for entry.</div>
</d:entry>
<d:entry d:title="b" id="b">
  <d:index d:title="a" d:value="a" />
  <d:index d:title="b" d:value="b" />
  <div>This is the content for entry.</div>
</d:entry>
  • If the actual content doesn't have such whitespace, it won't be left over after removing the tags. Otherwise a little more logic can remove the whitespace text nodes.
  • I would use ojo (the Mojolicious one-liner module) for this, but it doesn't have a shortcut for XML-mode parsing.
  • If the XML contains any non-ASCII characters you will need to decode it on STDIN and encode it on STDOUT according to its encoding; if it's the usual UTF-8, you can use the -CS switch to do this.
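
The "little more logic" for the leftover whitespace could look something like the sketch below. It is one possible approach, not necessarily what the answer's author had in mind: before removing a duplicate, it also removes the whitespace-only text node directly in front of it. The sample input is a shortened version of the question's data.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Shortened sample input based on the question's data.
my $xml = <<'XML';
<d:entry id="a" d:title="a">
  <d:index d:value="a" d:title="a"/>
  <d:index d:value="b" d:title="b"/>
  <d:index d:value="a" d:title="a"/>
  <div>This is the content for entry.</div>
</d:entry>
XML

my $dom = Mojo::DOM->new->xml(1)->parse($xml);
$dom->find('d\:entry')->each(sub {
    my %seen;    # fresh hash per entry
    $_->find('d\:index')->each(sub {
        my $node = shift;
        # First occurrence of this value/title pair: keep it.
        return unless $seen{ $node->{'d:value'} }{ $node->{'d:title'} }++;
        # Drop the whitespace-only text node in front of the duplicate, if any.
        my $prev = $node->previous_node;
        $prev->remove if $prev && $prev->type eq 'text' && $prev->content !~ /\S/;
        $node->remove;
    });
});

my $out = $dom->to_string;
print $out;
```

Removing the preceding text node (the newline plus indentation) rather than the following one keeps the indentation of the next sibling intact.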