I have the following XML file:
<d:entry id="a" d:title="a">
<d:index d:value="a" d:title="a"/>
<d:index d:value="b" d:title="b"/>
<d:index d:value="a" d:title="a"/>
<d:index d:value="c" d:title="c"/>
<d:index d:value="b" d:title="b"/>
<d:index d:value="a" d:title="a"/>
<d:index d:value="b" d:title="b"/>
<div>This is the content for entry.</div>
</d:entry>
<d:entry id="b" d:title="b">
<d:index d:value="a" d:title="a"/>
<d:index d:value="b" d:title="b"/>
<div>This is the content for entry.</div>
</d:entry>
(Whitespace added for readability.)
Some of the <d:index> elements within an entry are duplicates. I need to get rid of all the duplicates and keep only one copy of each unique <d:index>. The desired result looks like this:
<d:entry id="a" d:title="a">
<d:index d:value="a" d:title="a"/>
<d:index d:value="b" d:title="b"/>
<d:index d:value="c" d:title="c"/>
<div>This is the content for entry.</div>
</d:entry>
<d:entry id="b" d:title="b">
<d:index d:value="a" d:title="a"/>
<d:index d:value="b" d:title="b"/>
<div>This is the content for entry.</div>
</d:entry>
I can do this with a regex replacement in some editors, but the replacement has to be run multiple times. Is there a way to do it in Perl in a single run?
The following is a common way to filter out duplicates:
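A minimal sketch of that idiom, using only core Perl. The sample list in @values is hypothetical and stands in for whatever values need de-duplicating.

```perl
use strict;
use warnings;

# Hypothetical sample data; any list of strings works.
my @values = qw(a b a c b a b);

# %seen counts how often each value has been encountered so far.
# The post-increment returns the old count, so the grep test is true
# only the first time a value appears. Original order is preserved.
my %seen;
my @unique = grep { !$seen{$_}++ } @values;

print "@unique\n";   # prints "a b c"
```

The hash lookup makes this a single pass over the list, with no nested comparison of every element against every other.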
This can be adapted to your needs, as shown in the following snippet:
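A minimal sketch of such an adaptation, under some assumptions: the entries are wrapped in a root element (here d:dictionary) that declares the d prefix, the namespace URI is the one shown in $D_NS, and two d:index elements count as duplicates when both d:value and d:title match. The helper name dedupe_indexes is made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Assumed namespace URI for the "d" prefix; use whatever your file declares.
my $D_NS = 'http://www.apple.com/DTDs/DictionaryService-1.0.rng';

# Hypothetical helper: removes repeated d:index elements within each d:entry.
sub dedupe_indexes {
    my ($xml) = @_;

    # no_blanks drops whitespace-only text nodes so that removing
    # elements does not leave empty lines behind.
    my $doc = XML::LibXML->load_xml(string => $xml, no_blanks => 1);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(d => $D_NS);

    for my $entry ($xpc->findnodes('//d:entry')) {
        my %seen;    # reset for every entry
        for my $index ($xpc->findnodes('d:index', $entry)) {
            # Build the key from both attributes, as in the %seen idiom.
            my $key = join "\0",
                map { $index->getAttributeNS($D_NS, $_) // '' } qw(value title);
            $index->unbindNode() if $seen{$key}++;
        }
    }
    return $doc;
}

my $xml = <<'XML';
<d:dictionary xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng">
<d:entry id="a" d:title="a">
<d:index d:value="a" d:title="a"/>
<d:index d:value="b" d:title="b"/>
<d:index d:value="a" d:title="a"/>
<d:index d:value="c" d:title="c"/>
<d:index d:value="b" d:title="b"/>
<div>This is the content for entry.</div>
</d:entry>
<d:entry id="b" d:title="b">
<d:index d:value="a" d:title="a"/>
<d:index d:value="b" d:title="b"/>
<div>This is the content for entry.</div>
</d:entry>
</d:dictionary>
XML

print dedupe_indexes($xml)->toString(1);
```

Because %seen is declared inside the loop over entries, the same d:index may still appear once in each entry, which matches the desired output in the question.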
(I used my preferred parser, XML::LibXML, since you didn't mention which parser you were using.)
Anyone who knows anything about XML will tell you not to do this using regex processing, but using a proper XML parser and XML tools. It can probably be done using regular expressions (though not by me) if you know that the format of the file will always be exactly as you have shown it, e.g. with the newlines, double quotes, and attribute order exactly as in your example. But if you put this into production, then someone generating the XML is going to ask on Stack Overflow in a year's time how to ensure that they can generate XML in precisely this format, because the receiving application breaks if the attributes are in the wrong order or use single quotes rather than double quotes. So you're creating problems for the future. (Remember Postel's law, which in this case means that you should accept any well-formed XML that is equivalent to this XML.)
In any case, it's so much easier to do this in XSLT than the way you are proposing. Assuming you want both attributes to match for the element to count as a duplicate, then the code is:
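A minimal sketch of one such stylesheet, not necessarily the one this answer had in mind. It is written in XSLT 1.0 (an identity transform plus a key-based rule that drops repeats) so that it can be run from Perl through XML::LibXSLT. The wrapping d:dictionary element, the namespace URI, and the key name index-by-attrs are assumptions made for the example.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# The namespace URI bound to "d" is an assumption; it must match the input.
my $xslt_src = <<'XSLT';
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Index every d:index by its parent entry plus both attributes. -->
  <xsl:key name="index-by-attrs" match="d:index"
           use="concat(generate-id(..), '|', @d:value, '|', @d:title)"/>

  <!-- Identity rule: copy everything unchanged by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

  <!-- Drop any d:index that is not the first one with its key. -->
  <xsl:template match="d:index[generate-id() != generate-id(key('index-by-attrs',
      concat(generate-id(..), '|', @d:value, '|', @d:title))[1])]"/>
</xsl:stylesheet>
XSLT

my $xml = <<'XML';
<d:dictionary xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng">
<d:entry id="a" d:title="a">
<d:index d:value="a" d:title="a"/>
<d:index d:value="b" d:title="b"/>
<d:index d:value="a" d:title="a"/>
<d:index d:value="c" d:title="c"/>
<d:index d:value="b" d:title="b"/>
<div>This is the content for entry.</div>
</d:entry>
<d:entry id="b" d:title="b">
<d:index d:value="a" d:title="a"/>
<d:index d:value="b" d:title="b"/>
<div>This is the content for entry.</div>
</d:entry>
</d:dictionary>
XML

my $stylesheet = XML::LibXSLT->new->parse_stylesheet(
    XML::LibXML->load_xml(string => $xslt_src));
my $result = $stylesheet->transform(XML::LibXML->load_xml(string => $xml));

print $stylesheet->output_as_chars($result);
```

Including generate-id(..) in the key keeps the comparison local to each entry, so an index that appears in both entries is kept once in each.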
By the way, you said "whitespace added for readability". That whitespace, especially if it includes newlines, is going to have a major effect on any regex solution, but no effect at all on properly-written XSLT.
Using Mojo::DOM:
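A minimal sketch of how this could look, assuming that a duplicate means both d:value and d:title match and that the colon in the tag name is escaped in the CSS selector. Mojo::DOM in XML mode does not need a namespace declaration or a single root element, so the input is used as posted.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $xml = <<'XML';
<d:entry id="a" d:title="a">
<d:index d:value="a" d:title="a"/>
<d:index d:value="b" d:title="b"/>
<d:index d:value="a" d:title="a"/>
<d:index d:value="c" d:title="c"/>
<d:index d:value="b" d:title="b"/>
<d:index d:value="a" d:title="a"/>
<d:index d:value="b" d:title="b"/>
<div>This is the content for entry.</div>
</d:entry>
<d:entry id="b" d:title="b">
<d:index d:value="a" d:title="a"/>
<d:index d:value="b" d:title="b"/>
<div>This is the content for entry.</div>
</d:entry>
XML

# xml(1) turns on XML mode: case-sensitive names, self-closing tags honoured.
my $dom = Mojo::DOM->new->xml(1)->parse($xml);

for my $entry ($dom->find('d\:entry')->each) {
    my %seen;    # reset for every entry
    for my $index ($entry->find('d\:index')->each) {
        # Key on both attributes; "\0" cannot occur in attribute values.
        my $key = join "\0", map { $index->attr($_) // '' } qw(d:value d:title);
        $index->remove if $seen{$key}++;
    }
}

print $dom;
```

Removing an element leaves the newline that followed it in place, so the printed document may contain blank lines where duplicates used to be.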
Results in:
-CS switch to do this.