After reading some of the merge posts here, my question seems simpler, but I have not been able to find the answer, so I am posting a new question.
The original XML:
<data>
<proteins>
<protein>
<accession>111</accession>
</protein>
</proteins>
<peptides>
<peptide>
<accession>111</accession>
<sequence>AAA</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>AAA</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>AAA</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>BBB</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>BBB</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>BBB</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>BBB</sequence>
</peptide>
</peptides>
</data>
The XSLT, used as an .xsl page to be interpreted by a browser:
<xsl:template match="/">
<xsl:apply-templates select="/data/proteins/protein" />
</xsl:template>
<xsl:template match="/data/proteins/protein">
<xsl:apply-templates select="/data/peptides/peptide[accession = current()/accession]" />
</xsl:template>
<xsl:template match="/data/peptides/peptide">
...
</xsl:template>
The output that I get (conceptually, since this is a simplification of larger code):
<peptide>
<accession>111</accession>
<sequence>AAA</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>AAA</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>AAA</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>BBB</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>BBB</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>BBB</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>BBB</sequence>
</peptide>
And the output that I would like to have, i.e. only one entry for each sequence, to avoid redundancy:
<peptide>
<accession>111</accession>
<sequence>AAA</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>BBB</sequence>
</peptide>
I would be happy to have just the first of the nodes that share the same sequence (so not merging them).
Any help is very welcome :)
Thanks!
What your stylesheet is missing is a way to identify the first in a group of identical items. The following stylesheet uses an xsl:key to group peptide elements by a combination of their accession and sequence values:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>
<xsl:key name="byAccSeq" match="peptide"
use="concat(accession, '|', sequence)"/>
<xsl:template match="/">
<root><xsl:apply-templates select="/*/proteins/protein"/></root>
</xsl:template>
<xsl:template match="protein">
<xsl:apply-templates
select="../../peptides/peptide[accession=current()/accession]"/>
</xsl:template>
<xsl:template match="peptide[generate-id()=
generate-id(key('byAccSeq', concat(accession, '|', sequence))[1])]">
<xsl:copy-of select="."/>
</xsl:template>
<xsl:template match="peptide"/>
</xsl:stylesheet>
Output:
<root>
<peptide>
<accession>111</accession>
<sequence>AAA</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>BBB</sequence>
</peptide>
</root>
Explanation: The following line:
<xsl:key name="byAccSeq" match="peptide"
use="concat(accession, '|', sequence)"/>
...groups peptide elements under key values equal to concat(accession, '|', sequence). The elements can later be retrieved by reproducing the key value with some peptide element as the context node:
key('byAccSeq', concat(accession, '|', sequence))
To match the first element in the list of nodes returned for some key, we use the following template/pattern:
<xsl:template match="peptide[generate-id()=
generate-id(key('byAccSeq', concat(accession, '|', sequence))[1])]">
The generate-id function returns a unique identifier for every node in the document. We're asking for any peptide element whose unique ID is equal to the unique ID of the node that's first in the list for its key.
We then ignore all other peptide elements (the ones that aren't first for their key) with the following template:
<xsl:template match="peptide"/>
This grouping technique is called the Muenchian Method. Further reading:
- http://www.jenitennison.com/xslt/grouping/muenchian.html
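If you would rather keep the three-template layout from the question, the same first-in-group test can probably be moved into the select expression, so the peptide template needs no grouping logic of its own. The following is only a sketch under a few assumptions: the root wrapper element is borrowed from the stylesheet above, and the xsl:copy-of in the peptide template is a placeholder for whatever the elided "..." in the question actually does.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>

  <!-- Same composite key as above: accession and sequence joined by '|'. -->
  <xsl:key name="byAccSeq" match="peptide"
           use="concat(accession, '|', sequence)"/>

  <xsl:template match="/">
    <!-- Assumption: a wrapper element so the result has a single root. -->
    <root><xsl:apply-templates select="/data/proteins/protein"/></root>
  </xsl:template>

  <xsl:template match="/data/proteins/protein">
    <!-- First predicate: peptides that belong to this protein.
         Second predicate: keep a peptide only if it is the first node
         returned by the key for its accession/sequence pair. -->
    <xsl:apply-templates
        select="/data/peptides/peptide[accession = current()/accession]
                [generate-id() =
                 generate-id(key('byAccSeq', concat(accession, '|', sequence))[1])]"/>
  </xsl:template>

  <!-- Placeholder body: stands in for the question's own peptide template. -->
  <xsl:template match="/data/peptides/peptide">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:stylesheet>
```

With this variant the empty peptide template is not needed, because duplicate peptide elements are never selected in the first place.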
An alternate Muenchian grouping (just one template and a single instruction):
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:key name="kPepByAccAndSeq" match="peptide"
use="concat(accession, '+', sequence)"/>
<xsl:template match="/">
<xsl:copy-of select=
"/*/peptides
/peptide
[generate-id()
=
generate-id(key('kPepByAccAndSeq',
concat(accession, '+', sequence)
)[1]
)
]
"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied to the provided XML document:
<data>
<proteins>
<protein>
<accession>111</accession>
</protein>
</proteins>
<peptides>
<peptide>
<accession>111</accession>
<sequence>AAA</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>AAA</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>AAA</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>BBB</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>BBB</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>BBB</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>BBB</sequence>
</peptide>
</peptides>
</data>
the wanted, correct result is produced:
<peptide>
<accession>111</accession>
<sequence>AAA</sequence>
</peptide>
<peptide>
<accession>111</accession>
<sequence>BBB</sequence>
</peptide>
Explanation: Muenchian grouping where the key value is a combination of the values of two elements.
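The separator inside concat() ('+' here, '|' in the other answer) matters once the data is less uniform than in the sample. A minimal sketch of why, using two keys whose names (kNoSep, kWithSep) and the example values in the comments are made up for illustration: it just counts how many groups each key produces.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical key without a separator: accession 'P12A' + sequence 'AAA'
       and accession 'P12' + sequence 'AAAA' both give 'P12AAAA',
       so two different peptides would land in the same group. -->
  <xsl:key name="kNoSep" match="peptide"
           use="concat(accession, sequence)"/>

  <!-- With a separator the same pairs give 'P12A+AAA' and 'P12+AAAA'. -->
  <xsl:key name="kWithSep" match="peptide"
           use="concat(accession, '+', sequence)"/>

  <xsl:template match="/">
    <!-- Count the peptides that are first in their group under each key. -->
    <xsl:text>groups without separator: </xsl:text>
    <xsl:value-of select="count(//peptide[generate-id() =
        generate-id(key('kNoSep', concat(accession, sequence))[1])])"/>
    <xsl:text>&#10;groups with separator: </xsl:text>
    <xsl:value-of select="count(//peptide[generate-id() =
        generate-id(key('kWithSep', concat(accession, '+', sequence))[1])])"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

For the document in the question the two counts agree, since every accession is 111. They would only differ for input where the concatenated values collide, so the separator should be a character that cannot occur in either element.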