To start off, I'm aware of this SO question, which is a bit different.
I have an XML file which looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<Document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.03">
<CstmrCdtTrfInitn>
<GrpHdr>
<MsgId>123</MsgId>
<CreDtTm>321</CreDtTm>
<NbOfTxs>10</NbOfTxs>
<CtrlSum>18700.68</CtrlSum>
<InitgPty>
<Nm>some info</Nm>
</InitgPty>
</GrpHdr>
<PmtInf>
<!-- start -->
<PmtInfId>asd</PmtInfId>
<PmtMtd>TRF</PmtMtd>
<BtchBookg>false</BtchBookg>
<PmtTpInf>
<InstrPrty>NORM</InstrPrty>
<SvcLvl>
<Prtry>test</Prtry>
</SvcLvl>
</PmtTpInf>
<ReqdExctnDt>date</ReqdExctnDt>
<Dbtr>
<Nm>something</Nm>
<PstlAdr>
<AdrLine>addr 1</AdrLine>
</PstlAdr>
</Dbtr>
<!-- end -->
<CdtTrfTxInf>
<PmtId>
<InstrId>16082672122</InstrId>
<EndToEndId>16082672122</EndToEndId>
</PmtId>
<Amt>
<InstdAmt Ccy="RON">2159.41</InstdAmt>
</Amt>
<CdtrAgt>
<FinInstnId>
<BIC>some bic</BIC>
</FinInstnId>
</CdtrAgt>
</CdtTrfTxInf>
</PmtInf>
<PmtInf>
<!-- start -->
<PmtInfId>asd</PmtInfId>
<PmtMtd>TRF</PmtMtd>
<BtchBookg>false</BtchBookg>
<PmtTpInf>
<InstrPrty>NORM</InstrPrty>
<SvcLvl>
<Prtry>test</Prtry>
</SvcLvl>
</PmtTpInf>
<ReqdExctnDt>date</ReqdExctnDt>
<Dbtr>
<Nm>something</Nm>
<PstlAdr>
<AdrLine>addr 1</AdrLine>
</PstlAdr>
</Dbtr>
<!-- end -->
<CdtTrfTxInf>
<PmtId>
<InstrId>16082672122</InstrId>
<EndToEndId>16082672122</EndToEndId>
</PmtId>
<Amt>
<InstdAmt Ccy="RON">2159.41</InstdAmt>
</Amt>
<CdtrAgt>
<FinInstnId>
<BIC>some bic</BIC>
</FinInstnId>
</CdtrAgt>
</CdtTrfTxInf>
</PmtInf>
<PmtInf>
<!-- start -->
<PmtInfId>asd</PmtInfId>
<PmtMtd>TRF</PmtMtd>
<BtchBookg>false</BtchBookg>
<PmtTpInf>
<InstrPrty>NORM</InstrPrty>
<SvcLvl>
<Prtry>test</Prtry>
</SvcLvl>
</PmtTpInf>
<ReqdExctnDt>date</ReqdExctnDt>
<Dbtr>
<Nm>something</Nm>
<PstlAdr>
<AdrLine>addr 1</AdrLine>
</PstlAdr>
</Dbtr>
<!-- end -->
<CdtTrfTxInf>
<PmtId>
<InstrId>16082672122</InstrId>
<EndToEndId>16082672122</EndToEndId>
</PmtId>
<Amt>
<InstdAmt Ccy="RON">2159.41</InstdAmt>
</Amt>
<CdtrAgt>
<FinInstnId>
<BIC>some bic</BIC>
</FinInstnId>
</CdtrAgt>
</CdtTrfTxInf>
</PmtInf>
<PmtInf>
<!-- start -->
<PmtInfId>asd</PmtInfId>
<PmtMtd>TRF</PmtMtd>
<BtchBookg>false</BtchBookg>
<PmtTpInf>
<InstrPrty>NORM</InstrPrty>
<SvcLvl>
<Prtry>test</Prtry>
</SvcLvl>
</PmtTpInf>
<ReqdExctnDt>date</ReqdExctnDt>
<Dbtr>
<Nm>something</Nm>
<PstlAdr>
<AdrLine>addr 1</AdrLine>
</PstlAdr>
</Dbtr>
<!-- end -->
<CdtTrfTxInf>
<PmtId>
<InstrId>16082672122</InstrId>
<EndToEndId>16082672122</EndToEndId>
</PmtId>
<Amt>
<InstdAmt Ccy="RON">2159.41</InstdAmt>
</Amt>
<CdtrAgt>
<FinInstnId>
<BIC>some bic</BIC>
</FinInstnId>
</CdtrAgt>
</CdtTrfTxInf>
</PmtInf>
</CstmrCdtTrfInitn>
</Document>
As you can see, I have multiple (4) <PmtInf></PmtInf> sections which have almost the same structure. What I'd like to do is:
- Compare <PmtInfId>asd</PmtInfId> from the first PmtInf with <PmtInfId>asd</PmtInfId> from the second PmtInf. If there's a perfect match (the same tag and the same text), move on to the next elements and compare them (<PmtMtd>TRF</PmtMtd> from the first PmtInf with <PmtMtd>TRF</PmtMtd> from the second PmtInf, and so on), continuing as long as everything matches until we reach the <CdtTrfTxInf> tag.
- When we reach <CdtTrfTxInf>, it means the first part of the first PmtInf is the same as the first part of the second PmtInf. At this point, move <CdtTrfTxInf></CdtTrfTxInf> from the second PmtInf right after the <CdtTrfTxInf></CdtTrfTxInf> section of the first PmtInf. Then remove the second PmtInf section.
So, at this moment, the xml would look like this:
<?xml version="1.0" encoding="UTF-8"?>
<Document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.03">
<CstmrCdtTrfInitn>
<GrpHdr>
<MsgId>123</MsgId>
<CreDtTm>321</CreDtTm>
<NbOfTxs>10</NbOfTxs>
<CtrlSum>18700.68</CtrlSum>
<InitgPty>
<Nm>some info</Nm>
</InitgPty>
</GrpHdr>
<PmtInf>
<!-- start -->
<PmtInfId>asd</PmtInfId>
<PmtMtd>TRF</PmtMtd>
<BtchBookg>false</BtchBookg>
<PmtTpInf>
<InstrPrty>NORM</InstrPrty>
<SvcLvl>
<Prtry>test</Prtry>
</SvcLvl>
</PmtTpInf>
<ReqdExctnDt>date</ReqdExctnDt>
<Dbtr>
<Nm>something</Nm>
<PstlAdr>
<AdrLine>addr 1</AdrLine>
</PstlAdr>
</Dbtr>
<!-- end -->
<CdtTrfTxInf>
<PmtId>
<InstrId>16082672122</InstrId>
<EndToEndId>16082672122</EndToEndId>
</PmtId>
<Amt>
<InstdAmt Ccy="RON">2159.41</InstdAmt>
</Amt>
<CdtrAgt>
<FinInstnId>
<BIC>some bic</BIC>
</FinInstnId>
</CdtrAgt>
</CdtTrfTxInf>
<CdtTrfTxInf>
<PmtId>
<InstrId>16082672122</InstrId>
<EndToEndId>16082672122</EndToEndId>
</PmtId>
<Amt>
<InstdAmt Ccy="RON">2159.41</InstdAmt>
</Amt>
<CdtrAgt>
<FinInstnId>
<BIC>some bic</BIC>
</FinInstnId>
</CdtrAgt>
</CdtTrfTxInf>
</PmtInf>
<PmtInf>
<!-- start -->
<PmtInfId>qwe</PmtInfId>
<PmtMtd>TRF</PmtMtd>
<BtchBookg>false</BtchBookg>
<PmtTpInf>
<InstrPrty>HIGH</InstrPrty>
<SvcLvl>
<Prtry>test</Prtry>
</SvcLvl>
</PmtTpInf>
<ReqdExctnDt>date</ReqdExctnDt>
<Dbtr>
<Nm>something</Nm>
<PstlAdr>
<AdrLine>addr 1</AdrLine>
</PstlAdr>
</Dbtr>
<!-- end -->
<CdtTrfTxInf>
<PmtId>
<InstrId>16082672122</InstrId>
<EndToEndId>16082672122</EndToEndId>
</PmtId>
<Amt>
<InstdAmt Ccy="RON">2159.41</InstdAmt>
</Amt>
<CdtrAgt>
<FinInstnId>
<BIC>some bic</BIC>
</FinInstnId>
</CdtrAgt>
</CdtTrfTxInf>
</PmtInf>
<PmtInf>
<!-- start -->
<PmtInfId>asd</PmtInfId>
<PmtMtd>TRF</PmtMtd>
<BtchBookg>false</BtchBookg>
<PmtTpInf>
<InstrPrty>NORM</InstrPrty>
<SvcLvl>
<Prtry>test</Prtry>
</SvcLvl>
</PmtTpInf>
<ReqdExctnDt>date</ReqdExctnDt>
<Dbtr>
<Nm>something</Nm>
<PstlAdr>
<AdrLine>addr 1</AdrLine>
</PstlAdr>
</Dbtr>
<!-- end -->
<CdtTrfTxInf>
<PmtId>
<InstrId>16082672122</InstrId>
<EndToEndId>16082672122</EndToEndId>
</PmtId>
<Amt>
<InstdAmt Ccy="RON">2159.41</InstdAmt>
</Amt>
<CdtrAgt>
<FinInstnId>
<BIC>some bic</BIC>
</FinInstnId>
</CdtrAgt>
</CdtTrfTxInf>
</PmtInf>
</CstmrCdtTrfInitn>
</Document>
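To make the "perfect match" rule concrete, here is a minimal Python sketch of just the comparison step. It assumes the standard library's xml.etree.ElementTree rather than lxml (its default parser drops comments such as <!-- start -->, which keeps the comparison simple). The names canonical and header_signature are hypothetical helpers made up for this illustration, not part of any library.

```python
import xml.etree.ElementTree as ET

# Default namespace declared on <Document> in the sample file.
NS = "urn:iso:std:iso:20022:tech:xsd:pain.001.001.03"
TX_TAG = f"{{{NS}}}CdtTrfTxInf"


def canonical(elem):
    """Return a hashable snapshot of a subtree: tag, attributes, text, children.

    Surrounding whitespace in text is ignored so that indentation
    differences do not count as a mismatch.
    """
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        (elem.text or "").strip(),
        tuple(canonical(child) for child in elem),
    )


def header_signature(pmt_inf):
    """Snapshot every child of a PmtInf that comes before its first CdtTrfTxInf."""
    header = []
    for child in pmt_inf:
        if child.tag == TX_TAG:
            break  # the transactions start here; the header part is over
        header.append(canonical(child))
    return tuple(header)
```

Two PmtInf sections would then count as a match when header_signature(first) == header_signature(second). With lxml the comments stay in the tree, so they would have to be skipped explicitly before comparing.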
- Now repeat the process with the first PmtInf section and the third one, and then with the fourth one. If the parts before <CdtTrfTxInf> match every time, we should end up with only one PmtInf tag with 4 CdtTrfTxInf tags inside it.
- If, at some point, there's a mismatch (say, when comparing <InstrPrty>NORM</InstrPrty> from the first PmtInf with <InstrPrty>HIGH</InstrPrty> from the third PmtInf), leave that PmtInf section as it is and go to the next one.
- After we have finished comparing the first PmtInf with all the other PmtInf sections, compare the second PmtInf with the third one and apply the same rules, then the third one with the fourth one, and so on.
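The rules above amount to grouping the PmtInf sections by everything that precedes their first <CdtTrfTxInf>. As a rough illustration of the result I'm after (not the XSLT I'm asking about), here is a sketch using the standard library's xml.etree.ElementTree. It assumes that all PmtInf sections are direct children of CstmrCdtTrfInitn and that the CdtTrfTxInf elements come last inside each PmtInf, as in the sample. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "urn:iso:std:iso:20022:tech:xsd:pain.001.001.03"
PARENT_TAG = f"{{{NS}}}CstmrCdtTrfInitn"
PMT_TAG = f"{{{NS}}}PmtInf"
TX_TAG = f"{{{NS}}}CdtTrfTxInf"


def snapshot(elem):
    """Hashable view of a subtree (tag, attributes, stripped text, children)."""
    return (elem.tag, tuple(sorted(elem.attrib.items())),
            (elem.text or "").strip(),
            tuple(snapshot(child) for child in elem))


def header_key(pmt_inf):
    """Everything in a PmtInf that precedes its first CdtTrfTxInf."""
    key = []
    for child in pmt_inf:
        if child.tag == TX_TAG:
            break
        key.append(snapshot(child))
    return tuple(key)


def merge_payment_blocks(root):
    """Fold PmtInf sections with identical headers into the first such section."""
    parent = root.find(PARENT_TAG)
    if parent is None:
        return root  # nothing to merge
    first_seen = {}
    # Iterate over a copy, because sections are removed from parent on the way.
    for pmt_inf in [child for child in parent if child.tag == PMT_TAG]:
        target = first_seen.setdefault(header_key(pmt_inf), pmt_inf)
        if target is pmt_inf:
            continue  # first section with this header stays where it is
        # Duplicate header: move its transactions to the end of the first
        # matching section, then drop the now redundant section.
        target.extend([child for child in pmt_inf if child.tag == TX_TAG])
        parent.remove(pmt_inf)
    return root
```

A single pass with a dictionary gives the same outcome as the pairwise comparisons described in the list, since a section is merged exactly when its header equals that of an earlier section. Calling ET.register_namespace("", NS) before serializing should keep the default namespace without an ns0: prefix.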
Now, I might be asking too much, but can this be done with XSLT? I know I haven't shown an attempt, but I've already spent too much time trying to achieve this with simple Python string manipulation, and it looks like the XSLT documentation takes some time to get used to, syntax-wise.
I'm calling the script like this:
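from lxml import etree as ET  # ET.XSLT and recover=True are lxml features


def parse_xml(file, output_path):
    # Parse leniently so that minor well-formedness errors do not abort the run.
    parser = ET.XMLParser(encoding='utf-8', recover=True)
    dom = ET.parse(file, parser=parser)
    xslt = ET.fromstring(TEMPLATE_XSLT)  # TEMPLATE_XSLT contains the transformation
    transform = ET.XSLT(xslt)
    new_dom = transform(dom)
    with open(output_path, 'wb') as xml_file:
        xml_file.write(bytes(new_dom))  # serialize the XSLT result tree to bytes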