I have an XML file with multiple row tags, and I need to convert it into a proper dataframe. I have used spark-xml, but it only handles a single row tag.
The XML data is below:
<?xml version='1.0' encoding='UTF-8' ?>
<generic
xmlns="http://xactware.com/generic.xsd" majorVersion="28" minorVersion="300" transactionId="0000">
<HEADER compName="ABGROUP" dateCreated="2018-03-09T09:38:51"/>
<COVERSHEET>
<ESTIMATE_INFO estimateName="2016-09-28-133907" priceList="YHTRDF" laborEff="Restoration/Service/Remodel" claimNumber="Hdchtdhtdh" policyNumber="Utfhtdhtd" typeOfLoss="Collapse" causeOfLoss="Collapse" roofDamage="0" deprMat="1" deprNonMat="1" deprRemoval="1" deprOandP="1" deprTaxes="1" estimateType="Mixed"/>
<ADDRESSES>
<ADDRESS type="Property" street="Pkwy" city="Lehi" state="UT" zip="0000" primary="1"/>
</ADDRESSES>
<CONTACTS>
<CONTACT type="ClaimRep" name="Vytvyfv"/>
<CONTACT type="Estimator" name="Vytvyfv"/>
</CONTACTS>
<DATES loss="2016-09-28T19:38:23Z" inspected="2016-09-28T19:39:27Z" completed="2018-03-09T09:38:49Z" received="2016-09-28T19:39:24Z" entered="2016-09-28T19:39:07Z" contacted="2016-09-28T19:39:26Z"/>
</COVERSHEET>
<COVERAGES>
<COVERAGE coverageName="Dwelling" coverageType="0" id="1"/>
<COVERAGE coverageName="Other Structures" coverageType="1" id="2"/>
<COVERAGE coverageName="Contents" coverageType="2" id="3"/>
</COVERAGES>
<LINE_ITEM_DETAIL>
<COV_BREAKDOWN>
<COV_AMOUNTS desc="Dwelling"/>
<COV_AMOUNTS desc="Other Structures"/>
<COV_AMOUNTS desc="Contents"/>
</COV_BREAKDOWN>
</LINE_ITEM_DETAIL>
<RECAP_BY_ROOM>
<RECAP_GROUP desc="2016-09-28-133907"/>
</RECAP_BY_ROOM>
</generic>
I would suggest reading it with a single rowTag (the generic element) and then exploding the nested columns according to your needs.
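First of all, an element's start tag should not contain a line delimiter between the tag name and its attributes, so the generic start tag, which is split across two lines in your file, should be written on a single line.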
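If the file is too large to amend by hand, the split start tags can be joined programmatically. The following is a minimal sketch in plain Python, not part of spark-xml; the helper name join_split_tags is hypothetical, and it assumes that no attribute value, comment, or CDATA section contains a "<" or ">" character.

```python
import re

# Matches one markup tag, from "<" up to the next ">".
# Assumption: no attribute value, comment, or CDATA section contains "<" or ">".
_TAG = re.compile(r"<[^<>]*>")


def join_split_tags(xml_text: str) -> str:
    """Replace line breaks inside tags with a single space.

    Text between tags, including the line breaks that separate elements,
    is left untouched.
    """
    def _join(match):
        # Collapse each line break (plus surrounding whitespace) to one space.
        return re.sub(r"\s*\n\s*", " ", match.group(0))

    return _TAG.sub(_join, xml_text)
```

To use it, read the original file into a string, pass it through join_split_tags, and write the result to a new file that you then hand to Spark.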
Once that amendment is made, you can read the file using Databricks spark-xml with generic as the rowTag, which should give you a single-row dataframe.
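The read step could look roughly like this in PySpark. This is a sketch, not necessarily the exact code originally posted: the function name read_generic and the path argument are hypothetical, and it assumes a SparkSession that was started with the spark-xml package on its classpath.

```python
def read_generic(spark, path):
    """Load the whole document as one row, using <generic> as the row tag.

    spark: an existing SparkSession with the spark-xml package available.
    path:  location of the amended XML file (hypothetical; supply your own).
    """
    return (
        spark.read
        .format("com.databricks.spark.xml")  # data source name registered by spark-xml
        .option("rowTag", "generic")         # each <generic> element becomes one row
        .load(path)
    )
```

Calling printSchema() on the returned dataframe shows the nested structure. With spark-xml's default settings, columns that come from XML attributes are prefixed with an underscore.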
After inspecting that dataframe, you can simplify its schema.
Now you can transform it into multiple rows based on the COVERAGE, CONTACT, or COV_AMOUNTS columns, as they are the only columns that can be exploded into multiple rows. I hope the answer is helpful.
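As an illustration of the explode step, here is a hedged PySpark sketch. The helper name explode_rows is hypothetical. It assumes the dataframe is the one read with generic as the rowTag, before any simplification, and that spark-xml inferred the repeated elements as arrays of structs, so the array paths follow the nesting of the XML. If you have already pulled the arrays up to top-level columns, pass the column name itself as the path.

```python
def explode_rows(df, array_path, alias):
    """Return one row per element of the array found at array_path.

    The fields of each array element are promoted to top-level columns.
    """
    # explode() emits one output row for every element of the array.
    exploded = df.selectExpr(f"explode({array_path}) AS {alias}")
    # "alias.*" expands the struct's fields into separate columns.
    return exploded.select(f"{alias}.*")


# Possible calls, with paths taken from the nesting in the sample XML:
# coverages   = explode_rows(df, "COVERAGES.COVERAGE", "COVERAGE")
# contacts    = explode_rows(df, "COVERSHEET.CONTACTS.CONTACT", "CONTACT")
# cov_amounts = explode_rows(df, "LINE_ITEM_DETAIL.COV_BREAKDOWN.COV_AMOUNTS", "COV_AMOUNTS")
```

Exploding each array into its own dataframe keeps the row counts independent. Exploding two arrays in the same dataframe would instead produce every combination of their elements.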