I have the XML below, which I am trying to load into a Spark DataFrame.
<?xml version="1.0"?>
<env:ContentEnvelope xsi:schemaLocation="http">
<env:Body minVers="0.0" majVers="1" contentSet="Fundamental">
<env:ContentItem action="Overwrite">
<env:Data xsi:type="sr:FinancialSourceDataItem">
<sr:Source sourceId="344" organizationId="4295906830">
<sr:Auditor auditorId="3541">
<sr:Auditor auditorId="9574">
The root tag is <env:ContentEnvelope>. It has two parts: a header (<env:Header>) and a body (<env:Body>).
Details in the body such as <fun:OrgId> and <fun:DataPartitionId> will be the same for all rows in <env:Body>.
From this I want to create two DataFrames: one for <sr:Source> and a second one for <sr:Auditor>.
Both DataFrames should carry action="Overwrite" as a common column.
Also, because <sr:Auditor> is a child of <sr:Source>, a few columns such as sourceId="344" and organizationId="4295906830" will be repeated in the <sr:Auditor> DataFrame.
This is what I have done so far:
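import org.apache.spark.sql.functions.explode
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._ // enables the $"column" syntax used below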
val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:ContentEnvelope").load("s3://trfsmallfffile/XML")
val dfHeader = dfContentEnvelope.withColumn("Header", (dfContentEnvelope("env:Header"))).select("Header.*")
val dfDataPartitionId =dfHeader.select("fun:DataPartitionId")
//val dfBody = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:Body").load("s3://trfsmallfffile/XML")
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select("column1.*")
val dfType=dfContentItem.select("env:Data.*")
// Filter on the struct before flattening it, so the filter does not refer to a column that the select has already expanded away
val srSource = dfType.filter(dfType("sr:Source").isNotNull).select("sr:Source.*").drop("sr:Auditors")
val srSourceAuditor = dfType.withColumn("srSource", explode(dfType("sr:Source.sr:Auditors.sr:Auditor"))).select("srSource.*")
So my question is: how can I get a parent DataFrame for <sr:Source> and a child DataFrame for <sr:Auditor> that also carries organizationId and sourceId from the parent?
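To make the target shape concrete, here is a rough, untested sketch of the kind of result I am after. It assumes spark-xml's default attribute prefix, so attributes show up as _action, _sourceId and _organizationId. It also assumes env:ContentItem is inferred as an array, sr:Source as a single struct, and that the auditors sit under an sr:Auditors wrapper as in my code above. The helper name splitSourceAndAuditor is made up for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical helper (not part of Spark or spark-xml): returns (parent, child).
def splitSourceAndAuditor(dfContentEnvelope: DataFrame): (DataFrame, DataFrame) = {
  // One row per env:ContentItem, keeping the action attribute next to the source struct.
  // Assumption: attributes carry the default "_" prefix.
  val dfItems = dfContentEnvelope
    .select(explode(col("env:Body.env:ContentItem")).as("item"))
    .select(
      col("item._action").as("action"),
      col("item.env:Data.sr:Source").as("source"))
    .filter(col("source").isNotNull)

  // Parent: the source fields plus action, without the nested auditors.
  val dfSource = dfItems
    .select(col("action"), col("source.*"))
    .drop("sr:Auditors")

  // Child: one row per auditor, with the parent keys repeated on every row.
  val dfAuditor = dfItems
    .select(
      col("action"),
      col("source._sourceId").as("sourceId"),
      col("source._organizationId").as("organizationId"),
      explode(col("source.sr:Auditors.sr:Auditor")).as("auditor"))
    .select(col("action"), col("sourceId"), col("organizationId"), col("auditor.*"))

  (dfSource, dfAuditor)
}

val (dfSource, dfAuditor) = splitSourceAndAuditor(dfContentEnvelope)
```

The idea is to pull the parent keys out into top-level columns before exploding the auditors, so that each exploded row keeps them. If the inferred schema differs (for example, if sr:Source turns out to be an array), the column paths would need adjusting.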