I am trying to load sales data in XML format into a Hive table. Below is a small sample of the data.
I am aware that I can load this data into Hive if I split it across several tables and then join them as required. However, I would like to know whether I can load it into a single table so that the output looks like the attached screenshot.
Please help me with the table structure I should use and with how to use LATERAL VIEW explode effectively to achieve this.
Sample data:
<Store>
<Version>1.1</Version>
<StoreId>16695</StoreId>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1753</BsktNo>
<DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
<OpID>50056</OpID>
<Itm>
<ItmSeq>1</ItmSeq>
<GTIN>29559</GTIN>
<ItmDsc>CHOCALATE</ItmDsc>
<ItmProm>
<PromCD>CM</PromCD>
</ItmProm>
</Itm>
<Itm>
<ItmSeq>2</ItmSeq>
<GTIN>59653</GTIN>
<ItmDsc>CORN FLAKES</ItmDsc>
</Itm>
<Itm>
<ItmSeq>3</ItmSeq>
<GTIN>42260</GTIN>
<ItmDsc> MILK CHOCOLATE 162GM</ItmDsc>
<ItmProm>
<PromCD>MTSRO</PromCD>
<OfferID>11766</OfferID>
</ItmProm>
</Itm>
</Bskt>
<Bskt>
<TillNo>5</TillNo>
<BsktNo>1947</BsktNo>
<DateTime>2017-10-31T16:24:59.000+11:00</DateTime>
<OpID>50063</OpID>
<Itm>
<ItmSeq>1</ItmSeq>
<GTIN>24064</GTIN>
<ItmDsc>TOMATOES 2KG</ItmDsc>
<ItmProm>
<PromCD>INSTORE</PromCD>
</ItmProm>
</Itm>
<Itm>
<ItmSeq>2</ItmSeq>
<GTIN>81287</GTIN>
<ItmDsc>ROTHMANS BLUE</ItmDsc>
<ItmProm>
<PromCD>TF</PromCD>
</ItmProm>
</Itm>
</Bskt>
</Store>
Desired Output
Table structure:
CREATE EXTERNAL TABLE IF NOT EXISTS POC_BASKET_ITEM_PROMO (
`Version` string,
`StoreId` string,
`DateTime` array<string>,
`BsktNo` array<double>,
`TillNo` array<int>,
`Item_Seq_num` array<int>,
`GTIN` array<string>,
`ItmDsc` array<string>,
`Promo_CD` array<string>,
`Offer_ID` array<int>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.Version"="/Store/Version/text()",
"column.xpath.StoreId"="/Store/StoreId/text()",
"column.xpath.DateTime"="/Store/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Store/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Store/Bskt/TillNo/text()",
"column.xpath.Item_Seq_num"="/Store/Bskt/Itm/ItmSeq/text()",
"column.xpath.GTIN"="/Store/Bskt/Itm/GTIN/text()",
"column.xpath.ItmDsc"="/Store/Bskt/Itm/ItmDsc/text()",
"column.xpath.Promo_CD"="/Store/Bskt/Itm/ItmProm/PromCD/text()",
"column.xpath.Offer_ID"="/Store/Bskt/Itm/ItmProm/OfferID/text()"
)
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://namenode:8020/DEV/TEST/nanda_test'
TBLPROPERTIES (
"xmlinput.start"="<Store","xmlinput.end"="</Store>"
);
Output: (screenshot not included)
I tried the query below to read the data, but it does not show the results the way I want.
select Version,StoreId,basket_dtm,basket_number,till_number from POC_BASKET_ITEM_PROMO
LATERAL VIEW explode(DateTime) table1 as basket_dtm
LATERAL VIEW explode(BsktNo) table2 as basket_number
LATERAL VIEW explode(TillNo) table3 as till_number;
Results:
Thanks for the detailed solution. I tested it and it worked perfectly. I tried a similar approach to read the data directly from the XML with the XML SerDe.
My challenges:
Below is the Hive table structure and the query I am using to read the data. I am able to explode the first-level array (Bskt) without any issues.
But when I try to explode the second-level array (Itm), it returns NULL for all the fields in 'Itm'.
Is there an issue with my query or with the table structure itself?
Query:
1) For Bskt, which works fine:
Results:
(screenshot not included)
2) When trying two lateral view explodes in a single query:
Results:
(screenshot not included)
3) Query:
Error:
(screenshot not included)
Explode on an array column works like a cross join. So if you have 3 columns, each containing an array with 2 elements, applying explode to all of them will give you 8 rows (2 × 2 × 2).
With plain explode you can't map an element of one array to the corresponding element of another.
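A quick way to see this multiplication is to explode three small arrays built inline rather than the real table. This is only an illustrative sketch: the subquery, the column names xs, ys, zs and the aliases are made up for the example, and it assumes a Hive version that allows SELECT without a FROM clause (0.13 or later).

```sql
-- One input row holding three 2-element arrays (hypothetical sample values).
-- Each LATERAL VIEW explode multiplies the row count by the size of its
-- array, so the result has 2 * 2 * 2 = 8 rows.
SELECT a, b, c
FROM (
  SELECT array(1, 2)     AS xs,
         array('p', 'q') AS ys,
         array('x', 'y') AS zs
) t
LATERAL VIEW explode(xs) v1 AS a
LATERAL VIEW explode(ys) v2 AS b
LATERAL VIEW explode(zs) v3 AS c;
```

Every combination of a, b and c appears once, which is why the three-way explode in the question pairs each basket timestamp with every basket number and every till number.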
Actually you can, by using posexplode, which gives you an index for each element that you can use to match elements with a join-style condition. However, that is tricky when you have multiple columns and the array size is different for each column.
Solution
1) Use posexplode if you have few columns to explode and the arrays are all the same size. For your case this is not going to work.
2) So create a struct based on your XML. If your XML is not very complex, you can achieve this. However, XmlSerDe is not as good as JsonSerDe when it comes to converting a file to complex data types.
So in your case the best solution would be:
1) Convert the XML to JSON, using NiFi or some other technology for that.
2) Use JsonSerDe and load this file.
JSON for Your XML
JsonSerDe might give you an error if you have tabs or other whitespace in your file, so it's always best to remove them.
Hive Table
Create View
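A minimal sketch of what such a table and view could look like, assuming the converted file holds one Store object per line and that Bskt and Itm are always emitted as JSON arrays, even when there is only one element. The table name, view name and LOCATION are hypothetical, every leaf field is declared as string to avoid relying on type coercion, and the SerDe class shown is the one shipped in hive-hcatalog-core, which may need to be added to the session with ADD JAR.

```sql
-- Hypothetical table over the converted JSON: one Store document per line.
-- The nesting mirrors the XML: Store -> Bskt[] -> Itm[] -> ItmProm.
CREATE EXTERNAL TABLE IF NOT EXISTS poc_store_json (
  Version string,
  StoreId string,
  Bskt array<struct<
    TillNo:string,
    BsktNo:string,
    `DateTime`:string,
    OpID:string,
    Itm:array<struct<
      ItmSeq:string,
      GTIN:string,
      ItmDsc:string,
      ItmProm:struct<PromCD:string, OfferID:string>
    >>
  >>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/path/to/converted/json';  -- placeholder path

-- Hypothetical view that flattens the two array levels.
-- The second explode runs on the struct produced by the first one, so each
-- item stays attached to its own basket instead of being cross joined.
-- OUTER keeps baskets that have no items.
CREATE VIEW IF NOT EXISTS poc_basket_item_promo_v AS
SELECT
  s.Version,
  s.StoreId,
  b.`DateTime`       AS basket_dtm,
  b.BsktNo           AS basket_number,
  b.TillNo           AS till_number,
  i.ItmSeq           AS item_seq_num,
  i.GTIN             AS gtin,
  i.ItmDsc           AS item_desc,
  i.ItmProm.PromCD   AS promo_cd,   -- NULL when the item has no ItmProm
  i.ItmProm.OfferID  AS offer_id    -- NULL when OfferID is absent
FROM poc_store_json s
LATERAL VIEW explode(s.Bskt) bt AS b
LATERAL VIEW OUTER explode(b.Itm) it AS i;
```

Because the items are exploded from within each basket struct, the sample data would yield one row per item (three for basket 1753 and two for basket 1947) rather than every basket value paired with every item value. Numeric fields can be cast in the view if typed columns are needed.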