How to load an XML file into Hive

Published 2020-03-05 06:22

Question:

I'm working with Hive tables and I have the following problem. I have more than 1 billion XML files in my HDFS, and each XML file has four different sections. I want to split each file and load each section into its own table.

Example :

            <?xml version='1.0' encoding='iso-8859-1'?>

            <xml>

            <section1>
                <id> 1233222 </id>
                <!-- many more XML tags -->
            </section1>

            <section2>
                <!-- many more XML tags -->
            </section2>

            <section3>
                <!-- many more XML tags -->
            </section3>

            <section4>
                <!-- many more XML tags -->
            </section4>

            </xml>

And I have the following four tables:

        section1Table

        id       section1    // fields 

        section2Table

        id       section2

        section3Table 

        id       section3

        section4Table

        id       section4

Now I want to split the data and load it into each of these tables.

How can I achieve this? Can anyone help me?

Thanks

UPDATE

I have tried the following:

CREATE EXTERNAL TABLE test(name STRING) LOCATION '/user/sornalingam/zipped/output/Tagged/t1';


SELECT xpath(name, '//section1') FROM test LIMIT 1;

but I got the following error:

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"name":"<?xml version='1.0' encoding='iso-8859-1'?>"}

Answer 1:

You have several options:

  • Load the XML into a Hive table with a string column, one file per row (e.g. CREATE TABLE xmlfiles (id int, xmlfile string)). Then use an XPath UDF to do work on the XML; a minimal sketch of this approach follows this list.
  • Since you know the XPaths of what you want (e.g. //section1), follow the instructions in the second half of this tutorial to ingest directly into Hive via XPath.
  • Map your XML to Avro as described here, because a SerDe exists for seamless Avro-to-Hive mapping.
  • Use XPath to store your data in a regular text file in HDFS and then ingest that into Hive.
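
For the first option, here is a minimal sketch of what that might look like. It assumes each XML file has been collapsed onto a single line so that one row holds one whole document, and it assumes a root element <xml> as in the question's example; xmlfiles and section1Table are simply the names used in this thread:

        -- Each row holds one complete XML document (the whole file must sit on a single line).
        CREATE TABLE xmlfiles (id INT, xmlfile STRING);

        -- Fill section1Table by pulling values out with the built-in XPath UDFs.
        -- xpath_string returns the text content of the first node matching the expression.
        INSERT INTO TABLE section1Table
        SELECT xpath_string(xmlfile, '/xml/section1/id') AS id,
               xpath_string(xmlfile, '/xml/section1')    AS section1
        FROM xmlfiles;

The same pattern would repeat for section2Table through section4Table, changing only the XPath expressions.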

It depends on your level of experience and comfort with these approaches.



Answer 2:

Use this:

CREATE EXTERNAL TABLE test (name STRING)
LOCATION '/user/sornalingam/zipped/output/Tagged/t1'
TBLPROPERTIES ("skip.header.line.count"="1", "skip.footer.line.count"="1");

And then use the xpath function.
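
A minimal follow-up query might look like the sketch below. This assumes that, once the header (the XML declaration) and the footer (the closing root tag) are skipped, each remaining row of test holds a complete, well-formed XML fragment, since the xpath UDF parses one row at a time:

        -- xpath returns an array<string> with the text values of every node
        -- matching the expression in that row's XML.
        SELECT xpath(name, '//section1/id/text()') FROM test LIMIT 1;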



Tags: xml hadoop hive