load struct or any other complex data type in hive

2019-06-02 09:05发布

问题:

I have a .xlsx file which contains data some thing like the below image, am trying to create using the below create query

CREATE TABLE aus_aboriginal(
    code int,
    area_name string,
    male_0_4 STRUCT<num:double, total:double, perc:double>,
    male_5_9 STRUCT<num:double, total:double, perc:double>,
    male_10_14 STRUCT<num:double, total:double, perc:double>,
    male_15_19 STRUCT<num:double, total:double, perc:double>,
    male_20_24 STRUCT<num:double, total:double, perc:double>,
    male_25_29 STRUCT<num:double, total:double, perc:double>,
    male_30_34 STRUCT<num:double, total:double, perc:double>,
    male_35_39 STRUCT<num:double, total:double, perc:double>,
    male_40_44 STRUCT<num:double, total:double, perc:double>,
    male_45_49 STRUCT<num:double, total:double, perc:double>,
    male_50_54 STRUCT<num:double, total:double, perc:double>,
    male_55_59 STRUCT<num:double, total:double, perc:double>,
    male_60_64 STRUCT<num:double, total:double, perc:double>,
    male_above_65 STRUCT<num:double, total:double, perc:double>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

When I load the data into it I get nulls What am I missing in CREATE TABLE..?

回答1:

On using complex types like struct, it is recommended to use a unique delimiter for collection than the one used for fields (columns). Consider a csv file in below format where “,” comma separator is used. Input.csv

Code, area_name,num,total,perc,num,total,perc,num,total,perc 1100,Albury,90,444,17.4,73,546,13.4,86,546,15.8

1111,armid,40,404,14.4,97,701,13.8,76,701,10.8

Expected result is to create a complex type out of fields (num, total and perc):

1100,Albury,struct<90,444,17.4>,struct<73,546,13.4>,struct<86,546,15.8>

1111,armid, struct<40,404,14.4>, struct<97,701,13.8>,struct<76,701,10.8>

when we try to create a complex type out of fields (num, total and perc) in this case using following hive query, we will get multiple null values in the table since the same “,” comma delimiter is used for both fields and collections, so Hive query failed to segregate the data as we required.

Hive> create table aus_aboriginal( code int, area_name string, male_0_4 STRUCT<num:double, total:double, perc:double>, male_5_9 STRUCT<num:double, total:double, perc:double>, male_10_14 STRUCT<num:double, total:double, perc:double>) ROW FORMAT DELIMITED FIELDS TERMINATED BY  ',' COLLECTION ITEMS TERMINATED BY ',' LOCATION '/csv';

Output:

1100 Albury {"num":90.0,"total":null,"perc":null} {"num":444.0,"total":nul l,"perc":null} {"num":17.4,"total":null,"perc":null}

1111 armid {"num":40.0,"total":null,"perc":null} {"num":404.0,"total":nul l,"perc":null} {"num":14.4,"total":null,"perc":null}

Time taken: 0.15 seconds, Fetched: 2 row(s)

I am suspecting that you are facing this problem.

Usage of Struct Now consider the input file having data in below format, where “,” comma delimiter is used for fields and for collection items “#” is used as delimiter.

1100,Albury,90#444#17.4,73#546#13.4,86#546#15.8

1111,armid,40#404#14.4,97#701#13.8,76#701#10.8

In this case, we can successfully create a table with complex type by specifying # as delimiter for collection items and , for fields. Please check below hive query.

hive> create table aus_aboriginal( code int, area_name string, male_0_4 STRUCT<num:double, total:double, perc:double>, male_5_9 STRUCT<num:double, total:double, perc:double>, male_10_14 STRUCT<num:double, total:double, perc:double>) ROW FORMAT DELIMITED FIELDS TERMINATED BY  ',' COLLECTION ITEMS TERMINATED BY '#' LOCATION '/csv';

Output:

hive> select * from aus_aboriginal;

1100 Albury {"num":90.0,"total":444.0,"perc":17.4} {"num":73.0,"total":546. 0,"perc":13.4} {"num":86.0,"total":546.0,"perc":15.8}

1111 armid {"num":40.0,"total":404.0,"perc":14.4} {"num":97.0,"total":701. 0,"perc":13.8} {"num":76.0,"total":701.0,"perc":10.8}

Time taken: 0.146 seconds, Fetched: 2 row(s)

Similar approach should be taken for other complex types as well, refer below link for more information.

Reference: http://edu-kinect.com/blog/2014/06/16/hive-complex-data-types-with-examples/



回答2:

You add delimiter for struct type also in CREATE statement as below:

CREATE TABLE aus_aboriginal( code INT, area_name STRING, 
male_0_4 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>, 
male_5_9 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>, 
male_10_14 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>, 
male_15_19 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_20_24 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_25_29 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_30_34 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_35_39 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>, 
male_40_44 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>, 
male_45_49 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>, 
male_50_54 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>, 
male_55_59 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>, 
male_60_64 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>, 
male_above_65 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
COLLECTION ITEMS TERMINATED BY ':';

You can have one sample query like :

SELECT code, male_0_4.num, male_0_4.total, male_0_4.perc FROM aus_aboriginal;


回答3:

Create a hive table using:

CREATE TABLE `complex_data_types`(
  `col1` array<string>, 
  `col2` map<int,string>, 
  `col3` struct<c1:smallint,c2:varchar(30)>)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
  COLLECTION ITEMS TERMINATED BY '&' 
  MAP KEYS TERMINATED BY '#';

Note: union can be taken in the same way

Create a csv file:

arr1&arr2,101#map1&102#map2,11&varchar_1
arr3&arr4,103#map3&104#map4,12&varchar_2

Load this data in hive table:

LOAD DATA LOCAL INPATH '/home/dev/complex_data.csv' into table complex_data_types;

Note: Assuming file is located at /home/dev/complex_data.csv