Hadoop-Hive | Convert single row columns into mult

2019-06-14 18:26发布

问题:

I have a Hive table like this

 Created_date   ID1 Name1 Age1 Gender1 Name2 ID2 Age2 Gender2 ID3 Name3  Age3 Gender3....
  2014-02-01    1   ABC   21    M      MNP    2  22   F       3   XYZ    25   M
  2015-06-06    11  LMP   31    F      PLL   12  42   M       13  UIP    37   F

This table may have any no. of repeated set of the 4 columns pair. The sequence of these 4 columns is also not fix and there may be 1 or 2 more columns which are not repeated like created_date

I need to convert the above table into a new Hive table having only 4 columns ID, Name, Age and Gender like below. I dont need created_date column in my result table.

ID  Name  Age Gender
1   ABC   21  M
2   MNP   22  F
3   XYZ   25  M
11  LMP   31  F
12  PLL   42  M
13  UIP   37  F

Please suggest how can I achieve this in Hive.

回答1:

select  inline
        (
            array
            (
                struct(ID1,Name1,Age1,Gender1)
               ,struct(ID2,Name2,Age2,Gender2)
               ,struct(ID3,Name3,Age3,Gender3)
             )
        ) as (ID,Name,Age,Gender)

 from   mytable

+----+------+-----+--------+
| id | name | age | gender |
+----+------+-----+--------+
|  1 | ABC  |  21 | M      |
|  2 | MNP  |  22 | F      |
|  3 | XYZ  |  25 | M      |
| 11 | LMP  |  31 | F      |
| 12 | PLL  |  42 | M      |
| 13 | UIP  |  37 | F      |
+----+------+-----+--------+


回答2:

Your problem can be solved using Hive UDTF. User defined tabular function(UDTF) works on one row as input and returns multiple rows as output.

For example:

We have a file in which we have multiple records. Each record contains a list of customer, merchant1, merchant2. We want to get a list of all the distinct customers associated with a customer.

Sample Input: 
Cust1, Merchant1, Merchant2
Cust2, Merchant1, Merchant2

Expected Output
Cust1, Merchant1
Cust1, Merchant2
Cust2, Merchant1
Cust2, Merchant2

Now you can write a UDTF named Explode that would get the expected output and can be used in a query like :

Select Explode(customer, firstMerchant, secondMerchant) 
AS (cust, merchant) 
from Merchants

You can refer : https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF