I have a Hive table like this
Created_date ID1 Name1 Age1 Gender1 Name2 ID2 Age2 Gender2 ID3 Name3 Age3 Gender3....
2014-02-01 1 ABC 21 M MNP 2 22 F 3 XYZ 25 M
2015-06-06 11 LMP 31 F PLL 12 42 M 13 UIP 37 F
This table may have any no. of repeated set of the 4 columns pair. The sequence of these 4 columns is also not fix and there may be 1 or 2 more columns which are not repeated like created_date
I need to convert the above table into a new Hive table having only 4 columns ID, Name, Age and Gender like below. I dont need created_date column in my result table.
ID Name Age Gender
1 ABC 21 M
2 MNP 22 F
3 XYZ 25 M
11 LMP 31 F
12 PLL 42 M
13 UIP 37 F
Please suggest how can I achieve this in Hive.
select inline
(
array
(
struct(ID1,Name1,Age1,Gender1)
,struct(ID2,Name2,Age2,Gender2)
,struct(ID3,Name3,Age3,Gender3)
)
) as (ID,Name,Age,Gender)
from mytable
+----+------+-----+--------+
| id | name | age | gender |
+----+------+-----+--------+
| 1 | ABC | 21 | M |
| 2 | MNP | 22 | F |
| 3 | XYZ | 25 | M |
| 11 | LMP | 31 | F |
| 12 | PLL | 42 | M |
| 13 | UIP | 37 | F |
+----+------+-----+--------+
Your problem can be solved using Hive UDTF. User defined tabular function(UDTF) works on one row as input and returns multiple rows as output.
For example:
We have a file in which we have multiple records. Each record contains a list of customer, merchant1, merchant2. We want to get a list of all the distinct customers associated with a customer.
Sample Input:
Cust1, Merchant1, Merchant2
Cust2, Merchant1, Merchant2
Expected Output
Cust1, Merchant1
Cust1, Merchant2
Cust2, Merchant1
Cust2, Merchant2
Now you can write a UDTF named Explode that would get the expected output and can be used in a query like :
Select Explode(customer, firstMerchant, secondMerchant)
AS (cust, merchant)
from Merchants
You can refer :
https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF