Hive SELECT statement to create an ARRAY of STRUCT

2020-03-06 03:30发布

问题:

I'm having trouble selecting into an ARRAY of STRUCTS in Hive.

My source table looks like this:

+-------------+--+
|    field    |
+-------------+--+
| id          |
| fieldid     |
| fieldlabel  |
| fieldtype   |
| answer_id   |
| unitname    |
+-------------+--+

This is survey data, where the id is the survey id, the four fields in the middle are response data, and the unitname is the the business unit that the survey pertains to.

I need to create an array of structs for all of the answers for each survey id. I thought this would work, but it doesn't:

select id, 
array( 
    named_struct(
        "field_id",
        fieldid,
        "field_label",
        fieldlabel,
        "field_type",
        fieldtype,
        "answer_id",
        answer_id,)) as answers,
unitname
from new_answers;

What that returns is each survey answer (field_id) as an array of one struct for that answer, like this:

id | answers | unitname
1 | [{"field_id":175877,"field_label":"Comment","field_type":"COMMENT","answer_id":8990947803}] | Location1
2 | [{"field_id":47824,"field_label":"Language","field_type":"MULTIPLE_CHOICE","answer_id":8990950069}] | Location2
2 | [{"field_id":48187,"field_label":"Language Type","field_type":"MULTIPLE_CHOICE","answer_id":8990950070}] | Location2
2 | [{"field_id":47829,"field_label":"Trans #","field_type":"TEXT","answer_id":8990950071}] | Location2

But what I need to get to is this:

id | answers | unitname    
1 | [{"field_id":175877,"field_label":"Comment","field_type":"COMMENT","answer_id":8990947803}] | Location1
2 | [{"field_id":47824,"field_label":"Language","field_type":"MULTIPLE_CHOICE","answer_id":8990950069},
   {"field_id":48187,"field_label":"Language Type","field_type":"MULTIPLE_CHOICE","answer_id":8990950070},
   {"field_id":47829,"field_label":"Trans #","field_type":"TEXT","answer_id":8990950071}] | Location2

I've searched and searched, but all the answers I'm finding seem to have to do with using INSERT INTO....VALUES() queries. I have a table structure already; I just can't get the ARRAY to ARRAY up the way it should.

Any help would be most appreciated.

For reproduction purposes, if need be:

CREATE TABLE `new_answers`( 
`id` bigint,
`fieldid` bigint,
`fieldlabel` string,
`fieldtype` string,
`answer_id` bigint,
`unitname` string)

INSERT INTO new_answers VALUES
(1,175877,"Comment","COMMENT",8990947803,"Location1"),
(2,47824,"Language","MULTIPLE_CHOICE",8990950069,"Location2"),
(2,48187,"Language Type","MULTIPLE_CHOICE",8990950070,"Location2"),
(2,47829,"Trans #","TEXT",8990950071,"Location2");

回答1:

The functionality that you seem to be looking for is to collect the structs into an array. Hive comes with two functions for collecting things into arrays: collect_set and collect_list. However, those functions only work to create arrays of basic types.

The jar for the brickhouse project (https://github.com/klout/brickhouse/wiki/Downloads) provides a number of features, including the ability to collect complex types.

add jar hdfs://path/to/your/jars/brickhouse-0.6.0.jar

Then you can add the collect function using whatever name you like:

create temporary function collect_struct as 'brickhouse.udf.collect.CollectUDAF';

The following query:

select id
     , collect_struct( 
         named_struct(
           "field_id", fieldid,
           "field_label", fieldlabel,
           "field_type", fieldtype,
           "answer_id", answer_id)) as answers
     , unitname
  from new_answers
 group by id, unitname
;

Provides the following result:

id  answers unitname
1   [{"field_id":175877,"field_label":"Comment","field_type":"COMMENT","answer_id":8990947803}] Location1
2   [{"field_id":47824,"field_label":"Language","field_type":"MULTIPLE_CHOICE","answer_id":8990950069},{"field_id":48187,"field_label":"Language Type","field_type":"MULTIPLE_CHOICE","answer_id":8990950070},{"field_id":47829,"field_label":"Trans #","field_type":"TEXT","answer_id":8990950071}]    Location2