Collect to a Map in Hive

2019-04-27 09:42发布

问题:

I have a Hive table such as

id  |  value
-------------
A      1
A      2
B      3
A      4
B      5

Essentially, I want to mimic Python's defaultdict(list) and create a map with id as the keys and value as the values.

Query:

select COLLECT_TO_A_MAP(id, value)
from table

Output:

{A:[1,2,4], B:[3,5]}

I tried using klout's CollectUDAF() but it appears this will not append the values to an array, it will just update them. Any ideas?

EDIT: Here is a more detailed description so I can avoid answers referencing that I try functions in the Hive documentation. Suppose I have a table

num    |id    |value
____________________
1       A      1
1       A      2
1       B      3
2       A      4
2       B      5
2       B      6

What I am looking for is for a UDAF that provides this output

num     |new_map
________________________
1       {A:[1,2], B:[3]}
2       {A:[4], B:[5,6]}

To this query

select num
      ,COLLECT_TO_A_MAP(id, value) as new_map
from table
group by num

There is a workaround to achieve this. It can be mimicked by using Klout's (see above referenced UDAF) CollectUDAF() in a query such as

add jar '~/brickhouse/target/brickhouse-0.6.0.jar'
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';

select num
       ,collect(id_array, value_array) as new_map
from (
      select collect_list(id) as id_array
            ,collect_list(value) as value_array
            ,num
      from table
      group by num
     ) A
group by num

However, I would rather not write a nested query.

EDIT #2

(As referenced in my original question) I have already tried using Klout's CollectUDAF(), even in the instance where you pass it two parameter and it creates a map. The output from that is (if applied to the dataset in my 1st edit)

1    {A:2, B:3}
2    {A:4, B:6}

As stated in my original question, it doesn't collect the values to an array it just collects the last one (or updates the array).

回答1:

Use the collect UDF in Brickhouse (http://github.com/klout/brickhouse )

It is exactly what you need. Brickhouse's 'collect' returns a list if one parameter is used, and a map if two parameters are used.



回答2:

the CollectUDAF in Brickhouse (http://github.com/klout/brickhouse ) will get you there.

regarding your comment EDIT #2:

first, collect the values to a list, then collect the k,v pairs to a map:

select
    num,
    collectUDAF(id, values) as new_map
from
    (
    SELECT
        num,
        id,
        collect_set(value) as values
    FROM
        tbl
    GROUP BY
        num,
        id
    ) as sub
GROUP BY
    num

will return

num  | new_map
________________________
1      {A:[1,2], B:[3]}
2      {A:[4], B:[5,6]}


回答3:

If you don't care about the order in which the values appear, you could use the collect_set() UDAF that comes with Hive.

SELECT id, collect_set(value) FROM table GROUP BY id;

This should solve your issue.



回答4:

Your current query groups by num in both the inner and outer query -- you need to group by id in the inner query to accomplish what you're trying to do.



回答5:

https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CollectUDAF.java#L55

see brickhouse udaf,when args num larger than 1, MapCollectUDAFEvaluator would be used.

add jar */brickhouse.jar ;
create temporary function collect  as 'brickhouse.udf.collect.CollectUDAF';
select 
collect(a,b)
from( select 1232123 a,21 b 
    union all select 123 a,23 b)a;


result:{1232123:21,123:23}


标签: hadoop hive