How to get array/bag of elements from Hive group b

Posted 2019-04-04 14:43

I want to group by a given field and get, for each group, the values of the other field collected together. Below is an example of what I am trying to achieve:

Imagine a table named 'sample_table' with two columns as below:

I want to write a Hive query that gives the output below:

001 [111, 222, 123]
002 [222, 333]
003 [555]

In Pig, this can be achieved very easily with something like this:

grouped_relation = GROUP sample_table BY F1;

Can somebody please suggest a simple way to do this in Hive? The only approach I can think of is to write a User Defined Function (UDF), but that may be a very time-consuming option.

Tags: sql hadoop hive apache-pig bigdata

2 answers

倾城　Initia

#2 · 2019-04-04 15:11

The built-in aggregate function collect_set (documented here) gets you almost what you want. It would actually work on your example input:

SELECT F1, collect_set(F2)
FROM sample_table
GROUP BY F1

Unfortunately, it also removes duplicate elements, which I imagine isn't your desired behavior. I find it odd that collect_set exists but there is no version that keeps duplicates. Someone else apparently thought the same thing. It looks like the top and second answers there will give you the UDAF you need.
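For readers on a newer release: Hive 0.13.0 and later ship a built-in collect_list aggregate that keeps duplicates, so a custom UDAF may not be needed. A minimal sketch, assuming such a Hive version and the column names F1 and F2 from the question (the alias f2_values is only illustrative):

```sql
-- Assumes Hive 0.13.0 or later, where collect_list is built in.
-- Collects every F2 value per F1 group into an array, duplicates included.
SELECT F1, collect_list(F2) AS f2_values
FROM sample_table
GROUP BY F1;
```

Note that Hive does not guarantee the order of elements inside the resulting array.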


女痞

#3 · 2019-04-04 15:15

collect_set actually works as expected: a set is, by definition, a collection of well-defined and distinct objects, i.e. each object occurs exactly once or not at all within a set.
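One way to see the set semantics in practice is to compare array sizes per group. This is only a sketch: it assumes Hive 0.13.0 or later (for the built-in collect_list, which keeps duplicates) and the F1/F2 columns from the question; the aliases are illustrative.

```sql
-- Assumes Hive 0.13.0+ so that collect_list is available.
-- distinct_count < total_count for any group where F2 repeats.
SELECT F1,
       size(collect_set(F2))  AS distinct_count,  -- duplicates removed
       size(collect_list(F2)) AS total_count      -- duplicates kept
FROM sample_table
GROUP BY F1;
```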

