Is there a way to keep the duplicates in a collected set in Hive, or simulate the sort of aggregate collection that Hive provides using some other method? I want to aggregate all of the items in a column that have the same key into an array, with duplicates.
I.E.:
hash_id | num_of_cats
=====================
ad3jkfk 4
ad3jkfk 4
ad3jkfk 2
fkjh43f 1
fkjh43f 8
fkjh43f 8
rjkhd93 7
rjkhd93 4
rjkhd93 7
should return:
hash_agg | cats_aggregate
===========================
ad3jkfk Array<int>(4,4,2)
fkjh43f Array<int>(1,8,8)
rjkhd93 Array<int>(7,4,7)
There is nothing built in, but creating user defined functions, including aggregates, isn't that bad. The only rough part is trying to make them type generic, but here is a collect example.
Then in hive, just issue
add jar Whatever.jar;
andCREATE TEMPORARY FUNCTION collect_all AS 'com.example.CollectAll';
You should them be able to use it as expected.It's worth noting that the order of the elements should be considered undefined, so if you intend to use this to feed information into n_grams you might need to expand it a bit to sort the data as needed.
As of hive 0.13, there is a built-in UDAF called
collect_list()
that achieves this. See here.Here is the exact hive query that does this job (works only in hive > 0.13):
SELECT hash_id, collect_set( num_of_cats) FROM GROUP BY hash_id;
Modified Jeff Mc's code to remove the restriction (presumably inherited from collect_set) that input must be primitive types. This version can collect structs, maps and arrays as well as primitives.
Workaround to collect struct
suppose you have a table
now create another table as
insert query
now create another table at same location as tablename
when you select from tablename_final you will get the desired output
Check out the Brickhouse collect UDAF ( http://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CollectUDAF.java )
It also supports collecting into a map. Brickhouse also contains many useful UDF's not in the standard Hive distribution.