I was looking for an argmax() type function in HiveQL and found an almost undocumented feature in their bug tracker (https://issues.apache.org/jira/browse/HIVE-1128) which does what I want by taking max() of a struct, which finds the maximum based on the first element and returns the whole struct. (Actually, maybe the max() would break ties by looking at subsequent elements? I don't know.)
Anyway, if I essentially want to select the whole row that contains the max value of some column, I can pack up the row into a struct with the comparison value first, and then extract the maximal struct back to reconstruct the best row. But the syntax is repetitive and ugly. Is there a better way to do it? (I guess a self-join is another option, but seems less elegant and I'd guess less efficient?)
Example table:
id,val,key
1,1,A
1,2,B
1,3,C
1,2,D
2,1,E
2,1,U
2,2,V
2,3,W
2,2,X
2,1,Y
HiveQL:
select
max(struct(val, key, id)).col3 as max_id, -- for illustration, grouping on id anyway
max(struct(val, key, id)).col1 as max_val,
max(struct(val, key, id)).col2 as max_key
from test_argmax
group by id
Result:
max_id,max_val,max_key
1,3,C
2,3,W
One possibility is a nested query:
but you don't seem to be able to select best.* (it thinks that's a table alias) so need to list all the struct members explicitly. It seems like the inline() function - which explodes an array of structs into a table - does a lot of what you want, but not quite: I want to explode a column of structs into a table.