Better HiveQL syntax to explode a column of struct

2019-07-11 21:51发布

问题:

I was looking for an argmax() type function in HiveQL and found an almost undocumented feature in their bug tracker (https://issues.apache.org/jira/browse/HIVE-1128) which does what I want by taking max() of a struct, which finds the maximum based on the first element and returns the whole struct. (Actually, maybe the max() would break ties by looking at subsequent elements? I don't know.)

Anyway, if I essentially want to select the whole row that contains the max value of some column, I can pack up the row into a struct with the comparison value first, and then extract the maximal struct back to reconstruct the best row. But the syntax is repetitive and ugly. Is there a better way to do it? (I guess a self-join is another option, but seems less elegant and I'd guess less efficient?)

Example table:

id,val,key
1,1,A
1,2,B
1,3,C
1,2,D
2,1,E
2,1,U
2,2,V
2,3,W
2,2,X
2,1,Y

HiveQL:

select 
  max(struct(val, key, id)).col3 as max_id,  -- for illustration, grouping on id anyway
  max(struct(val, key, id)).col1 as max_val,
  max(struct(val, key, id)).col2 as max_key
from test_argmax
group by id

Result:

max_id,max_val,max_key
1,3,C
2,3,W

回答1:

One possibility is a nested query:

select
  best.id as id,
  best.val as val,
  best.key as key
from (
  select 
    max(struct(val, key, id)) as best 
  from test_argmax
  group by id
)

but you don't seem to be able to select best.* (it thinks that's a table alias) so need to list all the struct members explicitly. It seems like the inline() function - which explodes an array of structs into a table - does a lot of what you want, but not quite: I want to explode a column of structs into a table.