I have following table in hive
user-id, user-name, user-address,clicks,impressions,page-id,page-name
I need to find out top 5 users[user-id,user-name,user-address] by clicks for each page [page-id,page-name]
I understand that we need to first group by [page-id,page-name] and within each group I want to orderby [clicks,impressions] desc and then emit only top 5 users[user-id, user-name, user-address] for each page but I am finding it difficult to construct the query.
How can we do this using HIve UDF ?
As of Hive 0.11, you can do this using Hive's built in rank() function and using simpler semantics using Hive's built-in Analytics and Windowing functions. Sadly, I couldn't find as many examples with these as I would have liked, but they are really, really useful. Using those, both rank() and WhereWithRankCond are built in, so you can just do:
No UDF required, and only one subquery! Also, all of the rank logic is localized.
You can find some more (though not enough for my liking) examples of these functions in this Jira and on this guy's blog.
You can use
each_top_k
function ofhivemall
for an efficient top-k computation on Apache Hive.The
each_top_k
UDTF is very fast when compared to other methods running top-k queries (e.g.,distributed by/rank
) in Hive because it does not hold the whole ranking for the intermediate result.Revised answer, fixing the bug as mentioned by @Himanshu Gahlot
Note that the rank() UDAF is applied to the page-id column, whose new value is used to reset or increase the rank counter (e.g. reset counter for each page-id partition)
You can do it with a rank() UDF described here: http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
Let us say your data looks like following :
Below Query will give you :
Result :
So, for page1 you are getting 6 users, as users with same number of clicks are ranked same.
But, if you are looking for exactly 5 users, and pick randomly in case multiple users fall in same rank. You can use the below query
Result :