可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have following table in hive
user-id, user-name, user-address,clicks,impressions,page-id,page-name
I need to find out top 5 users[user-id,user-name,user-address] by clicks for each page [page-id,page-name]
I understand that we need to first group by [page-id,page-name] and within each group I want to orderby [clicks,impressions] desc and then emit only top 5 users[user-id, user-name, user-address] for each page but I am finding it difficult to construct the query.
How can we do this using HIve UDF ?
回答1:
You can do it with a rank() UDF described here: http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
SELECT page-id, user-id, clicks
FROM (
SELECT page-id, user-id, rank(user-id) as rank, clicks
FROM mytable
DISTRIBUTE BY page-id, user-id
SORT BY page-id, user-id, clicks desc
) a
WHERE rank < 5
ORDER BY page-id, rank
回答2:
Revised answer, fixing the bug as mentioned by @Himanshu Gahlot
SELECT page-id, user-id, clicks
FROM (
SELECT page-id, user-id, rank(page-id) as rank, clicks FROM (
SELECT page-id, user-id, clicks FROM mytable
DISTRIBUTE BY page-id
SORT BY page-id, clicks desc
) a ) b
WHERE rank < 5
ORDER BY page-id, rank
Note that the rank() UDAF is applied to the page-id column, whose new value is used to reset or increase the rank counter (e.g. reset counter for each page-id partition)
回答3:
As of Hive 0.11, you can do this using Hive's built in rank() function and using simpler semantics using Hive's built-in Analytics and Windowing functions. Sadly, I couldn't find as many examples with these as I would have liked, but they are really, really useful. Using those, both rank() and WhereWithRankCond are built in, so you can just do:
SELECT page-id, user-id, clicks
FROM (
SELECT page-id, user-id, rank()
over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks
FROM my table
) ranked_mytable
WHERE ranked_mytable.rank < 5
ORDER BY page-id, rank
No UDF required, and only one subquery! Also, all of the rank logic is localized.
You can find some more (though not enough for my liking) examples of these functions in this Jira and on this guy's blog.
回答4:
You can use each_top_k
function of hivemall
for an efficient top-k computation on Apache Hive.
select
page-id,
user-id,
clicks
from (
select
each_top_k(5, page-id, clicks, page-id, user-id)
as (rank, clicks, page-id, user-id)
from (
select
page-id, user-id, clicks
from
mytable
DISTRIBUTE BY page-id SORT BY page-id
) t1
) t2
order by page-id ASC, clicks DESC
The each_top_k
UDTF is very fast when compared to other methods running top-k queries (e.g., distributed by/rank
) in Hive because it does not hold the whole ranking for the intermediate result.
回答5:
Let us say your data looks like following :
page-id user-id clicks
page1 user1 10
page1 user2 10
page1 user3 9
page1 user4 8
page1 user5 7
page1 user6 7
page1 user7 6
page1 user8 5
page2 user1 20
page2 user2 19
page2 user3 18
Below Query will give you :
SELECT page-id, user-id, clicks, rank
FROM (
SELECT page-id, user-id, rank()
over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks
FROM your_table
) ranked_table
WHERE ranked_table.rank <= 5
Result :
page-id user-id clicks rank
page1 user1 10 1
page1 user2 10 1
page1 user3 9 3
page1 user4 8 4
page1 user5 7 5
page1 user6 7 5
page2 user1 20 1
page2 user2 19 2
page2 user3 18 3
So, for page1 you are getting 6 users, as users with same number of clicks are ranked same.
But, if you are looking for exactly 5 users, and pick randomly in case multiple users fall in same rank. You can use the below query
SELECT page-id, user-id, clicks, rank
FROM (
SELECT page-id, user-id, row_number()
over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks
FROM your_table
) ranked_table
WHERE ranked_table.rank <= 5
Result :
page-id user-id clicks rank
page1 user1 10 1
page1 user2 10 2
page1 user3 9 3
page1 user4 8 4
page1 user5 7 5
page2 user1 20 1
page2 user2 19 2
page2 user3 18 3