Random sample table with Hive, but including match

2019-06-13 19:39发布

问题:

I have a large table containing a userID column and other user variable columns, and I would like to use Hive to extract a random sample of users based on their userID. Furthermore, sometimes these users will be on multiple rows and if a randomly selected userID is contained in other parts of the table I would like to extract those rows too.

I had a look at the Hive sampling documentation and I see that something like this can be done to extract a 1% sample:

SELECT * FROM source 
TABLESAMPLE (1 PERCENT) s;

but I am not sure how to add the constraint where I would like all other instances of those 1% userIDs selected too.

回答1:

You can use rand() to split the data randomly and with the proper percent of userid in your category. I recommend rand() because setting the seed to something make the results repeatable.

select c.*
from 
(select userID
, if(rand(5555)<0.1, 'test','train') end as type
    from
    (select userID 
    from mytable 
    group by userID
    ) a
) b
right outer join
(select *
from userID
) c
on a.userid=c.userid
where type='test'
;

This is set up for entity level modeling purposes, which is why I have test and train as types.