I have a large table containing a userID column and other user variable columns, and I would like to use Hive to extract a random sample of users based on their userID. Furthermore, some users appear on multiple rows, and if a randomly selected userID appears elsewhere in the table I would like to extract those rows too.
I had a look at the Hive sampling documentation and I see that something like this can be done to extract a 1% sample:
SELECT * FROM source
TABLESAMPLE (1 PERCENT) s;
but I am not sure how to add the constraint that every other row belonging to those sampled userIDs is selected too.
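To make the goal concrete, here is a minimal sketch of one way the constraint might be expressed: sample at the level of distinct userID values rather than rows, then join back to pull in every row for the chosen users. It assumes the table is named source and the key column is userID, as in the query above; the aliases u and sampled are illustrative. This swaps TABLESAMPLE for a rand() filter, so the sample is only approximately 1% of users and changes between runs.

```sql
-- Sketch only: assumes table `source` with key column `userID`.
-- Step 1: reduce to one row per user, then keep each user with probability 0.01.
-- Step 2: LEFT SEMI JOIN returns every row of `source` whose userID was kept.
SELECT s.*
FROM source s
LEFT SEMI JOIN (
    SELECT u.userID
    FROM (SELECT DISTINCT userID FROM source) u
    WHERE rand() <= 0.01          -- rand() returns a double in [0, 1)
) sampled
ON (s.userID = sampled.userID);
```

Because the random filter is applied to distinct userID values, a user is either fully included or fully excluded, regardless of how many rows they have.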