I have an Hive table made of user_id and item_id (id of items that have been purchased by the user). I want to get a list of all the users who purchased item 1 but not item 2 and 3.
To do this I wrote the simple query:
SELECT user_id, collect_set(item_id) itemslist FROM mytable
WHERE item_id in (1, 2)
GROUP BY user_id
HAVING -- what should I put here???
As you can see, I don't know how to check whether the array itemslist contains 1 and not 2.
How do you do this? If there is some more efficient way can you please tell me both (or more) methods?
There are some collection functions in Hive `(See collection functions here : https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF ) which can use here.
You can use the
array_contains(Array<T>, value)
function to check if item 1 is present and thesize(Array<T>)
function to make sure the length is 1. If both conditions are satisfied, you will get the desired output.